It's pretty easy to audit a couple lines of unsafe in a library.
It is not easy to do the same with a C library where it's basically prone to overflow issues everywhere.
These are vastly different issues. Yes, we should totally be strict on unsafe code. No, it is not the end of the world when the stdlib hashmap uses unsafe code. Unsafe is designed exactly for this purpose, dealing with the innards of safe abstractions. It does it well, and the hashmap code is pretty ok here.
(Sure, if it has a design which could be done in safe Rust, it should, but I suspect with the way the robin hood stuff works it might not. this has been on my list of things to do when I get more time, primarily to just learn about the hashmap impl, but also to improve it if possible)
The main reason it requires unsafe is for memory layout optimizations. Logically, a hash table is an array of buckets, each of which is either a (hash, key, value) triplet or empty. The naive representation would be Vec<Option<(usize, K, V)>>, but using Option would significantly increase the size of each bucket: it's only a one-byte tag to differentiate between Some and None, but alignment constraints would round the overhead up to at least the alignment of usize, typically 8. Instead, HashMap stores just (usize, K, V), and considers a bucket empty whenever 0 is stored in the hash slot; if 0 ever comes up as an actual hash value, it's changed to a different value before being stored.
To the extent I've described it, that might be possible to accomplish in Rust without unsafe code, using a safe abstraction around NonZero (an unsafe type which the compiler assumes will never be zero, allowing it to automatically use a zero value as a marker if the type is found in an enum, such as Option). However, HashMap goes a step further: it actually stores all the hashes contiguously in one array, and the (K, V) pairs in another. The nth index in the hashes array and in the (K, V) array together represent a single bucket, but storing them separately makes HashMap's typical access patterns somewhat nicer on the CPU's cache. However, this means that slots in the (K, V) array can contain either valid data (of arbitrary types, which might contain pointers etc.) or garbage, depending on a value stored somewhere completely different. There's no good 'automatic' way to make that safe.
I think the use of unsafe code here is fine - it's not that hard to audit, and if you don't audit you're in trouble anyway (hash table DoS is also a security risk) - but in the long term, it would be very interesting if Rust could integrate an optional theorem prover, with the ability to prove arbitrarily complex code safe, given sufficient annotations...
> It's pretty easy to audit a couple lines of unsafe in a library.
You have to audit more than just the lines within the unsafe block. For example, I discovered a buffer overflow in a Rust library this week that was caused by an integer overflow outside the unsafe block[1]. The unsafe code itself was written correctly.
Sounds to me like the unsafe block had the wrong scope. If safe Rust called a function with a buffer, and that buffer was too small, and that function internally used unsafe code to write to the buffer, then that function is leaking the unsafety past the `unsafe {}` scope, which is incorrect. So that function should be marked as `unsafe`, and the calling code (which calculates the buffer) is then responsible for ensuring that it's invoking the unsafe function with an appropriately-sized buffer.
Not necessarily. This isn't a question with a single answer within the community; there are reasons to tightly scope unsafe, and there are reasons to scope it to the full extent of unsafe-affecting invariants.
Scope your unsafe however you want, but check the invariants when auditing, and make sure they don't escape the module.
It is not easy to do the same with a C library where it's basically prone to overflow issues everywhere.
These are vastly different issues. Yes, we should totally be strict on unsafe code. No, it is not the end of the world when the stdlib hashmap uses unsafe code. Unsafe is designed exactly for this purpose, dealing with the innards of safe abstractions. It does it well, and the hashmap code is pretty ok here.
(Sure, if it has a design which could be done in safe Rust, it should, but I suspect with the way the robin hood stuff works it might not. this has been on my list of things to do when I get more time, primarily to just learn about the hashmap impl, but also to improve it if possible)