Modern cryptographic hashes (SHA3, Blake2/3) are designed to be indistinguishable from a random oracle. And even MD5/SHA1/SHA2 should not be distinguishable by a statistical test suite.
So if any of them fails a test, it means that either:
1. A statistical test found a flaw cryptographers have not yet discovered (very unlikely, especially since multiple cryptographic hashes fail)
2. The test suite integrated/implemented the hash function incorrectly. This test suite seems to key the hash functions in an unsupported way, though I'm skeptical that's responsible for the failures.
3. The test suite is flawed and has a significant chance of producing test failures for an ideal hash function.
smhasher does not report any failure for blake3. The one failure for blake2b-256 is not ideal for a hash function, but it's not necessarily evidence that the function fails to look random: `Sparse` generated 50643 16-bit values, hashed them, and found 2 collisions in the high 32 bits of the output. I'm not sure what kind of flaw in the test harness you think could explain that.
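For scale, a quick back-of-the-envelope (mine, not smhasher's) on how surprising that count is for an ideal hash:

```python
# Expected colliding pairs when hashing n inputs into a 32-bit window,
# and how surprising 2 collisions would be under a Poisson approximation.
from math import comb, exp

n = 50643                          # keys generated by the Sparse test above
expected = comb(n, 2) / 2**32      # birthday bound on colliding pairs
p_ge_2 = 1 - exp(-expected) * (1 + expected)

print(f"expected collisions: {expected:.2f}")   # ~0.30
print(f"P(>= 2 collisions):  {p_ge_2:.3f}")     # ~0.037
```

So 2 collisions is roughly a 1-in-27 event for an ideal hash: eyebrow-raising in isolation, but unremarkable once you run hundreds of test/function combinations.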
There could definitely be issues in the integration code that lets the harness call into all these functions. For example, smhasher finds issues with SHA3 for the "PerlinNoise" input sets. That input set hashes small integers in [0, 4096), with seeds in [0, 4096); I'm not convinced the sha3 wrapper does anything useful with the seed here https://github.com/rurban/smhasher/blob/37cffd7b9cdaa2140c53... . I expect something similar is happening with SHA1 and SHA2.
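To make that concrete, here's a toy illustration (hypothetical glue code, not the actual smhasher wrapper) of why a dropped seed would wreck a keys-times-seeds test like PerlinNoise:

```python
import hashlib

def sha3_ignoring_seed(key: bytes, seed: int) -> bytes:
    # Hypothetical bug: the seed is silently dropped.
    return hashlib.sha3_256(key).digest()

# 16 keys x 16 seeds should give 256 distinct hashes; a seed-ignoring
# wrapper gives only 16, and every cross-seed pair is a "collision".
outs = {sha3_ignoring_seed(k.to_bytes(4, "little"), s)
        for k in range(16) for s in range(16)}
print(len(outs))   # 16, not 256
```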
The MD5 row shows no failure; only the variant that truncates to the low 32 bits has failures.
You can read the test harness or the test log (e.g., https://github.com/rurban/smhasher/blob/master/doc/blake2b-2...) and apply your own significance threshold. The statistical tests are nothing special or novel (mostly counting collisions in bitranges of the output, and bias in individual output bits); the interesting part is how the various tests generate interesting sets of inputs. In the end, it's a bit like the PRNG wars: you can always come up with a test that makes a function look bad, but a ton of failures is definitely a bad sign.
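For a flavor of what these tests look like, here's a minimal bit-bias check in Python (my paraphrase of the idea, not smhasher's actual code):

```python
import hashlib

def bit_bias(keys, bits=256):
    counts = [0] * bits
    for k in keys:
        h = int.from_bytes(hashlib.blake2b(k, digest_size=bits // 8).digest(), "big")
        for i in range(bits):
            counts[i] += (h >> i) & 1
    n = len(keys)
    # Each count should be ~ Binomial(n, 1/2); flag bits more than 4
    # standard deviations (sigma = 0.5 * sqrt(n)) away from n/2.
    sigma = 0.5 * n ** 0.5
    return [i for i, c in enumerate(counts) if abs(c - n / 2) > 4 * sigma]

keys = [i.to_bytes(8, "little") for i in range(50_000)]
print(bit_bias(keys))   # expect [] for a hash with no per-bit bias
```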
Considering the number of tests and hash functions, some failed tests will simply be noise. The default config should use a much higher significance threshold; testing across a number of seeds and looking at the combined statistics would be one way of achieving that.
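One concrete way to do that (a sketch of the idea, not something smhasher currently does as far as I know) is to pool per-seed p-values with Fisher's method:

```python
from math import exp, log

def fisher_combined_p(pvalues):
    # Under H0 (ideal hash), -2 * sum(ln p_i) ~ chi-squared with 2k df.
    x = -2 * sum(log(p) for p in pvalues)
    k = len(pvalues)
    # Closed-form survival function of chi-squared with even df 2k:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= (x / 2) / i
        total += term
    return exp(-x / 2) * total

# A single p = 0.04 looks alarming; the same test sitting at p ~ 0.5 on
# other seeds suggests it was noise.
print(fisher_combined_p([0.04, 0.48, 0.61, 0.52]))   # ~0.25, unremarkable
```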
> The MD5 row shows no failure; only the variant that truncates to the low 32 bits has failures.
Truncated MD5 should still behave like an ideal 32-bit hash function: any fixed 32-bit slice of a random oracle's output is itself uniformly random.
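For reference, the truncated variant is presumably something along these lines (an assumption about how smhasher's MD5-32 slices the digest; the byte choice and order are guesses):

```python
import hashlib

def md5_low32(data: bytes) -> int:
    # Hypothetical: take 4 bytes of the full MD5 digest as a 32-bit hash.
    # Which 4 bytes, and their endianness, is an assumption on my part.
    return int.from_bytes(hashlib.md5(data).digest()[:4], "little")

print(hex(md5_low32(b"hello")))   # an ideal 32-bit hash, if MD5 is ideal
```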
> it's a bit like the PRNG wars: you can always come up with a test that makes a function look bad, but a ton of failures is definitely a bad sign.
For a CSPRNG you shouldn't be able to come up with any test that fails more often than you'd statistically expect from truly random numbers (averaging over sufficiently random seeds).