It's not the highest-quality hash function (see the SMHasher benchmarks), but it is fast. A great alternative is XXH3 (https://cyan4973.github.io/xxHash/), which has seen far more usage in practice.
I'm using XXHash3 for verifying data integrity of large (10+MB) blobs. Very fast and appears to work very well in my testing--it's never missed any bit error I've thrown at it.
Aside: when storing hashes, be sure to store the hash type as well so that you can change it later if needed, e.g. "xxh3-[hash value]". RFC-6920 also has things to say about storing hash types and values, although I haven't seen its format in common use.
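A rough sketch of what that looks like, assuming the third-party Python xxhash package (the "xxh3-" prefix is just a convention, not any standard):

    import xxhash  # third-party package; assumed to expose xxh3_64

    def tagged_digest(data: bytes) -> str:
        # Prefix the algorithm name so the hash can be swapped out later.
        return "xxh3-" + xxhash.xxh3_64(data).hexdigest()

    def verify(data: bytes, stored: str) -> bool:
        algo, _, digest = stored.partition("-")
        if algo != "xxh3":
            raise ValueError("unsupported hash type: " + algo)
        return xxhash.xxh3_64(data).hexdigest() == digest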
> be sure to store the hash type as well so that you can change it later if needed
Thanks for sharing this, I'd already been doing this for my own stuff (e.g. foo.txt-xxh32-ea79e094), but it's good to know someone else has thought it through.
I ran into the problem once where someone had named some files foo-fb490c or something similar without any annotation, and when there was a problem, it took a while to figure out they were using truncated sha256 hashes.
If you had made it one section into the analysis, you would have seen that at the time MeowHash made certain cryptographic claims that the author set out to disprove.
The readme has since been updated. I didn't check whether any algorithmic changes were made on top, but the discussion of the analysis on github didn't point to a lot of low-hanging fruit.
It's not useless analysis, because even for non-cryptographic hashes you want every possible output to be roughly equally likely. A hash function which "prefers" certain outputs has a far higher probability of collision.
Don't you think asset planting is an attack against a game's pipeline?
The author of the hash's page claims the hash is not cryptographic but then actually goes on to make security claims about it. People who do not understand cryptography should be careful about making such claims. The author appears to understand this more than your comment demonstrates.
For example, a claim about change detection is a cryptographic claim of detecting preimage attacks. In a threat model, a security professional would determine whether a first preimage or a second preimage attack is what should be guarded in attack scenarios. Then, the professional would help with analysis, determining mitigations, defense in depth, and prioritization of fixing the vulnerabilities exposed by how the hash is used.
A hash cannot be considered standalone. It is the architecture and use-case where the hash's security properties are used to determine what security properties of the application are fulfilled.
So, if the author is correct, which seems to be the case, then meowhash should not be used in a production environment outside of the most simplistic checks. It seems faster for its intended use case to simply check for a single bit difference between two images - no hash required.
Realistically, hashes are used during the development of a game to detect when a file or asset has changed, which then triggers regeneration of assets that depend on it. For some long or slow build steps it is better to rely on a hash changing than on the timestamp changing to trigger a build. It can also be used to fetch pre-built data from where it's cached on a server.
That's why it should be fast and doesn't have to be cryptographic.
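A rough sketch of the hash-triggered rebuild check (hypothetical helper names; the standard library's blake2b just stands in for whatever fast hash the pipeline actually uses):

    import hashlib

    def content_key(path: str) -> str:
        # Hash the file contents; any fast non-cryptographic hash would do,
        # blake2b is just what's in the standard library.
        h = hashlib.blake2b(digest_size=16)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def needs_rebuild(src: str, manifest: dict) -> bool:
        # Rebuild (or re-fetch from the cache server) when the content hash
        # changed, regardless of timestamps.
        return manifest.get(src) != content_key(src)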
Regarding security: if you're at the point of someone malicious creating a file on your internal systems, you've already lost the battle.
> But then you have to store the entire before & after locally?
Yes, there is a difference between comparing two copies (as you say) and integrity (modification detection). In the case of comparing new assets in a pipeline to those that were created earlier, it sounds plausible both copies would be present.
> That's the entire point of using a hash for change detection.
This is called integrity protection. Change detection is the incorrect term to use here. Please see what I referenced earlier for first and second preimage.
What determines whether a hash is "cryptographic"? What would make it suitable for change-detection but not be "cryptographic"? Is the claim here that it would not be suitable for detecting "malicious" changes, but is still suitable for detecting "natural" changes?
A couple of features are that it is hard for an attacker to find a collision, and that it is hard to recover the original data from the hash. Losing either can cause serious security problems, but neither is necessary in something like this, where most image changes are "natural" and you gain speed by weakening those constraints.
Saying it's twice as fast is rather misleading? They can both hash as fast as RAM speed allows anyway. And if it's something in cache I doubt one is significantly better than the other.
I have measured (on a Skylake core) the hash from Daniel Lemire, Owen Kaser, "Faster 64-bit universal hashing using carry-less multiplications" (Journal of Cryptographic Engineering, to appear) at t_cycles = (0.00741 ± 0.00098 cycles/byte) * bytes + (37.43 ± 0.48 cycles), which is roughly 135 B/cycle (675 GB/s at 5 GHz), or over 2x faster than the claim here. Admittedly, that 37-cycle fixed cost is probably quite a bit higher, and I don't know how well the bit mixing compares.
Parenthetically, a time equation (like t_cycles = coef * bytes + fixedCost) is a more convenient way to summarize performance of this kind of thing than the constant short string/long string mumble-mumble. Yes, loop unrolling/etc. can create step-ish functions and a linear regression is pretty approximate (but my stderrs are pretty small above), but even so, it's much less tripping over words.
umash (https://github.com/backtrace-labs/umash) has a similar PH block structure (and similar universality guarantees), but was designed for decent bit mixing (enough to satisfy smhasher, unlike CLHASH, which needs an additional finalizer) with a lower fixed time cost: 22 cycles for a one-byte hash on a 2.5 GHz Xeon 8175M.
I'm not sure how one would use that linear regression. What kind of hardware offers 675 GB/s of memory bandwidth? 140 bytes/cycle is easily more than twice the L2 read bandwidth offered by any COTS chip I'm aware of. There are also warm up effects past the fixed cost of setup and finalizers that slow down hashing for short input. For what range of input sizes (and hot/cold cache state) would you say the regression is a useful model?
I was only doing short < 1024 byte strings in L1 (like URI lengths-ish). I'm well aware no memory system on typical HW can deliver that (EDIT: "at large space scale"). The article itself mentions a similar calculation (but at their slower scale) which is why I included it.
And yes, warm up absolutely matters. My numbers are for the best case path which is among the only easy things to even define, IMO. Worst and average are sooooo much harder/assumption-riddled. Then you argue a lot about said assumptions.
I do the minimum times of many runs in an inner loop and the regression on those min-time results. So, the min-loops will warm up everything like branch predictors, etc. I'd suspect the regression model would hold up to usable L1 cache (e.g. 30 KiB or so). Beyond that, it's mostly just measuring the memory system not the hash function..and I'm not sure what the point of that is. Separation of concerns would suggest L1 perf is most relevant to the "hash calculation part". (IMO, many benchmarks confuse this.)
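If anyone wants to poke at this themselves, here is a rough sketch of the min-loop-plus-regression idea (in Python with wall-clock nanoseconds rather than a cycle-accurate harness, so only illustrative):

    import time
    import numpy as np

    def min_time_ns(fn, data, reps=200):
        # Best of many runs: repetition warms up caches/branch predictors,
        # and taking the minimum filters out scheduler/interrupt noise.
        best = float("inf")
        for _ in range(reps):
            t0 = time.perf_counter_ns()
            fn(data)
            best = min(best, time.perf_counter_ns() - t0)
        return best

    def fit_time_equation(fn, sizes=(16, 32, 64, 128, 256, 512, 1024)):
        # Least-squares fit of t = coef * bytes + fixedCost over small,
        # L1-resident inputs.
        xs = np.array(sizes, dtype=float)
        ys = np.array([min_time_ns(fn, b"x" * n) for n in sizes], dtype=float)
        coef, fixed_cost = np.polyfit(xs, ys, 1)
        return coef, fixed_cost  # ns/byte and fixed overhead in ns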
And, yeah..modern memory systems suck. Simple back of the envelope would suggest some DIMM-oriented workload would be waiting on the DIMMs like 95% of the time, usually much higher on Xeons with their low single core BW (maybe 98%!). The Meow article sort of alludes to that as well with less detail. Single core Xeon BW is always a huge disappointment. Forces you to go (very) parallel just to saturate DIMMs (EDIT: which in turn cannot saturate even a single L1!).
BTW, it sounds like you know all of this (and maybe more) and I don't mean to suggest otherwise, but I thought some numbers/details might assist other passersby.
{ EDIT: Also, the stderr on that slope is ~13%. So, it's possible 135 is really 128 (1.05x lower), the 2 cache line/cycle load limit or something like that. This doesn't mean the 640 GB/s rate number is wrong - it just underscores how mismatched memory systems are at their various scales. }
{ EDIT: and yes, 128/16 = 8x faster than meow, not 2x. Oops. }
1/135 cycles per byte on Skylake is just plain impossible, even if the hash consisted of simply one xor per 32 bytes of input.
The lower bound for CLHASH would be the cost of one carryless multiplication per 16 bytes of input, or in other words 1/16 ~ 0.0625 cycles per byte, since you can only issue one such multiplication per cycle.
Feel free to measure it yourself instead of just speculating. (And as mentioned elsewhere, it is probably 1/128.) { EDIT: and I never measured meow - it could be faster than 1/16 cycle/byte but poorly assessed; just be sure to use a min loop and small data. }
Microbenchmark measurements that are faster than the memory system are frequently a symptom that the compiler optimized most or all of the unit under test into a no-op.
It isn't faster than the L1 memory system which is the most relevant thing to measure to compare hash algos, as already explained. (duplicated here for clarity/flow)
Also, I perhaps could more fully motivate my time equation parenthetical. With such an equation in hand, one can easily compare hash functions with differing trade offs, such as Wang Yi's hash and Farm Hash:
where this final number is how long strings need to be before Wang Yi's is faster than Farm. So, it is a nicely convenient summary. (And this is true even without my fancy statistical error estimates...)
Of course full charts/graphs/yadda-yadda can be even better, but just two summary numbers can also be nice.
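To spell that comparison out with purely made-up coefficients (not the actual Wang Yi / Farm numbers):

    def crossover_bytes(coef_a, fixed_a, coef_b, fixed_b):
        # Solve coef_a*n + fixed_a = coef_b*n + fixed_b for n: the string
        # length where A (higher fixed cost, lower per-byte cost) overtakes B.
        return (fixed_a - fixed_b) / (coef_b - coef_a)

    print(crossover_bytes(coef_a=0.05, fixed_a=40.0, coef_b=0.20, fixed_b=10.0))
    # -> 200.0: below ~200 bytes B's lower fixed cost wins, above it A wins.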
> Because we have to process hundreds of gigabytes of art assets to build game packages, we wanted a fast, non-cryptographic hash for use in change detection and deduplication. We had been using a cryptographic hash (SHA-1), but it was unnecessarily slowing things down.
That's because you were doing a potentially silly thing: hashing every byte.
What does a hash tell you? If two hashes are different, it confirms that the objects are different. If two hashes are the same, they could be the same object or they could be different.
For that purpose, you don't have to hash everything: just hash the areas where objects are likely to differ.
If two 15 gigabyte files are likely to have differences in their first kilobyte, then hashing just the first kilobyte (negligibly fast) will work just as well as hashing 15 gigabytes.
And if two files are of different lengths, then they are not the same, so comparing lengths is an essentially free first check.
For deduplication, a 32-bit CRC could probably be reasonably efficient. No matter what hash you use, you have to verify the collisions. If CRC32 is enough to make false positives rare over your type of data set, it's good enough.
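A rough sketch of that staged approach (hypothetical helpers, with zlib.crc32 standing in for whatever hash you'd actually pick); anything the cheap stages group together still gets verified byte-for-byte:

    import os, zlib, filecmp
    from collections import defaultdict

    def cheap_key(path, sample=1024):
        # File size plus a CRC of the first KB rules out most non-duplicates.
        with open(path, "rb") as f:
            head = f.read(sample)
        return (os.path.getsize(path), zlib.crc32(head))

    def probable_duplicates(paths):
        # Group by the cheap key, then confirm with a full comparison,
        # since collisions (and identical prefixes) are always possible.
        buckets = defaultdict(list)
        for p in paths:
            buckets[cheap_key(p)].append(p)
        for group in buckets.values():
            for other in group[1:]:
                if filecmp.cmp(group[0], other, shallow=False):
                    yield group[0], other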
"just hash the areas where objects are likely to differ"
Correct me if I'm wrong but given these are art assets, wouldn't "determining where the objects are likely to differ" itself be an expensive/hard problem?
Once you need to load any parts of a file into memory you might as well compute the hash for all the bytes that you've loaded into memory. The act of loading the bytes from permanent storage (even NVME SSD) is likely to be many orders of magnitude slower than actually computing the hash. So it makes no sense to just "hash parts of the file" when in all likelihood the entire file will be getting loaded into memory even if you actively read only a few bytes from each 4k block.
It really depends. The early bytes probably contain things like image dimensions which might discriminate a lot - or not at all. They might contain palette-style color information which might discriminate tons - or not at all. Most things with hash functions depend upon distributions of inputs relative to the function, though.
I think it would be accurate to say "Determining where they are likely to differ is not necessarily hard, but it could be." If it happens to be easy then you can radically change the scale of the problem.
File length alone is a weak hash actively maintained by any file system and so available "for free" (or one stat call). And it can work surprisingly well, as the grandparent alludes to.
EDIT: You actually do not need to even open(2) files that are not part of "same size clusters", never mind hash their data. This alone can really squash the problem size over a more naive approach even if you did feel the need to hash the whole file.
EDIT: And in terms of performance, if this data has to actually be read off a drive then the hash need not go any faster than the drive IO which is probably much, much slower than most candidate hash functions. Even PCIe5.0 NVMes are only like 14 GB/s, SATA buses like 0.75 GB/s, etc.
To avoid the hash being entirely oblivious to bytes beyond a small block at the start, we could take additional samples of the file at exponentially increasing offsets:
Hash the first kilobyte.
Hash the 16th kilobyte
Hash the 256th kilobyte
Hash the 4096th kilobyte (4M)
Hash the 65536th kilobyte (64M)
Hash the 1048576th kilobyte (1G)
Then 16G, 256G, ...
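A rough sketch of that sampling, with hashlib's blake2b standing in for whatever fast hash you'd really use, and the file length mixed in so same-prefix files of different sizes still hash differently:

    import hashlib, os

    def sampled_digest(path, block=1024):
        # Hash the first KB, then the 16th, 256th, 4096th, ... kilobyte,
        # plus the file length, instead of every byte.
        size = os.path.getsize(path)
        h = hashlib.blake2b(digest_size=16)
        h.update(size.to_bytes(8, "little"))
        offsets, kb = [0], 16
        while kb * block < size:
            offsets.append(kb * block)
            kb *= 16
        with open(path, "rb") as f:
            for off in offsets:
                f.seek(off)
                h.update(f.read(block))
        return h.hexdigest()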
It might be worth including the last kilobyte. (I think I mentioned that.) If the file format has trailing metadata, that can help, especially if there is any sort of checksum in it.
Sure. I agree various sampling strategies are a good design, as long as they don't add up to much IO. Making that a "user parameter" to be customizable seems ideal. What you suggest may make a fine default - if adapted to varying file sizes somehow. E.g., that Nim dups program I mentioned elsewhere could just take a sequence of slices instead of only one..maybe just ignoring out of bounds slices. { And for size clusters where the size < N, screw it - just hash the whole thing, probably. ;-) }
> Finding sections of the file where data can be different is not only potentially hard but depends on the type of data. So can't be made generic
I agree, and a cousin comment of mine suggests user-specifiable sampling which can always be a full sample in the "hard case". Often, once one has identified true duplicate candidates, a full compare is desirable to not trust any hash, and that full compare uses a lot of IO. That depends on the user, too.
> Comparing files based on size is trivial and I presume people will always do it first before attempting to use a hash or even CRC.
It is unclear whether the meow guys did this, since they seemed to motivate hash performance based on dedup, and they were unlikely to have IO > 10 GB/s in 2018, but it's hard to know for sure.
CRC32 is actually very slow in modern CPUs. So if you can use an algorithm that is both (1) faster and (2) better quality, why not use that algo instead?
I wouldn't categorize it as 'very slow'; at worst it's 50% slower than the 128-bit XXH3 hash (and faster than the 32-bit XXH hash).
Modern Intel processors have the CRC32C instructions, and ARM has both CRC32 and CRC32C. There is a reluctance in some of the hash comparisons to use the hardware CRC implementations for compatibility reasons, but most modern CPUs have them.
The main argument for using a hardware CRC32 (and to some extent a software one too) is simplicity. If your use-case needs decent (but not blinding) performance, then there is something to be said for a hash implementation that can be written in ~10 lines of code.
You still have to handle the possibility of collisions, but any 32-bit hash has that restriction, which is fine for many uses.
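For reference, a bit-at-a-time CRC-32 really does fit in about ten lines (much slower than a table-driven or hardware CRC32C, but it makes the simplicity point):

    import zlib

    def crc32(data: bytes, crc: int = 0) -> int:
        # Bitwise CRC-32 (IEEE, reflected, polynomial 0xEDB88320).
        crc ^= 0xFFFFFFFF
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
        return crc ^ 0xFFFFFFFF

    assert crc32(b"123456789") == zlib.crc32(b"123456789") == 0xCBF43926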
I know +1 is against the rules, but I think this deserves a positive comment. I enjoyed reading it, thanks for taking the time and giving back to the ecosystem.
I wish there was a consensus non-cryptographic hash algorithm. Something that hashes arbitrary data to 128 or 256 bit keys with good randomness, few collisions on typical input data, and universal implementation.
Most programmers I know reach for SHA-1 (or if we're being honest, MD5). But a cryptohash is really wasteful if you don't need the cryptographic properties.
Last I looked at this Murmurhash was the closest to widely used with some folks saying Metrohash was better. But that was years ago. Judging by this discussion xxHash is the new hotness?
You'd want XXH128 for that, right? 128-bit, portable, virtually impossible to collide, and only slightly slower than XXH3 while still way faster than most options.
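And "virtually impossible to collide" is easy to put a number on with the usual birthday approximation (the object count below is made up just for scale):

    # P(any collision) ~= n^2 / 2^(b+1) for n items and a b-bit hash.
    n, bits = 10**12, 128            # a trillion objects, 128-bit digests
    print(n * n / 2 ** (bits + 1))   # ~1.5e-15, i.e. negligible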
> I wish there was a consensus non-cryptographic hash algorithm
I think SipHash serves that role pretty well. It's one of the most secure, if not the most secure, among non-cryptographic hashes, and quite speedy, especially with SIMD implementations. One need only consider alternatives if one wants to trade off a bit of security for even higher speed.
SipHash is more like a crypto hash, right? According to smhasher its performance is somewhere between a non-crypto hash like xxHash or Meow and something like SHA-1.
Although designed for use as a hash function to ensure security, SipHash is fundamentally different from cryptographic hash functions like SHA in that it is only suitable as a message authentication code: a keyed hash function like HMAC. That is, SHA is designed so that it is difficult for an attacker to find two messages X and Y such that SHA(X) = SHA(Y), even though anyone may compute SHA(X). SipHash instead guarantees that, having seen Xi and SipHash(Xi, k), an attacker who does not know the key k cannot find (any information about) k or SipHash(Y, k) for any message Y ∉ {Xi} which they have not seen before.
Consider that the people working for you are experts, but the rest of the world on average is not. If one cannot make a judgement whether <https://eprint.iacr.org/2004/207> weakens SHA-2 for a particular use case or not, then it is safer to assume the worse, and simply use SHA-3 instead.
This is tangential, but your facts about tptacek might be a bit out of date. He's no longer at Latacora; he's now working at fly.io. So I think it's safe to say he has recent experience working with people who, while they're skilled developers, aren't security experts.
The production of seed-independent collisions for various versions of murmurhash (e.g., http://emboss.github.io/blog/2012/12/14/breaking-murmur-hash...) motivated siphash. In general, when there's no positive proof of collision bounds, I would assume (un)lucky inputs can collide much more often than usual, even when you change the seed.
The difference between murmurhash and xxhash in that regard is that anyone can download code to generate murmurhash collisions, while code to make xxhash cry is still unknown/secret.
XXH3 is also a lot faster for long strings, and I believe comparable to murmurhash for short inputs.
> This could be super useful for deduplicating large photo datasets!
That is indeed the use case mentioned in the article. A hash like this is useful for deduplicating large _file_ datasets but for deduplicating images in particular, you really want a perceptual hash. Because two images files can _look_ identical but have slightly (or wildly) different byte streams. The trade-off is that perceptual hashes are notorious for false positives in large datasets.
Yep, just happy someone was looking into it, I currently use xxHash for this.
Perceptual hashing for photos isn't great in my opinion; for example, I have many raw photos that look very similar, where the only difference may be a slightly higher shutter speed or a correction of the camera angle. I'm guessing many perceptual hashes would mark these as duplicates. Maybe that's what many want, but it's not what I want personally.
Came here to say the same thing :-)
Anyone know of free(ish) software based on a very fast hashing algorithm for checking large numbers of images?
My intention is to check for bit-rot and other unforeseen data corruptions between backup disks, I assume many others must have a similar use case.
Cryptographic hashes mostly look at collisions for the whole output (or precisely truncated outputs). There can still be patterns or clumping in short ranges of the output bits for certain input families, as long as the full outputs differ.
For example, the identity function is extremely resistant to collisions. However, it doesn't mix the input bits, so it's not a great hash function.
Modern cryptographic hashes (SHA3, Blake2/3) are designed to be indistinguishable from a random oracle. And even MD5/SHA1/SHA2 should not be distinguishable by a statistical test suite.
So if any of them fails a test, it means that either:
1. A statistical test found a flaw cryptographers did not yet discover (very unlikely, especially since multiple cryptographic hashes fail)
2. The test suite integrated/implemented the hash function incorrectly. This test suite seems to key the hash functions in an unsupported way, though I'm skeptical that's responsible for the failures.
3. The test suite is flawed and has a significant chance of producing test failures for an ideal hash function.
smhasher does not report any failure for blake3. The one failure for blake2b 256 is not ideal for a hash function, but not necessarily evidence that the function doesn't look like a random value: `Sparse` generated 50643 16-bit values, hashed them, and found 2 collisions in the high 32 bits of the output. I'm not sure what kind of flaw in the test harness you think can explain that.
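For scale, the birthday expectation for that test works out like this (my own arithmetic, not smhasher output):

    from math import exp
    n, buckets = 50643, 2 ** 32
    expected = n * (n - 1) / 2 / buckets        # ~0.30 expected collisions
    print(1 - exp(-expected) * (1 + expected))  # P(>=2) ~ 0.037 under Poisson

So two collisions is a mildly unlucky outcome for an ideal function, but nothing conclusive on its own.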
There could definitely be issues in the integration code that lets the harness call into all these functions. For example, smhasher finds issues with SHA3 for the "PerlinNoise" input sets. That input set hashes small integers in [0, 4096), with seeds in [0, 4096); I'm not convinced the sha3 wrapper does anything useful with the seed here https://github.com/rurban/smhasher/blob/37cffd7b9cdaa2140c53... . I expect something similar is happening with SHA1 and SHA2.
The MD5 row shows no failure; only the function that truncates to the low 32 bit has failures.
You can read the test harness or the test log (e.g., https://github.com/rurban/smhasher/blob/master/doc/blake2b-2...) and apply your own significance threshold. The statistical tests are nothing special or novel (counting collisions in bit ranges of the output, and bias in individual bits, mostly); the interesting part is how the various tests generate interesting sets of inputs. In the end, it's a bit like the PRNG wars: you can always come up with a test that makes a function look bad, but a ton of failures is definitely a bad sign.
Considering the number of tests and hash functions, the failed tests will simply be noise. The default config should use a much higher significance threshold, testing for a number of seeds and looking at the combined statistics would be one way of achieving that.
> The MD5 row shows no failure; only the function that truncates to the low 32 bit has failures.
Truncated MD5 should behave like an ideal 32-bit hash function.
> it's a bit like the PRNG wars: you can always come up with a test that makes a function look bad, but a ton of failure is definitely a bad sign.
For a CSPRNG you shouldn't be able to come up with any test that will fail more often than you'd statistically expect from truly random numbers (averaging over random enough seeds).
This is great! Many applications (like dedupe) don't need full crypto guarantees.
If you need something fast, and crypto secure, I recommend checking out Blake3/b3sum. I'm just learning about XXH3 in this thread so I cannot comment on how it stacks up but I love b3sum for fast file hashing.
> The Meow hash is a high-speed hash function named after the character Meow in Meow the Infinite.
As someone who doesn't watch cartoons, I'm constantly surprised at the influence cartoons and anime are having on modern tech, especially with regard to computer science.
I guess you should avoid using stuff like a Python interpreter, a Squeak VM, SOCKS proxies or ChaCha encryption too.
If your team dismisses good solutions because it has "meow" in the name, you probably need a better team. I would recommend that they alias their `cat` command as "concatenate" in order to prevent further ridicule.
It's a pretty small app for some researchers. The front end is Angular, the backend is an API in Bottle. At most, there are 10 people generating data and about 40 people reading data. Most of the time there are 10 users uploading and 5-10 users reading. The readers are external partners. Hashid is to obfuscate IDs we have in another system so these external people cannot tie the data back, due to privacy issues. It generates about 1000 rows of data across 4 PostgreSQL tables in a given day.
I like how the algorithm has been well supported across languages. We have no problems with it on the Python side. An analyst runs monthly reports using R against the data (so about 30k records pulled). Numbers are summarized and stats are generated. The hashing isn't a bottleneck at all.
As noted in the post title, the original article is from 2018. I had either forgotten about Meow hash or just never knew about it, so I appreciate the post, but I was curious about the timing of the post.
cmuratori mentioned meowhash in passing in a recent (i.e. few days ago) Refterm or optimisation video. It is possible that user DeathArrow watched this, too, and decided to submit the project homepage to HN.
Due to recent discoveries by Peter Schmidt-Nielsen, we have decided to reclassify Meow hash 0.5/calico from level 3 to level 1. This means that we recommend not to use this hash for message authentication codes, or for hash tables in scenarios where collision induced denial-of-service attacks are a concern.
For level 3/MAC capabilities consider migrating to SipHash. If the performance of SipHash is not satisfying, continuing to use Meow 0.5 for hash tables is better than migrating to another fast hash.