It's not the highest-quality hash function (see the SMHasher benchmarks), but it is fast. A great alternative is XXH3 (https://cyan4973.github.io/xxHash/), which has seen far more usage in practice.
I'm using XXHash3 for verifying data integrity of large (10+MB) blobs. Very fast and appears to work very well in my testing--it's never missed any bit error I've thrown at it.
Aside: when storing hashes, be sure to store the hash type as well so that you can change it later if needed, e.g. "xxh3-[hash value]". RFC-6920 also has things to say about storing hash types and values, although I haven't seen its format in common use.
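A rough sketch of what that looks like, assuming the third-party Python xxhash package (the "xxh3-" prefix is just a convention, not any standard):

    import xxhash  # third-party package; assumed to expose xxh3_64

    def tagged_digest(data: bytes) -> str:
        # Prefix the algorithm name so the hash can be swapped out later.
        return "xxh3-" + xxhash.xxh3_64(data).hexdigest()

    def verify(data: bytes, stored: str) -> bool:
        algo, _, digest = stored.partition("-")
        if algo != "xxh3":
            raise ValueError("unsupported hash type: " + algo)
        return xxhash.xxh3_64(data).hexdigest() == digest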
> be sure to store the hash type as well so that you can change it later if needed
Thanks for sharing this, I'd already been doing this for my own stuff (e.g. foo.txt-xxh32-ea79e094), but it's good to know someone else has thought it through.
I ran into the problem once where someone had named some files foo-fb490c or something similar without any annotation, and when there was a problem, it took a while to figure out they were using truncated sha256 hashes.
If you had made it one section into the analysis, you would have seen that at the time MeowHash made certain cryptographic claims that the author set out to disprove.
The readme has since been updated. I didn't check whether any algorithmic changes were made on top, but the discussion of the analysis on github didn't point to a lot of low-hanging fruit.
It's not useless analysis, because even for non-cryptographic hashes you want every possible output to be roughly equally likely. A hash function which "prefers" certain outputs has a far higher probability of collision.
Don't you think asset planting is an attack against a game's pipeline?
The author of the hash's page claims the hash is not cryptographic but then actually goes on to make security claims about it. People who do not understand cryptography should be careful about making such claims. The author appears to understand this more than your comment demonstrates.
For example, a claim about change detection is a cryptographic claim of detecting preimage attacks. In a threat model, a security professional would determine whether a first preimage or a second preimage attack is what should be guarded in attack scenarios. Then, the professional would help with analysis, determining mitigations, defense in depth, and prioritization of fixing the vulnerabilities exposed by how the hash is used.
A hash cannot be considered standalone. It is the architecture and use-case where the hash's security properties are used to determine what security properties of the application are fulfilled.
So, if the author is correct, which seems to be the case, then meowhash should not be used in a production environment outside of the most simplistic checks. It seems faster for its intended use case to simply check for a single bit difference between two images - no hash required.
Realistically, hashes are used during the development of a game to detect when a file or asset has changed, which then triggers regeneration of assets that depend on it. For some long or slow build steps it is better to rely on a hash changing than on the timestamp changing to trigger a build. It can also be used to fetch pre-built data from where it's cached on a server.
That's why it should be fast and doesn't have to be cryptographic.
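A rough sketch of the hash-triggered rebuild check (hypothetical helper names; the standard library's blake2b just stands in for whatever fast hash the pipeline actually uses):

    import hashlib

    def content_key(path: str) -> str:
        # Hash the file contents; any fast non-cryptographic hash would do,
        # blake2b is just what's in the standard library.
        h = hashlib.blake2b(digest_size=16)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def needs_rebuild(src: str, manifest: dict) -> bool:
        # Rebuild (or re-fetch from the cache server) when the content hash
        # changed, regardless of timestamps.
        return manifest.get(src) != content_key(src)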
Regarding security: if you're at the point of someone malicious creating a file on your internal systems, you've already lost the battle.
> But then you have to store the entire before & after locally?
Yes, there is a difference between comparing two copies (as you say) and integrity (modification detection). In the case of comparing new assets in a pipeline to those that were created earlier, it sounds plausible both copies would be present.
> That's the entire point of using a hash for change detection.
This is called integrity protection. Change detection is the incorrect term to use here. Please see what I referenced earlier for first and second preimage.
What determines whether a hash is "cryptographic"? What would make it suitable for change-detection but not be "cryptographic"? Is the claim here that it would not be suitable for detecting "malicious" changes, but is still suitable for detecting "natural" changes?
A couple of features are that it is hard for an attacker to find a collision, and that it is hard to recover the original data from the hash. Losing either can cause serious security problems, but neither is necessary in something like this, where most image changes are "natural" and you gain speed by weakening those constraints.
Saying it's twice as fast is rather misleading? They can both hash as fast as RAM speed allows anyway. And if it's something in cache I doubt one is significantly better than the other.
I have measured (on a Skylake core) the hash from Daniel Lemire, Owen Kaser, "Faster 64-bit universal hashing using carry-less multiplications" (Journal of Cryptographic Engineering, to appear) at t_cycles = (0.00741 ± 0.00098 cycles/byte) * bytes + (37.43 ± 0.48 cycles), which is roughly 135 B/cycle (675 GB/s at 5 GHz), or over 2x faster than the claim here. Admittedly, that 37-cycle fixed cost is probably quite a bit higher, and I don't know how well the bit mixing compares.
Parenthetically, a time equation (like t_cycles = coef * bytes + fixedCost) is a more convenient way to summarize performance of this kind of thing than the constant short string/long string mumble-mumble. Yes, loop unrolling/etc. can create step-ish functions and a linear regression is pretty approximate (but my stderrs are pretty small above), but even so, it's much less tripping over words.
umash (https://github.com/backtrace-labs/umash) has a similar PH block structure (and similar universality guarantees), but was designed for decent bit mixing (enough to satisfy smhasher, unlike CLHASH, which needs an additional finalizer) with a lower fixed time cost: 22 cycles for a one-byte hash on a 2.5 GHz Xeon 8175M.
I'm not sure how one would use that linear regression. What kind of hardware offers 675 GB/s of memory bandwidth? 140 bytes/cycle is easily more than twice the L2 read bandwidth offered by any COTS chip I'm aware of. There are also warm up effects past the fixed cost of setup and finalizers that slow down hashing for short input. For what range of input sizes (and hot/cold cache state) would you say the regression is a useful model?
I was only doing short < 1024 byte strings in L1 (like URI lengths-ish). I'm well aware no memory system on typical HW can deliver that (EDIT: "at large space scale"). The article itself mentions a similar calculation (but at their slower scale) which is why I included it.
And yes, warm up absolutely matters. My numbers are for the best case path which is among the only easy things to even define, IMO. Worst and average are sooooo much harder/assumption-riddled. Then you argue a lot about said assumptions.
I do the minimum times of many runs in an inner loop and the regression on those min-time results. So, the min-loops will warm up everything like branch predictors, etc. I'd suspect the regression model would hold up to usable L1 cache (e.g. 30 KiB or so). Beyond that, it's mostly just measuring the memory system not the hash function..and I'm not sure what the point of that is. Separation of concerns would suggest L1 perf is most relevant to the "hash calculation part". (IMO, many benchmarks confuse this.)
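If anyone wants to poke at this themselves, here is a rough sketch of the min-loop-plus-regression idea (in Python with wall-clock nanoseconds rather than a cycle-accurate harness, so only illustrative):

    import time
    import numpy as np

    def min_time_ns(fn, data, reps=200):
        # Best of many runs: repetition warms up caches/branch predictors,
        # and taking the minimum filters out scheduler/interrupt noise.
        best = float("inf")
        for _ in range(reps):
            t0 = time.perf_counter_ns()
            fn(data)
            best = min(best, time.perf_counter_ns() - t0)
        return best

    def fit_time_equation(fn, sizes=(16, 32, 64, 128, 256, 512, 1024)):
        # Least-squares fit of t = coef * bytes + fixedCost over small,
        # L1-resident inputs.
        xs = np.array(sizes, dtype=float)
        ys = np.array([min_time_ns(fn, b"x" * n) for n in sizes], dtype=float)
        coef, fixed_cost = np.polyfit(xs, ys, 1)
        return coef, fixed_cost  # ns/byte and fixed overhead in ns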
And, yeah..modern memory systems suck. Simple back of the envelope would suggest some DIMM-oriented workload would be waiting on the DIMMs like 95% of the time, usually much higher on Xeons with their low single core BW (maybe 98%!). The Meow article sort of alludes to that as well with less detail. Single core Xeon BW is always a huge disappointment. Forces you to go (very) parallel just to saturate DIMMs (EDIT: which in turn cannot saturate even a single L1!).
BTW, it sounds like you know all of this (and maybe more) and I don't mean to suggest otherwise, but I thought some numbers/details might assist other passersby.
{ EDIT: Also, the stderr on that slope is ~13%. So, it's possible 135 is really 128 (1.05x lower), the 2 cache line/cycle load limit or something like that. This doesn't mean the 640 GB/s rate number is wrong - it just underscores how mismatched memory systems are at their various scales. }
{ EDIT: and yes, 128/16 = 8x faster than meow, not 2x. Oops. }
1/135 cycles per byte on Skylake is just plain impossible, even if the hash consisted of simply one xor per 32 bytes of input.
The lower bound for CLHASH would be the cost of one carryless multiplication per 16 bytes of input, or in other words 1/16 ~ 0.0625 cycles per byte, since you can only issue one such multiplication per cycle.
Feel free to measure it yourself instead of just speculating. (And as mentioned elsewhere, it is probably 1/128.) { EDIT: and I never measured meow - it could be faster than 1/16 cycle/byte but poorly assessed; just be sure to use a min loop and small data. }
Microbenchmark measurements that are faster than the memory system are frequently a symptom that the compiler optimized most or all of the unit under test into a no-op.
It isn't faster than the L1 memory system which is the most relevant thing to measure to compare hash algos, as already explained. (duplicated here for clarity/flow)
Also, I perhaps could more fully motivate my time equation parenthetical. With such an equation in hand, one can easily compare hash functions with differing trade offs, such as Wang Yi's hash and Farm Hash:
where this final number is how long strings need to be before Wang Yi's is faster than Farm. So, it is a nicely convenient summary. (And this is true even without my fancy statistical error estimates...)
Of course full charts/graphs/yadda-yadda can be even better, but just two summary numbers can also be nice.
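To spell that comparison out with purely made-up coefficients (not the actual Wang Yi / Farm numbers):

    def crossover_bytes(coef_a, fixed_a, coef_b, fixed_b):
        # Solve coef_a*n + fixed_a = coef_b*n + fixed_b for n: the string
        # length where A (higher fixed cost, lower per-byte cost) overtakes B.
        return (fixed_a - fixed_b) / (coef_b - coef_a)

    print(crossover_bytes(coef_a=0.05, fixed_a=40.0, coef_b=0.20, fixed_b=10.0))
    # -> 200.0: below ~200 bytes B's lower fixed cost wins, above it A wins.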
> Because we have to process hundreds of gigabytes of art assets to build game packages, we wanted a fast, non-cryptographic hash for use in change detection and deduplication. We had been using a cryptographic hash (SHA-1), but it was unnecessarily slowing things down.
That's because you were doing a potentially silly thing: hashing every byte.
What does a hash tell you? If two hashes are different, it confirms that the objects are different. If two hashes are the same, they could be the same object or they could be different.
For that purpose, you don't have to hash everything: just hash the areas where objects are likely to differ.
If two 15 gigabyte files are likely to have differences in their first kilobyte, then hashing just the first kilobyte (negligibly fast) will work just as well as hashing 15 gigabytes.
And if two files are of different lengths, then they are not the same, so comparing lengths is an essentially free first check.
For deduplication, a 32-bit CRC could probably be reasonably efficient. No matter what hash you use, you have to verify the collisions. If CRC32 is enough to make false positives rare over your type of data set, it's good enough.
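A rough sketch of that staged approach (hypothetical helpers, with zlib.crc32 standing in for whatever hash you'd actually pick); anything the cheap stages group together still gets verified byte-for-byte:

    import os, zlib, filecmp
    from collections import defaultdict

    def cheap_key(path, sample=1024):
        # File size plus a CRC of the first KB rules out most non-duplicates.
        with open(path, "rb") as f:
            head = f.read(sample)
        return (os.path.getsize(path), zlib.crc32(head))

    def probable_duplicates(paths):
        # Group by the cheap key, then confirm with a full comparison,
        # since collisions (and identical prefixes) are always possible.
        buckets = defaultdict(list)
        for p in paths:
            buckets[cheap_key(p)].append(p)
        for group in buckets.values():
            for other in group[1:]:
                if filecmp.cmp(group[0], other, shallow=False):
                    yield group[0], other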
"just hash the areas where objects are likely to differ"
Correct me if I'm wrong but given these are art assets, wouldn't "determining where the objects are likely to differ" itself be an expensive/hard problem?
Once you need to load any parts of a file into memory you might as well compute the hash for all the bytes that you've loaded into memory. The act of loading the bytes from permanent storage (even NVME SSD) is likely to be many orders of magnitude slower than actually computing the hash. So it makes no sense to just "hash parts of the file" when in all likelihood the entire file will be getting loaded into memory even if you actively read only a few bytes from each 4k block.
It really depends. The early bytes probably contain things like image dimensions which might discriminate a lot - or not at all. They might contain palette-style color information which might discriminate tons - or not at all. Most things with hash functions depend upon distributions of inputs relative to the function, though.
I think it would be accurate to say "Determining where they are likely to differ is not necessarily hard, but it could be." If it happens to be easy then you can radically change the scale of the problem.
File length alone is a weak hash actively maintained by any file system and so available "for free" (or one stat call). And it can work surprisingly well, as the grandparent alludes to.
EDIT: You actually do not need to even open(2) files that are not part of "same size clusters", never mind hash their data. This alone can really squash the problem size over a more naive approach even if you did feel the need to hash the whole file.
EDIT: And in terms of performance, if this data has to actually be read off a drive then the hash need not go any faster than the drive IO which is probably much, much slower than most candidate hash functions. Even PCIe5.0 NVMes are only like 14 GB/s, SATA buses like 0.75 GB/s, etc.
To avoid the hash being entirely oblivious to bytes beyond a small block at the start, we could take additional samples of the file at exponentially increasing offsets:
Hash the first kilobyte.
Hash the 16th kilobyte
Hash the 256th kilobyte
Hash the 4096th kilobyte (4M)
Hash the 65536th kilobyte (64M)
Hash the 1048576th kilobyte (1G)
Then 16G, 256G, ...
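A rough sketch of that sampling, with hashlib's blake2b standing in for whatever fast hash you'd really use, and the file length mixed in so same-prefix files of different sizes still hash differently:

    import hashlib, os

    def sampled_digest(path, block=1024):
        # Hash the first KB, then the 16th, 256th, 4096th, ... kilobyte,
        # plus the file length, instead of every byte.
        size = os.path.getsize(path)
        h = hashlib.blake2b(digest_size=16)
        h.update(size.to_bytes(8, "little"))
        offsets, kb = [0], 16
        while kb * block < size:
            offsets.append(kb * block)
            kb *= 16
        with open(path, "rb") as f:
            for off in offsets:
                f.seek(off)
                h.update(f.read(block))
        return h.hexdigest()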
It might be worth including the last kilobyte. (I think I mentioned that.) If the file format has trailing metadata, that can help, especially if there is any sort of checksum in it.
Sure. I agree various sampling strategies are a good design, as long as they don't add up to much IO. Making that a "user parameter" to be customizable seems ideal. What you suggest may make a fine default - if adapted to varying file sizes somehow. E.g., that Nim dups program I mentioned elsewhere could just take a sequence of slices instead of only one..maybe just ignoring out of bounds slices. { And for size clusters where the size < N, screw it - just hash the whole thing, probably. ;-) }
> Finding sections of the file where data can be different is not only potentially hard but depends on the type of data. So can't be made generic
I agree, and a cousin comment of mine suggests user-specifiable sampling which can always be a full sample in the "hard case". Often, once one has identified true duplicate candidates, a full compare is desirable to not trust any hash, and that full compare uses a lot of IO. That depends on the user, too.
> Comparing files based on size is trivial and I presume people will always do it first before attempting to use a hash or even CRC.
It is unclear whether the meow guys did this, since they seemed to motivate hash performance based on dedup, and they were unlikely to have IO > 10 GB/s in 2018, but it's hard to know for sure.
CRC32 is actually very slow in modern CPUs. So if you can use an algorithm that is both (1) faster and (2) better quality, why not use that algo instead?
I wouldn't categorize it as 'very slow'; at worst it's 50% slower than the 128-bit XXH3 hash (and faster than the 32-bit XXH hash).
Modern Intel processors have the CRC32C instructions, and ARM has both CRC32 and CRC32C. There is a reluctance in some of the hash comparisons to use the hardware CRC implementations for compatibility reasons, but most modern CPUs have them.
The main argument for using a hardware CRC32 (and to some extent a software one too) is simplicity. If your use-case needs decent (but not blinding) performance, then there is something to be said for a hash implementation that can be written in ~10 lines of code.
You still have to handle the possibility of collisions, but any 32-bit hash has that restriction, which is fine for many uses.
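For reference, a bit-at-a-time CRC-32 really does fit in about ten lines (much slower than a table-driven or hardware CRC32C, but it makes the simplicity point):

    import zlib

    def crc32(data: bytes, crc: int = 0) -> int:
        # Bitwise CRC-32 (IEEE, reflected, polynomial 0xEDB88320).
        crc ^= 0xFFFFFFFF
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
        return crc ^ 0xFFFFFFFF

    assert crc32(b"123456789") == zlib.crc32(b"123456789") == 0xCBF43926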
I know +1 is against the rules, but I think this deserves a positive comment. I enjoyed reading it, thanks for taking the time and giving back to the ecosystem.
I wish there was a consensus non-cryptographic hash algorithm. Something that hashes arbitrary data to 128 or 256 bit keys with good randomness, few collisions on typical input data, and universal implementation.
Most programmers I know reach for SHA-1 (or if we're being honest, MD5). But a cryptohash is really wasteful if you don't need the cryptographic properties.
Last I looked at this Murmurhash was the closest to widely used with some folks saying Metrohash was better. But that was years ago. Judging by this discussion xxHash is the new hotness?
You'd want XXH128 for that, right? 128-bit, portable, virtually impossible to collide, and only slightly slower than XXH3 while still way faster than most options.
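And "virtually impossible to collide" is easy to put a number on with the usual birthday approximation (the object count below is made up just for scale):

    # P(any collision) ~= n^2 / 2^(b+1) for n items and a b-bit hash.
    n, bits = 10**12, 128            # a trillion objects, 128-bit digests
    print(n * n / 2 ** (bits + 1))   # ~1.5e-15, i.e. negligible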
> I wish there was a consensus non-cryptographic hash algorithm
I think SipHash serves that role pretty well. It's one of the most secure, if not the most secure, among non-cryptographic hashes, and quite speedy, especially with SIMD implementations. One need only consider alternatives if one wants to trade off a bit of security for even higher speed.
SipHash is more like a crypto hash, right? According to smhasher its performance is somewhere between a non-crypto hash like xxHash or Meow and something like SHA-1.
Although designed for use as a hash function to ensure security, SipHash is fundamentally different from cryptographic hash functions like SHA in that it is only suitable as a message authentication code: a keyed hash function like HMAC. That is, SHA is designed so that it is difficult for an attacker to find two messages X and Y such that SHA(X) = SHA(Y), even though anyone may compute SHA(X). SipHash instead guarantees that, having seen Xi and SipHash(Xi, k), an attacker who does not know the key k cannot find (any information about) k or SipHash(Y, k) for any message Y ∉ {Xi} which they have not seen before.
Consider that the people working for you are experts, but the rest of the world on average is not. If one cannot make a judgement whether <https://eprint.iacr.org/2004/207> weakens SHA-2 for a particular use case or not, then it is safer to assume the worse, and simply use SHA-3 instead.
This is tangential, but your facts about tptacek might be a bit out of date. He's no longer at Latacora; he's now working at fly.io. So I think it's safe to say he has recent experience working with people who, while they're skilled developers, aren't security experts.
The production of seed-independent collisions for various versions of murmurhash (e.g., http://emboss.github.io/blog/2012/12/14/breaking-murmur-hash...) motivated siphash. In general, when there's no positive proof of collision bounds, I would assume (un)lucky inputs can collide much more often than usual, even when you change the seed.
The difference between murmurhash and xxhash in that regard is that anyone can download code to generate murmurhash collisions, while code to make xxhash cry is still unknown/secret.
XXH3 is also a lot faster for long strings, and I believe comparable to murmurhash for short inputs.
> This could be super useful for deduplicating large photo datasets!
That is indeed the use case mentioned in the article. A hash like this is useful for deduplicating large _file_ datasets but for deduplicating images in particular, you really want a perceptual hash. Because two images files can _look_ identical but have slightly (or wildly) different byte streams. The trade-off is that perceptual hashes are notorious for false positives in large datasets.
Yep, just happy someone was looking into it, I currently use xxHash for this.
Perceptual hashing for photos isn't great in my opinion; for example, I have many raw photos that look very similar, where the only difference may be a slightly higher shutter speed or a correction of the camera angle. I'm guessing many perceptual hashes would mark these as duplicates. Maybe that's what many want, but it's not what I want personally.
Came here to say the same thing :-)
Anyone know of free(ish) software based on a very fast hashing algorithm for checking large numbers of images?
My intention is to check for bit-rot and other unforeseen data corruptions between backup disks, I assume many others must have a similar use case.
Cryptographic hashes mostly look at collisions for the whole output (or precisely truncated outputs). There can still be patterns or clumping in short ranges of the output bits for certain input families, as long as the full outputs differ.
For example, the identity function is extremely resistant to collisions. However, it doesn't mix the input bits, so it's not a great hash function.
Modern cryptographic hashes (SHA3, Blake2/3) are designed to be indistinguishable from a random oracle. And even MD5/SHA1/SHA2 should not be distinguishable by a statistical test suite.
So if any of them fails a test, it means that either:
1. A statistical test found a flaw cryptographers did not yet discover (very unlikely, especially since multiple cryptographic hashes fail)
2. The test suite integrated/implemented the hash function incorrectly. This test suite seems to key the hash functions in an unsupported way, though I'm skeptical that's responsible for the failures.
3. The test suite is flawed and has a significant chance of producing test failures for an ideal hash function.
smhasher does not report any failure for blake3. The one failure for blake2b 256 is not ideal for a hash function, but not necessarily evidence that the function doesn't look like a random value: `Sparse` generated 50643 16-bit values, hashed them, and found 2 collisions in the high 32 bits of the output. I'm not sure what kind of flaw in the test harness you think can explain that.
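For scale, the birthday expectation for that test works out like this (my own arithmetic, not smhasher output):

    from math import exp
    n, buckets = 50643, 2 ** 32
    expected = n * (n - 1) / 2 / buckets        # ~0.30 expected collisions
    print(1 - exp(-expected) * (1 + expected))  # P(>=2) ~ 0.037 under Poisson

So two collisions is a mildly unlucky outcome for an ideal function, but nothing conclusive on its own.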
There could definitely be issues in the integration code that lets the harness call into all these functions. For example, smhasher finds issues with SHA3 for the "PerlinNoise" input sets. That input set hashes small integers in [0, 4096), with seeds in [0, 4096); I'm not convinced the sha3 wrapper does anything useful with the seed here https://github.com/rurban/smhasher/blob/37cffd7b9cdaa2140c53... . I expect something similar is happening with SHA1 and SHA2.
The MD5 row shows no failure; only the function that truncates to the low 32 bit has failures.
You can read the test harness or the test log (e.g., https://github.com/rurban/smhasher/blob/master/doc/blake2b-2...) and apply your own significance threshold. The statistical tests are nothing special or novel (counting collisions in bit ranges of the output, and bias in individual bits, mostly); the interesting part is how the various tests generate interesting sets of inputs. In the end, it's a bit like the PRNG wars: you can always come up with a test that makes a function look bad, but a ton of failures is definitely a bad sign.
Considering the number of tests and hash functions, the failed tests will simply be noise. The default config should use a much higher significance threshold, testing for a number of seeds and looking at the combined statistics would be one way of achieving that.
> The MD5 row shows no failure; only the function that truncates to the low 32 bit has failures.
Truncated MD5 should behave like an ideal 32-bit hash function.
> it's a bit like the PRNG wars: you can always come up with a test that makes a function look bad, but a ton of failure is definitely a bad sign.
For a CSPRNG you shouldn't be able to come up with any test that will fail more often than you'd statistically expect from truly random numbers (averaging over random enough seeds).
This is great! Many applications (like dedupe) don't need full crypto guarantees.
If you need something fast, and crypto secure, I recommend checking out Blake3/b3sum. I'm just learning about XXH3 in this thread so I cannot comment on how it stacks up but I love b3sum for fast file hashing.
> The Meow hash is a high-speed hash function named after the character Meow in Meow the Infinite.
As someone who doesn't watch cartoons, I'm constantly surprised at the influence cartoons and anime are having on modern tech, especially with regard to computer science.
I guess you should avoid using stuff like a Python interpreter, a Squeak VM, SOCKS proxies or ChaCha encryption too.
If your team dismisses good solutions because it has "meow" in the name, you probably need a better team. I would recommend that they alias their `cat` command as "concatenate" in order to prevent further ridicule.
It's a pretty small app for some researchers. The front end is Angular, the backend is an API in Bottle. At most, there are 10 people generating data and about 40 people reading data. Most of the time there are 10 users uploading and 5-10 users reading. The readers are external partners. Hashid is to obfuscate IDs we have in another system so these external people cannot tie the data back, due to privacy issues. It generates about 1000 rows of data across 4 PostgreSQL tables in a given day.
I like how the algorithm has been well supported across languages. We have no problems with it on the Python side. An analyst runs monthly reports using R against the data (so about 30k records pulled). Numbers are summarized and stats are generated. The hashing isn't a bottleneck at all.
As noted in the post title, the original article is from 2018. I had either forgotten about Meow hash or just never knew about it, so I appreciate the post, but I was curious about the timing of the post.
cmuratori mentioned meowhash in passing in a recent (i.e. few days ago) Refterm or optimisation video. It is possible that user DeathArrow watched this, too, and decided to submit the project homepage to HN.
Due to recent discoveries by Peter Schmidt-Nielsen, we have decided to reclassify Meow hash 0.5/calico from level 3 to level 1. This means that we recommend not to use this hash for message authentication codes, or for hash tables in scenarios where collision induced denial-of-service attacks are a concern.
For level 3/MAC capabilities consider migrating to SipHash. If the performance of SipHash is not satisfying, continuing to use Meow 0.5 for hash tables is better than migrating to another fast hash.