
They completely ignore the startup time when reporting query time, even though it dominates the query time by a factor of 1000.

Some PHFs can be pre-compiled, while most need to be deserialized at run time. I worked on a pre-compiled pthash variant, but ran into C++ bugs.

In some of them, the ordered variants have a huge overhead to check for false positives.

For small n, gperf is still the fastest by far. And it is pre-compiled only.



From what you describe, I think you have a somewhat special use case: it sounds like you are compiling it. The experiments done in the survey are not: instead, the hash function is used in the form of a data structure, similar to a Bloom filter ("deserialized" in your words). Do you use it for a parser? How many entries do you have? The survey uses millions of entries. "Startup time in the query time": I'm not quite sure what you mean, I'm afraid. Could you describe it?

I'm also not sure what you mean by "check for false positives"... each entry not in the set returns a random value; maybe for you this is a "false positive"? A typical solution for this is to add a "fingerprint" for each entry (just a hash value per entry). That way there is still some false positive probability, which may or may not be acceptable (it is typically considered acceptable if the hash function is cryptographically secure and the fingerprint size is big enough: e.g. 128 bits of the SHA-256 sum). Depending on the data, it may be faster to compare the value (e.g. in a parser).
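A minimal sketch of the fingerprint idea in C++ (the names and the one-byte fingerprint size are made up for this example; mphf stands in for whatever perfect-hash library is used):

    #include <cstdint>
    #include <functional>
    #include <string>
    #include <vector>

    // The MPHF maps the n known keys to slots 0..n-1; for unknown keys it
    // returns an arbitrary slot, so a stored fingerprint per slot is used
    // to reject most of them.
    struct FingerprintedSet {
        std::function<uint64_t(const std::string&)> mphf;
        std::vector<uint8_t> fingerprint;  // filled at build time with the
                                           // truncated hash of each key

        bool maybe_contains(const std::string& key) const {
            uint64_t slot = mphf(key);
            if (slot >= fingerprint.size()) return false;
            uint8_t fp = static_cast<uint8_t>(std::hash<std::string>{}(key));
            // Equal: either the key is in the set, or a ~1/256 false
            // positive (use more fingerprint bits for a lower rate).
            return fingerprint[slot] == fp;
        }
    };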


Using it like gperf is certainly not a special case. If your keys are fixed and known at compile time, you can also pre-compile the hash function with its data structures, rather than being forced to deserialize it at startup.

When comparing those MPHF query times, the startup time (the deserialization from disk) is 1000x higher than the actual query time. When you compile those data structures, the load time is instant. Also, memory usage is twice as low.


Many modern perfect hash functions are super close to the space lower bound, often having just between 3% and 50% overhead. So your claim that the space consumption is "twice as low" is information-theoretically impossible. With gperf, the space consumption is in the machine code instead of a data structure, but it's definitely not "for free". In fact, I'm pretty sure that the size of the machine code generated by gperf is very far from optimal. The only exception are tiny input sets with just a couple of hundred keys, where gperf probably wins due to lower constant overheads.
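To make the bound concrete, a back-of-the-envelope calculation (my numbers, not from the survey): a minimal perfect hash function needs at least log2(e) ≈ 1.44 bits per key, so even with 50% overhead that is about 2.16 bits per key, i.e. roughly 264 KiB for a million keys:

    #include <cmath>
    #include <cstdio>

    // Space lower bound for a minimal perfect hash function:
    // log2(e) ~ 1.44 bits per key.
    int main() {
        const double bits_per_key = std::log2(std::exp(1.0));  // ~1.4427
        const double n = 1e6;                                   // one million keys
        std::printf("lower bound: %.0f KiB, with 50%% overhead: %.0f KiB\n",
                    bits_per_key * n / 8 / 1024,
                    1.5 * bits_per_key * n / 8 / 1024);
    }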


My special use case is e.g. Unicode property checks, which have millions of keys and for which gperf is not big enough. Other cases are integer keys.

I'm certainly not willing to load the keys and MPHF properties from disk at query time, as they are known in advance and can be compiled to C or C++ code in advance. That leads to an instant load time, as opposed to the costly deserialization times in all your tools.

Your deserialization space overhead is not accounted for, and neither are the storage costs for the false-positive check. It's rather academic, not practical.


There is no deserialization time or space overhead. The measurements refer to the deserialized form. They are not loaded from disk.

About false positive checks, I think you misunderstand what a perfect hash function does.


See, everybody can see how you cheat in your benchmarks. It's unrealistic to measure only the perfect hash function when you discard the cost of deserialization and the false-positive checks.

That's always the case when dealing with you academics.

We don't care how you define your costs; we just laugh about that. In reality you have to load the keys and the data structures and check for existence. You even discard the existence checks by omitting the false-positive checks: checking against a non-existing key will lead to a hit in your scheme. Only very rarely do you know in advance whether your key is in the set.


Also, you omit the space costs for the ordering.


> using it like gperf is certainly not a special case.

Well... let's put it like this: in this survey (of which I am one of the co-authors), "parsers" are not mentioned explicitly in the "Applications" section. They are a subset of "Hash Tables and Retrieval". There are many other uses: Approximate Membership, Databases, Bioinformatics, Text Indexing, Natural Language Processing. Yes, parsers are mentioned in "The Birth of Perfect Hashing". Maybe we can conclude that parsers are not the "main" use case nowadays.

> when you compile those data structures, the load time is instant.

Well, in this case I would recommend using a static lookup table in the form of source code. That way, it is available to the compiler at compile time and doesn't need to be loaded and parsed at runtime: it is simply available as a data structure. All modern MPHF implementations in this survey support this usage. But yes, they are not optimized for this use case: you typically don't have millions of keywords in a programming language.
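A toy illustration of that approach (my own example, not from the survey; the hash here is hand-picked for four keywords, while real generators such as gperf or an MPHF tool emitting C code produce the same shape, just larger): both the table and the hash function live in the binary, so nothing is deserialized at startup.

    #include <cstdio>
    #include <cstring>

    // Keyword table and (hand-picked) perfect hash baked into the binary:
    // there is nothing to load or parse at runtime.
    static const char* const kKeywords[4] = { "return", "for", "while", "if" };

    static unsigned toy_hash(const char* s, std::size_t len) {
        return (static_cast<unsigned char>(s[0]) ^ static_cast<unsigned>(len)) % 4;
    }

    static bool is_keyword(const char* s) {
        std::size_t len = std::strlen(s);
        // The exact-match comparison rejects strings that are not keywords.
        return std::strcmp(kKeywords[toy_hash(s, len)], s) == 0;
    }

    int main() {
        std::printf("%d %d\n", is_keyword("while"), is_keyword("switch"));  // prints "1 0"
        return 0;
    }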



