To anybody unfamiliar with recent progress: boost::unordered_flat_map is currently the fastest open-addressing hash map implementation for C++. It came out just a few months ago and outperforms absl::flat_hash_map and the various other implementations benchmarked in https://martin.ankerl.com/2022/08/27/hashmap-bench-01/.
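If you want to try it, the API is essentially a drop-in match for std::unordered_map, minus pointer and iterator stability (it's open addressing, so rehashing moves elements). A minimal usage sketch, assuming Boost 1.81 or newer:

    #include <boost/unordered/unordered_flat_map.hpp>
    #include <cstdio>
    #include <string>

    int main() {
        // Same interface as std::unordered_map, but a flat, open-addressing
        // layout underneath, so element addresses are invalidated on rehash.
        boost::unordered_flat_map<std::string, int> counts;
        counts["apple"] += 1;
        counts["banana"] += 2;

        if (auto it = counts.find("apple"); it != counts.end()) {
            std::printf("%s -> %d\n", it->first.c_str(), it->second);
        }
    }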
It's a little hard to take seriously some of the top performers in that table, like emhash8, when the implementations are only a couple of weeks old and the commits are things like "fix crash", "fix clear", and "fix overflow".
I think I need more context for what you’re trying to say here. I understand “web scale” as a marketing term in the Mongo sense, but I don’t know how that applies to this.
Imagine you could interact with an object that had the API of all data structures. And it looked like a HashMap if you ran hash operations on it, or like a linked list if you interacted with it that way.
Behind the scenes the compiler would analyse all your calls to that data structure and see what APIs you use to interact with it.
It can then decide which algorithms to use to optimise your interaction with that data structure.
If only computers could do hot/cold analysis. It could detect that you call a hash lookup in a hot for loop but only call the ordered map iterator from the admin screen and outside a loop.
It would then JIT compile the right data structure algorithm from a bucket of implemented algorithms, so the data is kept optimised for its usage patterns.
Decisions such as whether to use tombstones, compaction, vacuuming, LSM trees, btrees, ordered or unordered layouts, LSM trees with separate value storage, a growable array, a std::list or a vector all depend on what your intended access patterns to the data are.
Could also do clever materialized views. If you only use one view of the data rarely, it isn't efficient to maintain that view.
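You can fake a very crude version of this today, without any compiler or JIT support, with a wrapper that records how it's being used and promotes its backing structure when the pattern changes. Everything below is hypothetical (the AdaptiveMap name, the promotion threshold), just to show the shape of the idea:

    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Hypothetical sketch: starts life as a plain vector of pairs (cheap to build
    // and iterate), and promotes itself to a hash map once point lookups dominate.
    // Ignores duplicate keys and uses a made-up heuristic; a real version would
    // need much smarter bookkeeping.
    class AdaptiveMap {
    public:
        void insert(int key, int value) {
            if (hashed_) map_[key] = value;
            else         vec_.emplace_back(key, value);
        }

        const int* find(int key) {
            if (!hashed_ && ++lookups_ > kPromoteAfter) promote();
            if (hashed_) {
                auto it = map_.find(key);
                return it == map_.end() ? nullptr : &it->second;
            }
            for (auto& kv : vec_)
                if (kv.first == key) return &kv.second;
            return nullptr;
        }

    private:
        void promote() {
            map_.insert(vec_.begin(), vec_.end());  // switch representations once
            vec_.clear();
            hashed_ = true;
        }

        static constexpr int kPromoteAfter = 32;  // arbitrary threshold
        bool hashed_ = false;
        int lookups_ = 0;
        std::vector<std::pair<int, int>> vec_;
        std::unordered_map<int, int> map_;
    };

The interesting (and hard) part is exactly what the parent describes: doing that selection from observed hot/cold usage automatically rather than with a hand-written threshold.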
PHP's arrays are a bit like this. You can treat them either like an (ordered) hashmap or like a growable array. The underlying representation will adapt depending on whether the current state can be represented as an array.
And I hate it! I'm always unsure about an array's contents. I write functions that expect to operate on linear arrays and might misbehave if given a map, or vice versa. It's rare for the conflation to be helpful. Arrays and maps have different meanings and are used in different ways.
Maybe a very mild version of this would be helpful, like automatically converting an array to an array-backed deque—though even that case seems better handled by a linter making the suggestion, so the programmer can make an informed decision.
Going all the way would be terrible. Types are helpful. Even in a dynamically typed language they give you a vocabulary.
I'm not convinced it's a good idea in general, because usually the author will have a good idea about how the data will be accessed and what kind of data there is, so if you can choose up front you may as well.
Plus having runtime analysis like that can lead to weird and unpredictable performance characteristics.
I have tried swiss tables but I have been unable to make these match the performance of robin hood tables with a single-stage lookup. Fundamentally the issue seems to be that swiss tables require one to read 2 or 3 cache lines while the alternative requires one to read only 1 cache line per lookup (most of the time). The included benchmarks are not that useful for inserting many items or looking up many items because one can achieve much higher throughput using prefetching (like <<10ns/op), but this sort of consideration never makes it into any graphs.
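On the prefetching point: the trick is that if you know a batch of keys up front, you can hash them all, issue prefetches for their buckets, and only then do the actual probes, so the memory latency of one lookup overlaps with the work of the others. A rough sketch of the pattern against a toy open-addressing table (the table layout, hash, and batch size are all assumptions, not anything from the benchmarked libraries; __builtin_prefetch is GCC/Clang-specific):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // One flat array of slots, power-of-two sized, linear probing.
    struct Slot { std::uint64_t key; std::uint64_t value; bool used; };

    std::uint64_t lookup_batch(const std::vector<Slot>& table,
                               const std::uint64_t* keys, std::size_t n) {
        const std::size_t mask = table.size() - 1;   // size assumed power of two
        constexpr std::size_t kBatch = 8;
        std::uint64_t sum = 0;

        for (std::size_t i = 0; i < n; i += kBatch) {
            const std::size_t end = std::min(i + kBatch, n);
            std::size_t idx[kBatch];

            // Pass 1: compute home slots and prefetch their cache lines.
            for (std::size_t j = i; j < end; ++j) {
                idx[j - i] = static_cast<std::size_t>(keys[j] * 0x9E3779B97F4A7C15ull) & mask;
                __builtin_prefetch(&table[idx[j - i]]);
            }

            // Pass 2: by the time we probe, the lines are (hopefully) resident.
            for (std::size_t j = i; j < end; ++j) {
                std::size_t s = idx[j - i];
                while (table[s].used && table[s].key != keys[j]) s = (s + 1) & mask;
                if (table[s].used) sum += table[s].value;
            }
        }
        return sum;
    }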
I wish that such benchmarks included bad old std::map, too, so that I could beat my stubborn coworker with them until he finally started using hash maps by default.
Related: this talk https://www.youtube.com/watch?v=ncHmEUmJZf4 about the making of absl::flat_hash_map (a "competitor" to boost::unordered_flat_map) is one of my favorites.
I have a question about time complexity for hash maps. What is considered the complexity of inserting an initially unknown number n of items into the map?
My longstanding belief was that this complexity should be O(n log n) because a particular hash size will be chosen and then have to be repeatedly grown as the density of the map's population overwhelms the current size of the structure. An exponentially increasing size gives growth steps every log n items, and the growth step will require O(n) time.
Now everyone seems to believe that hash map insertion is O(1) time, with the only exception being a degenerate coincidence between the hash function and the hashed data that puts a large number of items into the same bucket.
Do current hash maps avoid O(n) complexity when growth is required? How?
In theory you are right: any individual insert that triggers growth costs O(n). But because the capacity grows geometrically, those expensive inserts are rare enough that the total rehashing work over n inserts is O(n), i.e. O(1) amortized per insert. Sizing the hash map appropriately up front (reserving capacity) is one of the tuning knobs for avoiding even that, and close-to-O(1) inserts and lookups are why you chose a hash map in the first place.
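A quick way to convince yourself of the amortized bound: with geometric growth, the total relocation work over n inserts is 1 + 2 + 4 + ... < 2n. A toy simulation of just the growth cost (not a real hash map) shows the same thing:

    #include <cstdio>

    int main() {
        // Insert n items into a table that doubles when full and count how many
        // element relocations all the rehashes cost in total.
        const long n = 1000000;
        long capacity = 1, size = 0, relocations = 0;

        for (long i = 0; i < n; ++i) {
            if (size == capacity) {
                relocations += size;   // every existing element is rehashed once
                capacity *= 2;
            }
            ++size;
        }

        // Prints roughly 1.05 relocations per insert: O(1) amortized, even though
        // an individual insert that triggers growth costs O(current size).
        std::printf("n = %ld, relocations = %ld (%.2f per insert)\n",
                    n, relocations, static_cast<double>(relocations) / n);
    }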
Sometimes I wonder why these nifty data structures that someone already implemented in C++ don't immediately (or even 5+ years later) have ports to other languages. Especially languages that are closer to the metal than average (e.g. Rust, Go, ...)
What sort of languages are you thinking of which are "closer to the metal" than C++ but support the features you'd want for these structures?
The closest I can think of is Rust, but it does have ports of such things. Rust's HashMap is a Swiss Table: Abseil's flat_hash_map ported to Rust (via hashbrown). Haphazard is a Rust port of the Folly library's hazard pointers (which might some day be C++26 or C++29 Hazard Pointers).
perhaps mkoubba meant “closer to the metal” than the average programming language, like, say, Swift or even C#, not “closer to the metal” than C++. Otherwise the comment makes little sense, as I would not even consider Rust “closer to the metal” than C++. Maybe C or x86 assembly?
There are different semantics for these data structures in different languages. These tend to exist for corner cases, but those corner cases constrain your design.
So you might not be able to replace the default Java hashmap with absl::flat_hash_map, but you can steal ideas and maybe do better in other ways where the C++ map would be constrained. However, doing that is work that someone has to pick up.
And releasing a (e.g.) Go library that breaks from the built in map semantics might not see much adoption in that community. C++ actually seems a bit unique in terms of how many widely used implementations there are of key data structures (e.g. strings, maps, trees, inline vectors). Not criticizing just making an observation.
This is a really good point. C++ developers seem to be more opinionated than other communities about data structure semantics in my experience. Possibly because the language chose the wrong semantics for the original standard library, whereas other languages made more sensible choices.
If you want to see utter hell on earth, check out CGAL. Even at my peak C++ TMP days it was bad, being 10 years out of C++ completely, it’s vomit inducing.
I still point to the wide adoption of boost:: as a reason why C++ should never be used. If 4 pages of a single error message aren't enough to convince someone, nothing will.
Those 4 pages of errors are useful as hell though. Some languages are better at giving you just the important bits (Rust) but a lot of languages suffer from "simple errors obscuring complex problems" syndrome.
C++ compile errors, particularly overloading and template errors are normally just some fairly tiny succinct errors (could not find a suitable method/type/operator) with a whole bunch of debug information attached that tells you exactly what the compiler tried and how it came to the conclusion that it could not proceed.
I can take that information and work out exactly why my code is failing to compile 99% of the time, but in other languages it's rather often that I get a succinct 1-3 line error that provides little to no context beyond where in the code it is. That leaves me digging through Stack Overflow in the hope of finding someone with the same error and a similar code layout, so I can determine what is inducing the error.
Don't get me wrong, C++ has a whole lot of issues which justify why things like Rust should be preferred wherever possible but the errors aren't nearly as bad as people make them out to be. I'm much more concerned with the cancer that are C & C++ arithmetic rules(https://twitter.com/hackingcpp/status/1492242039110524935). Likewise for the numerous ways you can accidentally introduce undefined behavior when interacting with non "new C++" code (and even sometimes then as well).
Honestly I have to disagree. There is nothing particularly special about what Super Template Tetris(STT) is doing.
At its core, template metaprogramming is just functional programming at compile time. STT is just a template and a runtime function which do the following:
1. take an input via compile time flag (the `-D DIRECTION`)
2. take a type input from an included header file containing the current state (`#include "current_game.h"`)
3. via functional programming, compute the results of a single step of the game.
4. specialise a single function using the results of step 3. this function prints the computed result to the screen and the computed game state to a file (`./current_game.h`).
5. gcc/clang exits. compilation is complete.
6. call the compiled binary.
7. the binary runs the specialised function and prints the outputs.
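To make that concrete, here is a miniature of the same shape (not STT's actual code): the compiler computes the next state from an included header plus a -D flag, and the resulting binary prints it and rewrites the header for the next compile. It assumes you create a current_state.h containing `#define STATE 0` and build with something like `g++ -DSTEP=1 step.cpp`:

    // current_state.h is expected to contain a line like:  #define STATE 0
    #include "current_state.h"
    #include <fstream>
    #include <iostream>

    // "Functional programming at compile time": the next state is computed by the
    // compiler; here it's a trivial counter, in STT it's a whole game-board type.
    template <int S> struct Next { static constexpr int value = S + STEP; };

    int main() {
        constexpr int next = Next<STATE>::value;

        // Print the computed result...
        std::cout << "state is now " << next << "\n";

        // ...and overwrite the included header, so the *next* compilation starts
        // from the new state. This is the runtime file IO the thread is about.
        std::ofstream("current_state.h") << "#define STATE " << next << "\n";
    }

Each compile-then-run cycle advances the state by one step, which is exactly the loop STT drives from the command line.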
Sure it's fucky and you shouldn't do that in production but what sane individual is writing a piece of code that at runtime (after compiling) seeks out one of its own source files and modifies that file?
To prevent this from being possible you'd have to remove runtime file IO from the language. The other potential solutions wouldn't work:
1. Remove templates entirely: Still would be possible using https://github.com/Hirrolot/metalang99 which solely uses the preprocessor. Given that the pre-processor is literally just term substitution (a glorified copy/paste engine), if you removed that as well, you'd have to accept no form of metaprogramming at all.
2. Remove the ability to #include other files: Could still be done by doing everything inline. `#include` is just copy-paste anyways, so it's more an abstraction than anything else to the compiler and preprocessor; it's basically the same as if all the code was pasted into the same file.
That leaves you with removing file IO. Without IO a programming language is basically useless, particularly as a systems programming language.
Most production C++ codebases are very wary of using much boost. At two of the four places I worked (I wrote C++ at all of them), the use of boost was very minimal and only when absolutely needed. At the first place where this was the rule, the only boost usage was boost.asio; at my current place boost.circular_buffer is the only thing we use.
Who cares about 4 pages of error message? It's one double-click away in my IDE to go to the line in my code that has the issue, whether the message is one line or 500. This is an absolute non-problem. And if you don't have an IDE, just ctrl-F the magic words "required from here" (with gcc) or "requested here" (with clang); the last one will point to the place in your code where the issue comes from.