Transcoding Unicode strings at crazy speeds with AVX-512

amluto · on Sept 13, 2023

I wish these libraries (simdutf, simdjson, etc) supported translation of possibly broken utf8 to valid utf8 with replacement of errors. IMO this ought to be available out of the box, just like Python’s decode(errors='replace'). This is extremely common — utf8 is the dominant codec, and plenty of applications receive supposedly-utf8 data from untrusted sources and need to display it or further process it without just giving up if it’s bad.

Yes, I realize that this can be done by hand using simdutf’s API. No, users should not be expected to roll it themselves.

clausecker · on Sept 13, 2023

The algorithms described in the paper all perform input validation and transcode until the first encoding error or until end of input, whichever comes first. They also tell you how much of the input they transcoded, so it's indeed quite easy to call the transcode function and then to manually fix up errors if it chokes.

We had considered adding a variant that could automatically fix up malformed input, but as there are multiple mutually incompatible conventions for dealing with encoding errors, we decided to ignore this case for now. Perhaps a future version of simdutf may provide support for best-effort transcoding of errorneous input.

amluto · on Sept 13, 2023

I can indeed "quite easily" hack this up. But this is C++, and C++ string manipulation is infamously unsafe and error-prone, and I really have very little desire to hack this up, write tests, etc, when I'd rather not think about UTF-8 at all.

For an example of how unsafe and error-prone C++ string manipulation is, from simdutf's very own documentation:

> simdutf_warn_unused size_t convert_latin1_to_utf8(const char * input, size_t length, char* utf8_output) noexcept;

Eww, this literally cannot be safe in the sense of "free from memory-safety errors when called carelessly" (like Rust or almost any GC language would expect). Look, takes a single pointer to an output array with no length or other limit. It overruns the buffer unless you get it exactly right. (ThePhD has a very on-point rant or two [0] about this, and ztd.text IMO (and in his opinion) gets this right.) Also, for all that simdutf is really quite fast, it's really not obvious how to transcode latin-1 to utf-8 in a single pass, because you apparently need to compute the length first.

I think one of the great strengths of safer languages is that they fight back when you try to design an API like this.

[0] https://thephd.dev/output-ranges

clausecker · on Sept 13, 2023

Simutf is designed for performance. The function is designed for you to provide a buffer of sufficient length. It will return the number of bytes it actually needed. You can call utf8_length_from_latin1 to find the required buffer length, or provide a buffer that is twice as large as the input buffer, which will always suffice. We intend that users build high-level APIs around these functions if so desired.

> Also, for all that simdutf is really quite fast, it's really not obvious how to transcode latin-1 to utf-8 in a single pass, because you apparently need to compute the length first.

You do this by having the caller pre-allocate a buffer that will always be long enough (twice the length of the input). Some bytes of the buffer may end up being unused, but that's usually acceptable.

amluto · on Sept 13, 2023

> Simutf is designed for performance.

I know. When I was less experienced, I believed more in this kind of isolated performance. But...

> You can call utf8_length_from_latin1 to find the required buffer length

For smallish input (that fits in L1 cache, for example), performance might be dominated by the cost of filling the cache, in which case two passes might not be that bad. For large inputs, especially at the kinds of speeds that simdutf8 achieves, memory bandwidth is likely to dominate or come close, and two passes is not so great. In either case, I bet that a high quality length-checked implementation of a function like convert_latin1_to_utf8 could be considerably faster than calling utf8_length_from_latin1 and then convert_latin1_to_utf8.

> or provide a buffer that is twice as large as the input buffer, which will always suffice.

and will always be fast as long as wasting up to half the memory is not a big deal. But in many cases, that factor of 2 of memory usage is a big deal.

So I hate to say it, but I tend to think that simdutf's transcoding design is based on benchmarking the wrong thing. (The validation benchmark design is fine.)

RaisingSpear · on Sept 14, 2023

> wasting up to half the memory

OSes generally implement overcommit - allocating a bunch of memory doesn't mean it's actually being used.

amluto · on Sept 14, 2023

Sorry, but allocating approximately 2x the needed size for each string is not going to get covered up well by overcommit unless you are directly mmapping and munmapping (or DONTNEEDing or similar) each allocation and each allocation is more than 8kB. (Probably well over 8kB, and for actual good efficiency in the presence of huge pages, over 4MB so there’s a chance that half of it is unused.)

I doubt this workload exists. Overcommit is kind of nifty but not at all magical.

I bet a really good AVX-512 transcoder with output size checking could be almost as fast as the OP.

RaisingSpear · on Sept 15, 2023

> unless you are directly mmapping and munmapping (or DONTNEEDing or similar) each allocation

You don't free up the memory you're not using? Sounds like you're doing it wrong then.

> I bet a really good AVX-512 transcoder with output size checking could be almost as fast as the OP.

Submit your pull request and we can evaluate your bet.

clausecker · on Sept 13, 2023

Author here. Please let me know if you have any questions.

k0k0r0 · on Sept 14, 2023

Great to see you here!

shanielh · on Sept 13, 2023

Funny thing I just published a PR to RoaringBitmap and got answered by him. Small world.

pvg · on Sept 13, 2023

discussed recently https://news.ycombinator.com/item?id=37197921

clausecker · on Sept 13, 2023

That article was about a Latin 1 to UTF-8 routine, while this one is about UTF-8 from/to UTF-16.