IDK. It is impressive how far it gets with some direction, but to me this is sort of another example of "you have to know the LLM is wrong in order to get the right output" (at which point you're 98% of the way to writing the code yourself). I want tooling that just computes the right answer, not tooling that has to be coerced into an answer that is wrong only in ways I haven't yet detected.
For the concrete example posed in the video (the code-point decoding routine): are we assuming the internal char *s¹ are guaranteed to be well-formed UTF-8? If they are, the code should simply SIGABRT when it detects a violation of that invariant; if they aren't, we should probably return an error. The LLM's code opts to return a [hot garbage] value.
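For what I mean by "return an error": a minimal sketch, with names and conventions that are mine rather than the video's (swap the early returns for abort() if well-formedness really is an upstream invariant):

    #include <stddef.h>
    #include <stdint.h>

    /* Decode one code point from s[0..len). Returns bytes consumed,
     * or 0 on malformed input -- no silently returned garbage value. */
    static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *out)
    {
        static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
        size_t need;
        uint32_t cp;

        if (len == 0) return 0;
        if (s[0] < 0x80) { *out = s[0]; return 1; }
        if      ((s[0] & 0xE0) == 0xC0) { need = 2; cp = s[0] & 0x1F; }
        else if ((s[0] & 0xF0) == 0xE0) { need = 3; cp = s[0] & 0x0F; }
        else if ((s[0] & 0xF8) == 0xF0) { need = 4; cp = s[0] & 0x07; }
        else return 0;                      /* stray continuation or invalid lead byte */

        if (len < need) return 0;           /* truncated sequence */
        for (size_t i = 1; i < need; i++) {
            if ((s[i] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (s[i] & 0x3F);
        }
        if (cp < min_cp[need]) return 0;                /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;     /* surrogate */
        if (cp > 0x10FFFF) return 0;                    /* out of range */
        *out = cp;
        return need;
    }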
Where we're dealing with ZWJs … we're still also mixing in UTF-8 decoding concerns: separating out "parse UTF-8" from "figure out what USVs make up a grapheme" is the approach you want, but the output is spaghetti. It might be right spaghetti, but a video is hard to code-review, and "right spaghetti" isn't what I want from devs or LLMs cosplaying as devs.
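Roughly the layering I have in mind (names are mine; the pairwise break test is a simplification, since the real rules in UAX #29 need a property table and a bit of state, but the point is the decoder never leaks into the segmenter):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Layer 1: bytes -> Unicode scalar values (the decoder above, or similar). */
    size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *out);

    /* Layer 2: scalar values -> "is there a grapheme boundary between these two?"
     * Table-driven, per UAX #29; only the shape of the API is shown here. */
    bool is_grapheme_break(uint32_t before, uint32_t after);

    /* Layer 3: walk the buffer one grapheme cluster at a time. */
    static size_t next_grapheme(const unsigned char *s, size_t len)
    {
        uint32_t prev, cur;
        size_t pos = utf8_decode(s, len, &prev);
        if (pos == 0) return 0;                    /* empty or malformed */
        for (;;) {
            size_t n = utf8_decode(s + pos, len - pos, &cur);
            if (n == 0 || is_grapheme_break(prev, cur))
                return pos;                        /* cluster ends at pos */
            pos += n;
            prev = cur;
        }
    }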
There were some other feels, like you probably want to move the code that classifies code points into a table & make it data-driven. You would then probably want to generate that table from the actual Unicode source data files … not hope the LLM's very magical-looking code point numbers are correct? (spoiler: they're not.)
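Something with this shape, where the table is emitted by a script that reads GraphemeBreakProperty.txt / emoji-data.txt rather than typed in by hand (the handful of ranges below are just well-known examples to show the structure, not a usable table):

    #include <stddef.h>
    #include <stdint.h>

    typedef enum { CP_OTHER, CP_ZWJ, CP_EXTEND, CP_REGIONAL_INDICATOR } cp_class;

    typedef struct { uint32_t lo, hi; cp_class cls; } cp_range;

    /* GENERATED in real code; a few well-known entries here for illustration. */
    static const cp_range cp_table[] = {
        { 0x0300,  0x036F,  CP_EXTEND },              /* combining diacritical marks */
        { 0x200D,  0x200D,  CP_ZWJ },                 /* zero width joiner */
        { 0xFE00,  0xFE0F,  CP_EXTEND },              /* variation selectors */
        { 0x1F1E6, 0x1F1FF, CP_REGIONAL_INDICATOR },  /* regional indicators */
    };

    static cp_class classify(uint32_t cp)
    {
        size_t lo = 0, hi = sizeof cp_table / sizeof cp_table[0];
        while (lo < hi) {                    /* binary search over sorted ranges */
            size_t mid = lo + (hi - lo) / 2;
            if      (cp < cp_table[mid].lo) hi = mid;
            else if (cp > cp_table[mid].hi) lo = mid + 1;
            else return cp_table[mid].cls;
        }
        return CP_OTHER;
    }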
That said, … it probably fares about as well as most devs would when attempting to handle Unicode in a language like C that has no support for it at all.
This doesn't get into the other nightmare: for a given sequence of UTF-8 bytes emitted to a terminal, "the number of columns this will take to display" and "the number of columns the cursor will advance" aren't always the same number.
¹(char is a nightmare type for UTF-8 data. Bytes are nigh always represented as unsigned values (including by the LLM), and using a quasi-signed type is just asking for bugs. Amazingly, the LLM appears to get it right here, but UTF-8 data is unsigned char * to me. That doesn't play well with most of C's stdlib, but most of C's stdlib isn't that bright anyways.)
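The classic way plain char bites you, for the curious (illustrative, nothing to do with the video's code):

    #include <stdio.h>

    int main(void)
    {
        /* plain char is signed on most ABIs, so a UTF-8 lead byte like 0xE2
         * (first byte of U+20AC EURO SIGN) shows up as a negative number */
        const char   *s = "\xE2\x82\xAC";
        char          c = s[0];                 /* -30 when char is signed */
        unsigned char u = (unsigned char)s[0];  /* 226 always              */

        printf("%d\n", c >= 0xC0);   /* 0: the lead-byte check quietly fails */
        printf("%d\n", u >= 0xC0);   /* 1: does what it says                 */
        return 0;
    }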