> Plus UTF-8 has more invalid encodings to deal with than a super-simple format....

Dylan16807 · on Jan 1, 2024

The simple encoding already allows the bytes of shorter characters to show up as sub-sequences inside the encodings of longer characters. Non-canonical is not a big deal on top of that. Also there are other banned bytes you don't need to deal with.
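For concreteness, here is a minimal Python sketch of the kinds of input a strict UTF-8 decoder refuses: a non-canonical (overlong) form, a byte that never occurs in valid UTF-8, an encoded surrogate, and a stray continuation byte. The helper name `is_valid_utf8` is made up for illustration; it only wraps Python's built-in strict decoder.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if Python's strict UTF-8 decoder accepts `data`."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


canonical_slash = b"\x2f"            # "/" (U+002F) in its only valid form
overlong_slash = b"\xc0\xaf"         # the same code point padded out to two bytes
never_valid_byte = b"\xff"           # 0xFF cannot appear anywhere in UTF-8
encoded_surrogate = b"\xed\xa0\x80"  # U+D800, a surrogate, is not encodable
stray_continuation = b"\x80"         # continuation byte with no lead byte

for sample in (canonical_slash, overlong_slash, never_valid_byte,
               encoded_surrogate, stray_continuation):
    print(sample.hex(" "), is_valid_utf8(sample))
```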

> If you're working a byte at a time you're doing it wrong, unless you're re-syncing an invalid stream

It's very relevant to explaining the encoding and it matters if you're worried that invalid bytes might exist. You can't just ignore the extra complexity.

Also if you're not working a byte at a time, that kind of implies you parsed the characters? In which case non-canonical encodings are a non-problem.
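As a rough illustration of that last point: once bytes have been turned into code points, a canonical and a non-canonical spelling are indistinguishable. The decoder below is a deliberately lenient sketch, not a real library API, and it borrows UTF-8's one- and two-byte layout only because the simple format's exact layout isn't spelled out in this thread.

```python
def decode_lenient(data: bytes) -> list[int]:
    """Hypothetical decoder: 1- and 2-byte UTF-8-style sequences,
    with no check that the shortest form was used."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            # One-byte form: the byte is the code point.
            code_points.append(lead)
            i += 1
        elif 0xC0 <= lead <= 0xDF and i + 1 < len(data) and 0x80 <= data[i + 1] <= 0xBF:
            # Two-byte form: 5 payload bits from the lead, 6 from the continuation.
            code_points.append(((lead & 0x1F) << 6) | (data[i + 1] & 0x3F))
            i += 2
        else:
            raise ValueError(f"unsupported or malformed byte at offset {i}")
    return code_points


canonical = decode_lenient(b"/")            # shortest form of U+002F
non_canonical = decode_lenient(b"\xc0\xaf")  # overlong form of U+002F
print(canonical == non_canonical)
```

After decoding, both inputs are just the code point U+002F, so any comparison or search done on code points cannot tell them apart.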

morelisp · on Jan 1, 2024

> Non-canonical is not a big deal on top of that.

Unless you want to actually do anything with the string beyond decoding a code point.

Dylan16807 · on Jan 1, 2024

If you're going beyond decoding, then you're beyond the stage where canonical and non-canonical versions exist any more.

Non-canonical encodings make it difficult to do things without decoding, but you have bigger problems to deal with in that situation, and the non-canonical encodings don't make it much worse. Don't get into that situation!

Specifically, even with only canonical encodings, one- and two-byte characters can appear inside the encodings of two- and three-byte characters. You can't do anything byte-wise at all, unlike UTF-8. But you already said "If you're working a byte at a time you're doing it wrong" so I hope that's not too big of an issue?
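To make the overlap concrete, here is a sketch with a made-up "simple" format. The layout in `toy_encode` is an assumption for illustration only (a lead byte carrying the length plus the high bits, followed by raw payload bytes); it is not the format proposed upthread, just one that has the property described.

```python
def toy_encode(text: str) -> bytes:
    """Hypothetical simple format: the lead byte says how long the character is,
    and the trailing bytes are raw payload that can take any value."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)                                      # 1 byte: 0x00-0x7F
        elif cp < 0x4000:
            out += bytes([0x80 | (cp >> 8), cp & 0xFF])         # 2 bytes: lead 0x80-0xBF
        else:
            out += bytes([0xC0 | (cp >> 16), (cp >> 8) & 0xFF, cp & 0xFF])  # 3 bytes
    return bytes(out)


two_byte_char = "\u0141"        # U+0141 encodes as 81 41 in the toy format
three_byte_char = chr(0x18141)  # encodes as C1 81 41 in the toy format

# The one-byte character "A" (0x41) shows up inside the two-byte character.
print(b"A" in toy_encode(two_byte_char))
# The whole two-byte character shows up inside the three-byte character.
print(toy_encode(two_byte_char) in toy_encode(three_byte_char))
# In UTF-8, U+0141 is C5 81, so the same byte-wise search finds nothing.
print(b"A" in two_byte_char.encode("utf-8"))
```

A naive byte-wise search for `A` therefore reports a hit inside an unrelated character in the toy format, while UTF-8's disjoint lead-byte and continuation-byte ranges rule that out.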