> Plus UTF-8 has more invalid encodings to deal with than a super-simple format.
If your format supports non-canonical encodings you're in for a bad time no matter what, so a whole lot of that simplicity is fake.
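For a concrete picture of why non-canonical forms hurt, here is a minimal sketch using Python's built-in strict UTF-8 decoder. The bytes `C0 AF` are the classic overlong two-byte pattern carrying the value 0x2F (`/`). The helper name `is_rejected` is made up for this illustration.

```python
# Overlong ("non-canonical") two-byte pattern whose payload bits equal 0x2F, i.e. "/".
overlong_slash = b"\xc0\xaf"

def is_rejected(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder refuses the input."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False

# A naive byte-level check for "/" sees nothing suspicious here...
print(b"/" in overlong_slash)       # False
# ...so the decoder has to reject the sequence outright.
print(is_rejected(overlong_slash))  # True
print(is_rejected(b"/"))            # False
```

A format that accepted both spellings would have to normalize before any comparison or filtering, which is the "bad time" in question.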
> And it also means you're dealing with three classes of byte now.
If you're working a byte at a time you're doing it wrong, unless you're re-syncing an invalid stream in which case it's as simple as a continuation bit (specifically, it's two continuation bits).
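The "two continuation bits" are the top two bits of each byte: in UTF-8 every continuation byte looks like `10xxxxxx`. A rough sketch of re-syncing on that basis (the function name `resync` is an assumption for illustration, and it only finds a boundary; it does not validate what follows):

```python
def resync(data: bytes, pos: int) -> int:
    """Return the first index >= pos whose byte is not a UTF-8 continuation byte."""
    # Continuation bytes have the top two bits set to 10.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

text = "aé€".encode("utf-8")  # b'a\xc3\xa9\xe2\x82\xac'

# Index 2 is the second byte of "é"; skip forward to the start of "€".
start = resync(text, 2)
print(start)                        # 3
print(text[start:].decode("utf-8")) # €
```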
The simple encoding already allows the bytes of a smaller character to show up as a subset of a larger character's bytes. Non-canonical encodings are not a big deal on top of that. Also, UTF-8 has other banned bytes that the simple format doesn't need to deal with.
> If you're working a byte at a time you're doing it wrong, unless you're re-syncing an invalid stream
It's very relevant to explaining the encoding and it matters if you're worried that invalid bytes might exist. You can't just ignore the extra complexity.
Also if you're not working a byte at a time, that kind of implies you parsed the characters? In which case non-canonical encodings are a non-problem.
If you're going beyond decoding, then you're beyond the stage where canonical and non-canonical versions exist any more.
Non-canonical encodings make it difficult to do things without decoding, but you have bigger problems to deal with in that situation, and the non-canonical encodings don't make it much worse. Don't get into that situation!
Specifically, even with only canonical encodings, one- and two-byte characters can appear inside the encodings of two- and three-byte characters. You can't do anything byte-wise at all, unlike with UTF-8. But you already said "If you're working a byte at a time you're doing it wrong" so I hope that's not too big of an issue?
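To illustrate the overlap, here is a sketch that assumes the "simple" format is a big-endian 7-bits-per-byte scheme where the high bit means "more bytes follow". The thread never pins the format down, so `encode_simple` is a hypothetical stand-in, not anyone's actual spec.

```python
def encode_simple(cp: int) -> bytes:
    """Hypothetical simple encoding: 7 payload bits per byte, high bit set on all but the last byte."""
    groups = [cp & 0x7F]               # final byte, high bit clear
    cp >>= 7
    while cp:
        groups.append((cp & 0x7F) | 0x80)  # earlier bytes, high bit set
        cp >>= 7
    return bytes(reversed(groups))

a = encode_simple(ord("A"))        # b'A'
a_acute = encode_simple(ord("Á"))  # b'\x81A'

# A byte-wise search for "A" hits the tail of "Á" in the simple encoding...
print(a in a_acute)                 # True
# ...but never in UTF-8, where multi-byte characters only use bytes >= 0x80.
print(b"A" in "Á".encode("utf-8"))  # False
```

Under this assumed scheme a padded form such as `b"\x80A"` would also carry the value of "A", which is the non-canonical case discussed above.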