
This has nothing to do with UTF-8, which doesn't and shouldn't care about anything beyond mapping bytes to code points.
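To illustrate that point, here is a minimal sketch using only Python's built-in `str.encode` and `bytes.decode`. The variable names are just for illustration. It shows UTF-8 round-tripping two different code point sequences for "é" without normalizing or otherwise interpreting them:

```python
# Two ways to write "é": one precomposed code point, or a base letter
# followed by a combining mark. UTF-8 treats them as unrelated sequences.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" + COMBINING ACUTE ACCENT

pre_bytes = precomposed.encode("utf-8")  # 2 bytes for 1 code point
dec_bytes = decomposed.encode("utf-8")   # 3 bytes for 2 code points

# Decoding returns exactly the code points that were encoded;
# nothing at the UTF-8 layer merges or reorders them.
pre_roundtrip = pre_bytes.decode("utf-8")
dec_roundtrip = dec_bytes.decode("utf-8")

print(pre_bytes, len(pre_roundtrip))
print(dec_bytes, len(dec_roundtrip))
```

Anything about how those code points combine on screen is decided at a higher layer than the byte encoding.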

But even as an addition to Unicode, your proposal would make text stateful (even over long distances), which is a really bad idea.






Combining characters have already made Unicode text stateful.

That said, I agree that encoding length hints into it seems like a bad idea: it creates an opportunity for the encoding to disagree with the reality of the text. You need _some_ way of handling the case where it says the next grapheme cluster is four characters long but it's actually only three.
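A rough sketch of that failure mode, assuming a hypothetical scheme where a hint claims how many code points the next cluster spans. The functions `first_cluster_length` and `check_hint` are made up for illustration, and the cluster rule here (one base code point plus any following combining marks) is a simplification, not full UAX #29 segmentation:

```python
import unicodedata

def first_cluster_length(text: str) -> int:
    """Count code points in the first cluster of text.

    Simplified rule (an assumption, not UAX #29): one base code point
    plus every immediately following code point whose canonical
    combining class is nonzero.
    """
    if not text:
        return 0
    length = 1
    for ch in text[1:]:
        if unicodedata.combining(ch) == 0:
            break
        length += 1
    return length

def check_hint(text: str, hint: int) -> tuple[bool, int]:
    """Compare a claimed cluster length with what the text contains."""
    actual = first_cluster_length(text)
    return actual == hint, actual

# "e" + COMBINING ACUTE ACCENT + COMBINING DOT BELOW, then a new base "x".
text = "e\u0301\u0323x"
ok, actual = check_hint(text, hint=4)
print(ok, actual)
```

Here the hint says 4 but the text only supports 3, so a decoder has to pick a policy: trust the text, trust the hint, or reject the input. Each choice has its own failure cases, which is the objection above.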


It's already stateful.


