
The UTF-8 encoding is designed so that this is usually not a problem. If you search a UTF-8 encoded byte array for an ASCII character, for example, you can never get a false positive. Every byte of a multi-byte UTF-8 sequence has its most significant bit set, and ASCII characters always have it unset. Additionally, treating the string as an array of Unicode codepoints doesn't solve the problem: now you have people screwing around with individual codepoints inside grapheme clusters :P
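
To make the no-false-positive point concrete, here is a minimal sketch in Python. The helper name `find_ascii_byte` and the sample string are made up for illustration; the only thing it relies on is that `str.encode("utf-8")` produces standard UTF-8.

```python
def find_ascii_byte(haystack: bytes, needle: str) -> list[int]:
    """Return every byte offset where the single ASCII character occurs.

    Hypothetical helper: it scans raw bytes and never decodes anything.
    """
    target = needle.encode("ascii")
    if len(target) != 1:
        raise ValueError("needle must be exactly one ASCII character")
    return [i for i, byte in enumerate(haystack) if byte == target[0]]


text = "naïve café/日本語/ok"
data = text.encode("utf-8")
hits = find_ascii_byte(data, "/")

# Every byte belonging to a non-ASCII character has its high bit set,
# so it can never compare equal to an ASCII byte such as b"/".
non_ascii_bytes = [b for ch in text if ord(ch) > 0x7F for b in ch.encode("utf-8")]
all_high_bit_set = all(b & 0x80 for b in non_ascii_bytes)

# Each hit also sits on a character boundary: the prefix before it decodes cleanly.
prefixes_decode = all(data[:i].decode("utf-8") == text[:text.index("/")] or True
                      for i in hits)
```

The scan works on bytes alone, which is why byte-oriented tools like `memchr` or `split` on an ASCII delimiter stay safe on UTF-8 input.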


> Additionally, treating the string as an array of unicode codepoints

I suggested no such thing.

> individual codepoints inside grapheme clusters

That's less severe than invalid codepoints.
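
A rough sketch of the difference in Python, reading "invalid codepoints" as "byte sequences that no longer decode as UTF-8" (that reading is an assumption, as is the helper name `is_valid_utf8`):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# One grapheme cluster made of two codepoints: 'e' + COMBINING ACUTE ACCENT.
cluster = "e\u0301"
encoded = cluster.encode("utf-8")  # three bytes: 0x65, 0xCC, 0x81

# Cutting between codepoints changes the text (the accent is lost),
# but what remains is still a well-formed string.
codepoint_cut = cluster[:1]
codepoint_cut_ok = is_valid_utf8(codepoint_cut.encode("utf-8"))

# Cutting between bytes strands the lead byte 0xCC without its
# continuation byte, so the result is not valid UTF-8 at all.
byte_cut = encoded[:2]
byte_cut_ok = is_valid_utf8(byte_cut)
```

Here `codepoint_cut_ok` comes out true and `byte_cut_ok` false: the first mistake garbles the meaning, the second breaks the encoding.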

Perhaps the whole thing, whichever way it is represented, should not be mutable, given that there's no sensible way to make it mutable?



