UTF-8 is designed so that this is usually not a problem. If you search a UTF-8 encoded byte array for an ASCII character, for example, you can never get a false positive: every byte of a multi-byte UTF-8 sequence has its most significant bit set, and ASCII bytes always have it unset. Additionally, treating the string as an array of Unicode code points doesn't solve the problem -- now you have people screwing around with individual code points inside grapheme clusters :P
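
To make both points concrete, here is a rough Python sketch (not anything from a particular library). The helper `find_ascii` and the sample string are made up for illustration. It scans raw UTF-8 bytes for an ASCII character, then shows that indexing by code point can still cut a grapheme cluster in half.

```python
from typing import List


def find_ascii(haystack: bytes, needle: str) -> List[int]:
    """Return the byte offsets of an ASCII character in UTF-8 encoded bytes.

    Hypothetical helper for illustration only.
    """
    target = ord(needle)
    if target >= 0x80:
        raise ValueError("needle must be a single ASCII character")
    # A plain byte comparison is enough: no byte of a multi-byte
    # sequence can equal an ASCII byte, because its high bit is set.
    return [i for i, b in enumerate(haystack) if b == target]


# Sample text mixing ASCII, 2-byte and 3-byte characters, plus a
# combining accent (U+0301) at the end.
text = "a/\u00e9/\u65e5\u672c/e\u0301"
data = text.encode("utf-8")

# Every byte of every multi-byte sequence has the most significant bit set.
for ch in text:
    encoded = ch.encode("utf-8")
    if len(encoded) > 1:
        assert all(b & 0x80 for b in encoded)

# The byte-level search finds exactly the real slashes, nothing else.
slash_offsets = find_ascii(data, "/")
assert len(slash_offsets) == text.count("/")

# Code points do not save you: "e" + combining acute accent is one
# user-perceived character (grapheme cluster) but two code points.
cluster = "e\u0301"
assert len(cluster) == 2
stripped = cluster[:1]  # slices off the accent, leaving a bare "e"
```

The last two lines are the catch: slicing by code point silently drops the accent, so working on code points instead of bytes only moves the problem up a level.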