    In order to maintain backwards compatibility with existing
    documents, the first 256 characters of Unicode are identical to
    ISO 8859-1 (Latin 1).
This isn't true in a useful sense. It is true in Unicode code point space [1], but it can't hold in any specific encoding of Unicode, because Latin-1 uses all 256 byte values. In UTF-8, for example, the overlap is exact only for bytes 0-127 (7-bit ASCII).

(Though maybe this means you could convert latin1 to utf-16 by interleaving null bytes with the latin1 bytes?)

[1] https://en.m.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_...
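To make that concrete, here's a quick check (a sketch in TypeScript, assuming Node's Buffer): only code points 0x00-0x7F encode to the same single byte in UTF-8 as in Latin-1, while 0x80-0xFF become two bytes in UTF-8.

    for (let cp = 0; cp < 256; cp++) {
      const ch = String.fromCharCode(cp);
      const sameBytes = Buffer.from(ch, "utf8").equals(Buffer.from(ch, "latin1"));
      console.assert(sameBytes === (cp < 0x80), `differs at code point ${cp}`);
    }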




> (Though maybe this means you could convert latin1 to utf-16 by interleaving null bytes with the latin1 bytes?)

Yes. In fact, things like JS JITs end up storing strings as either UTF-16 strings or Latin1 strings internally to take advantage of this fact.
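A minimal sketch of the interleaving idea, again assuming Node's Buffer: UTF-16LE puts the low byte first, so each Latin-1 byte is simply followed by 0x00.

    const latin1 = Buffer.from([...Array(256).keys()]);   // every Latin-1 byte value
    const utf16le = Buffer.alloc(latin1.length * 2);      // zero-filled
    for (let i = 0; i < latin1.length; i++) {
      utf16le[i * 2] = latin1[i];                         // high byte stays 0x00
    }
    console.assert(utf16le.equals(Buffer.from(latin1.toString("latin1"), "utf16le")));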


JavaScript uses (used, until a recent version) UCS-2, not UTF-16!


Most JavaScript implementations have a bunch of different string types used internally, depending on what you're doing with the string. In-memory representation has no bearing on the API visible to the outside world.

And while the JavaScript APIs only allow you to deal with UCS-2, the string contents themselves are, in fact, usually UTF-16.
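A small illustration of that split, runnable in Node: the string API counts and indexes 16-bit code units, UCS-2 style, while codePointAt sees through surrogate pairs.

    const s = "😀";                                // U+1F600, outside the BMP
    console.log(s.length);                         // 2: counted in 16-bit code units
    console.log(s.charCodeAt(0).toString(16));     // "d83d": a lone surrogate, the UCS-2-style view
    console.log(s.codePointAt(0)!.toString(16));   // "1f600": the full code point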


It is useful for hacks in languages or APIs that don't distinguish between uint8[] and Unicode string types. When you need to handle binary data, you can build a string out of code points 0-255 and pass it to I/O APIs as Latin-1 to produce exactly the byte sequence you want.
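A rough sketch of that hack with Node's Buffer (the string-only API in the middle is hypothetical):

    const payload = Buffer.from([0x00, 0xff, 0x80, 0x41]);   // arbitrary binary data
    const asText = payload.toString("latin1");                // lossless: one code point per byte

    // ...hand `asText` to some API that only accepts strings...

    const recovered = Buffer.from(asText, "latin1");          // byte-for-byte identical
    console.assert(recovered.equals(payload));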

And of course it also works in the other direction. To read binary data into Unicode strings safely, decode it as Latin-1 instead of UTF-8 and you won't hit validation errors, since every byte sequence is valid Latin-1 while not every byte sequence is valid UTF-8.
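The reverse direction, sketched the same way (TextDecoder in fatal mode stands in for a strict UTF-8 decoder):

    const blob = Buffer.from([0xff, 0xfe, 0xc3, 0x28]);        // not valid UTF-8
    try {
      new TextDecoder("utf-8", { fatal: true }).decode(blob);  // throws on invalid UTF-8
    } catch {
      // UTF-8 validation rejects these bytes
    }
    const text = blob.toString("latin1");                       // never fails
    console.assert(Buffer.from(text, "latin1").equals(blob));   // and round-trips losslessly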



