
Impossible, because the decision was already made by Turkish encodings, which forced Unicode to pick only one option (round-trip compatibility with legacy encodings) out of the possible trade-offs.


What were the other possible trade-offs? I don't really see how a lack of round-trip compatibility would be worse than what we have now. It breaks the whole idea of Unicode code points, and for what?
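To make "what we have now" concrete: because U+0069 is shared between English and Turkish, the default (locale-free) Unicode case mappings can't do the right thing for both languages. A quick Python 3 sketch:

    # Default Unicode case mappings, as implemented by Python's str:
    print('I'.lower())       # 'i'  - right for English, wrong for Turkish ('ı')
    print('ı'.upper())       # 'I'  - U+0131 DOTLESS I uppercases to plain I
    print('İ'.lower())       # 'i' + U+0307 COMBINING DOT ABOVE
    print(len('İ'.lower()))  # 2 - lowercasing a single İ yields two code points
    # Turkish-correct casing needs locale-aware tailoring (e.g. via ICU);
    # it cannot be recovered from the code points alone.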


Actually it reflects the idea of Unicode code points correctly. They are meant to represent graphemes, not semantics.

This isn't honored; we have many Unicode code points that look identical by definition and differ only in their secret semantics, but all of those points are in violation of the principles of Unicode. The Turkish 'i' is doing the right thing.


> Actually it reflects the idea of Unicode code points correctly. They are meant to represent graphemes, not semantics.

Why, then, do we have lots of invisible characters that are intended essentially as semantic markers (e.g., the zero-width space)?


How do you define "look identical" outside of fonts, which, as I understand it, were deliberately excluded from Unicode's consideration?

E.g. Cyrillic "а" looks the same as Latin "a" most of the time, they both are distant descendants of the Phoenician 𐤀, but they are two different letters now. I'm very glad they have different code points, it would be a nightmare otherwise.
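A minimal Python 3 check makes the distinction visible:

    # Most fonts render these identically, but they are distinct code points:
    print('a' == 'а')                    # False
    print(hex(ord('a')), hex(ord('а')))  # 0x61 0x430
    # U+0061 LATIN SMALL LETTER A vs. U+0430 CYRILLIC SMALL LETTER A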


And they would call it Greco-Roman unification, similar to Han unification.


No, Han unification is a completely different thing. Unicode Han unification represents distinct glyphs as the same code point - the intent is that you choose the glyph you want by setting a font (!). This has been acknowledged as a mistake.

Having distinct code points for Latin capital letter A, Greek capital letter A, and Cyrillic capital letter A is the reverse, separate code points for glyphs that are identical by definition. That's also a mistake.

(Although it might be required by Unicode's other principle of being fully compatible with a wide variety of older encodings. There are many characters, like 囍, that don't qualify to have a code point, but that have one anyway because they're present in an encoding that Unicode commits to represent.)


No, that’s the opposite of how it’s supposed to work.


How would a separate code point break round-tripping, specifically?


The legacy Turkish encoding used ASCII i for the dotted lower-case letter, but a character in the 128–255 range for İ. Remember that not all documents are monolingual, so you might have a document with, e.g., both English and Turkish text; in the legacy code page both the English and the Turkish letter would use the same i.
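For concreteness, here's how that looks with ISO 8859-9 (Latin-5), a common legacy Turkish code page, sketched in Python 3:

    # Plain i/I are shared ASCII; only the dotless/dotted counterparts
    # get Turkish-specific high bytes.
    for ch in 'iIıİ':
        print(ch, hex(ch.encode('iso-8859-9')[0]))
    # i 0x69  (ASCII, shared with English text)
    # I 0x49  (ASCII, shared with English text)
    # ı 0xfd  (Turkish-specific)
    # İ 0xdd  (Turkish-specific)

    # Legacy -> Unicode -> legacy round-trips byte-for-byte:
    data = bytes(range(256))
    assert data.decode('iso-8859-9').encode('iso-8859-9') == data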


That didn't answer the question: why would a separate code point for the Turkish lower-case dotted i break round-tripping?


Because just because something is in Turkish doesn’t mean it doesn’t also include non-Turkish text. You end up with weird edge cases when translating mixed text back and forth, since the letter would be a single character in legacy Turkish 8-bit text but two distinct code points in Unicode. So under your scheme, Unicode text containing “Kırgızistan (English: Kyrgyzstan)” would, after a Unicode-legacy-Unicode round trip, come back with the i in the English part encoded as the Turkish dotted i.
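A hypothetical Python 3 sketch of that failure mode, using the private-use code point U+E000 as a stand-in for the proposed separate letter (the code point and the replace-based codec here are invented for illustration):

    TURKISH_I = '\ue000'  # hypothetical "turkish small letter dotted i"

    def legacy_encode(s: str) -> bytes:
        # Both letters must share byte 0x69 in the one-byte code page.
        return s.replace(TURKISH_I, 'i').encode('iso-8859-9')

    def legacy_decode(b: bytes) -> str:
        # The decoder must pick ONE owner for byte 0x69; in a Turkish
        # code page that would plausibly be the Turkish letter.
        return b.decode('iso-8859-9').replace('i', TURKISH_I)

    original = 'Kırgız' + TURKISH_I + 'stan (English: Kyrgyzstan)'
    assert legacy_decode(legacy_encode(original)) != original
    # The i in "English" comes back as TURKISH_I: the round trip is lossy.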


If "turkish lower case dotted i" would be separate codepoint, that still wouldn't cause ambiguity like you describe. It would just mean that "U+0069 latin small letter i" would not be (directly) transcodable to the legacy Turkish character set. But that wouldn't really be any different from other similar homoglyph situations, for example "U+0430 cyrillic small letter a" does not transcode to ASCII and that is business as usual. U+0069 not being transcodeable to some legacy encoding is not really a round-tripping problem, vast majority of Unicode codepoints are not transcodable to any single legacy encoding. Round-trip compatibility is really only concern when going from legacy-unicode-legacy; it is naturally expected that not all strings will be able to roundtrip unicode-legacy-unicode.


EXCEPT that the legacy Cyrillic codepages had separate codepoints for Latin a and Cyrillic а. You’re also making invalid assumptions about round-trip preservation. The idea is that if a string is encodable in the legacy codepage, you should be able to make the roundtrip. Yes, you can’t roundtrip ⨋ to most legacy codepages, but that’s not the brief.


> The idea is that if a string is encodable in the legacy codepage, you should be able to make the roundtrip.

But which strings are encodable in a legacy codepage depends on what we define as encodable! If we had a separate codepoint for "turkish small letter i", then we could simply have defined that "latin small letter i" is not encodable in the legacy Turkish codepage, the same way that "cyrillic small letter a" is not. "turkish small letter i" and "latin small letter i" would be just another normal homoglyph pair, the same as "cyrillic small letter a" and "latin small letter a".
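A hypothetical Python 3 sketch of that alternative rule set, again with private-use U+E000 standing in for the proposed letter (the mapping table and error handling are invented for illustration):

    TURKISH_I = '\ue000'  # hypothetical "turkish small letter i"

    ENC = {TURKISH_I: 0x69, 'ı': 0xFD, 'İ': 0xDD}  # partial table
    DEC = {v: k for k, v in ENC.items()}

    def encode_char(ch: str) -> int:
        if ch in ENC:
            return ENC[ch]
        if ch == 'i':  # U+0069 defined as NOT encodable, like U+0430
            raise UnicodeEncodeError('turkish-legacy', ch, 0, 1,
                                     'U+0069 not in this code page')
        return ord(ch)  # pretend the rest is ASCII-compatible

    # Legacy -> Unicode -> legacy still round-trips for everything the
    # code page can represent; only the definition of "encodable" moved.
    for byte in (0x69, 0xFD, 0xDD):
        assert encode_char(DEC[byte]) == byte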


I don't know about the round-tripping of anything, but suddenly having an entire nation's keyboards outputting UTF-8 into outdated national systems probably designed for Latin-1 seems like a tough sell just to fix this issue.


Yep I'm aware



