
IMO, it was a mistake for Unicode to provide multiple ways to represent 100% identical-looking characters. After all, ASCII doesn't have separate "c"s for "hard c" and "soft c".


The problem in the linked article barely scratches the surface of the issue. You _cannot_ compare Unicode strings for equality (or sort them) without locale information. A simple example: to a Swedish or Finnish speaker, o and ö are completely different letters, as distinct as a is from b, and ö sorts at the very end of the alphabet. A user who searches for ö will definitely not expect words with o to appear. However, an American user who searches for "cooperation", when your text data happens to include writing by people who write the way The New Yorker does, would probably expect to find "coöperation".
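
A rough Python sketch of the sorting half of this (assuming a system where the sv_SE.UTF-8 locale is installed; the locale name varies by platform, and setlocale will fail if it's missing):

    import locale

    words = ["foo", "zebra", "för"]

    # Naive code-point sort puts "för" between "foo" and "zebra":
    print(sorted(words))                      # ['foo', 'för', 'zebra']

    # Swedish collation treats ö as a letter at the end of the alphabet:
    locale.setlocale(locale.LC_COLLATE, "sv_SE.UTF-8")
    print(sorted(words, key=locale.strxfrm))  # ['foo', 'zebra', 'för']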

This rabbit hole goes very, very deep. In Dutch, the digraph IJ is a single letter. In Swedish, V and W are considered the same letter for most purposes (watch out, people who are using the MySQL default utf8_swedish_ci collation). The Turkish dotless i (ı) in its lowercase form uppercases to a normal I, which then does _not_ lowercase back to a dotless i if you're just lowercasing naively without locale info. In Danish, the digraph aa is an alternate way of writing å (which sorts near the end of the alphabet). Hungarian has a whole bunch of bizarre di- and trigraphs IIRC. Try looking up the standard Unicode algorithm for doing case insensitive equality comparison by the way; it's one heck of a thing.
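
The Turkish round trip is easy to see in Python, whose str.upper()/str.lower() use the locale-independent default Unicode case mappings, i.e. exactly the "naive" behaviour described above:

    s = "ı"                 # U+0131 LATIN SMALL LETTER DOTLESS I
    up = s.upper()          # default mapping gives a plain "I" (U+0049)
    print(up.lower())       # "i" -- the dot came back
    print(up.lower() == s)  # False: the round trip is lost without locale info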

People somehow think that issues like these are only an issue with Han unification or something, but it's all over European languages as well. Comparing strings for equality is a deeply political issue.


> Hungarian has a whole bunch of bizarre di- and trigraphs IIRC

Actually, there is only one trigraph, "dzs", which is almost exclusively used for representing the "j" sound borrowed from English and other languages; for example, "Jennifer" is "Dzsennifer" in Hungarian, and "jam" is "dzsem" in the same way.

The trigraph and the digraphs actually make sense, at least to a native speaker, since they really do mark sounds similar to what you would expect from combining the component letters. These letters don't cause too many issues in search, in my opinion, but hyphenation is a form of art (see "magyar.ldf" for LaTeX as an example).

To complicate the situation even further, we have the a/á, e/é, i/í, o/ó/ö/ő and u/ú/ü/ű letters, all of which are considered separate letters, and you can easily type them on a Hungarian desktop keyboard. On the other hand, mobile virtual keyboards usually show a QWERTY/QWERTZ layout where you can only find the "long vowels" by long-pressing their "short" counterparts, so when you are targeting mobile users you may want to differentiate between "o" and "ö", but not between "o" and "ó", nor between "ö" and "ő".


That doesn't seem that strange. Russian and, I think, Ukrainian (and maybe some other languages that use Cyrillic) have Дж as the closest thing to the English J. Д is d and ж is transliterated as zh. Sometimes names are transliterated with dzh instead of j.


> an American user who searches for "cooperation", when your text data happens to include writing by people who write the way The New Yorker does, would probably expect to find "coöperation".

Unicode shouldn't be responsible for making such searches work, just like it's not responsible for making searches for "analyze" match text that says "analyse".


My point was simply that the fact that there are multiple representations of characters that look the same is just a tiny part of the complexity involved in making text behave like users want. It's not that uncommon for people to think that "oh I'll just normalize the string and that'll solve my problems", but normalization is just a small part of quote-unquote "proper" Unicode handling.

The "proper" way of sorting and comparing Unicode strings is part of the standard; it's called the Unicode Collation Algorithm (https://unicode.org/reports/tr10/). It is unwieldy to say the least, but it is tuneable (see the "Tailoring" part) and can be used to implement o/ö equivalence if desired. I think it's great that this algorithm (and its accompanying Common Locale Data Repository) is in the standard and maintained by the consortium, because I definitely wouldn't want to maintain those myself.


Unicode was never designed for ease of use or efficiency of encoding, but for ease of adoption. And that meant that it had to support lossless round trips from any legacy format to Unicode and back to the legacy format, because otherwise no decision maker would have allowed important systems to start a transition to Unicode.

So now we are saddled with an encoding that has to be bug compatible with any encoding ever designed before.


If you take a peek at an extended ASCII table (like the one at https://www.ascii-code.com/), you'll notice that 0xC5 specifies a precomposed capital A with ring above. It predates Unicode. Accepting that that's the case, and acknowledging that forward compatibility from ASCII to Unicode is a good thing (so we don't have any more encodings, we're just extending the most popular one), and understanding that you're going to have the ring-above diacritic in Unicode anyway... you kind of just end up with both representations.
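
And both representations really do coexist; normalization is the bridge between them. A minimal Python illustration using only the stdlib:

    import unicodedata

    composed   = "\u00C5"        # Å as one precomposed code point (same value as Latin-1 0xC5)
    decomposed = "\u0041\u030A"  # "A" + COMBINING RING ABOVE

    print(composed == decomposed)                                # False
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True
    print(unicodedata.normalize("NFD", composed) == decomposed)  # True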


Everything can just be pre-composed; Unicode doesn't need composing characters.

There's history here, with Unicode originally having just 65k characters, and hindsight is always 20/20, but I do wish there was a move towards deprecating all of this in favour of always using pre-composed.

Also: what you linked isn't "ASCII", and "extended ASCII" doesn't really mean anything. ASCII is a 7-bit character set with 128 characters, and there are dozens, if not hundreds, of 8-bit character sets with 256 characters. Both CP-1252 and ISO-8859-1 saw wide use for Latin alphabet text, but others saw wide use for text in other scripts. So if you give me a document and tell me "this is extended ASCII" then I still don't know how to read it and will have to trial-and-error it.

I don't think Unicode after U+007F is compatible with any specific character set? To be honest I never checked, and I don't see in what case that would be convenient. UTF-8 is only compatible with ASCII, not any specific "extended ASCII".
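
To illustrate why "extended ASCII" underspecifies things: the same byte decodes to different characters depending on which 8-bit code page you guess, and UTF-8 only agrees with any of them in the 7-bit range. A quick Python check:

    b = b"\xe4"
    print(b.decode("iso8859-1"))  # ä  (Latin-1, often what people mean)
    print(b.decode("iso8859-7"))  # δ  (Greek)
    print(b.decode("cp1251"))     # д  (Cyrillic)

    # UTF-8 is only byte-compatible with 7-bit ASCII:
    print("ä".encode("utf-8"))    # b'\xc3\xa4', not b'\xe4'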


In my opinion, only the reverse could be true, i.e. that Unicode does not need pre-composed characters because everything can be written with composing characters.

The pre-composed characters are necessary only for backwards compatibility.

It is completely unrealistic to expect that Unicode will ever provide all the pre-composed characters that have ever been used in the past or which will ever be desired in the future.

There are pre-composed characters that do not exist in Unicode because they have been used very seldom. Some of them may even be unused in any language right now, but they were used in some languages in the past, e.g. in the 19th century, and were then replaced by orthographic reforms. Nevertheless, when you digitize and OCR an old book, you may want to keep its text as it was written originally, so you want the missing composed characters.

Another case that I have encountered where I needed composed characters not existing in Unicode was when choosing a more consistent transliteration for languages that do not use the Latin alphabet. Many such languages use quite bad transliteration systems, precisely because whoever designed them has attempted to use only whatever restricted character set was available at that time. By choosing appropriate composing characters it is possible to design improved transliterations.
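
A concrete illustration of that last point (my own example, not from any standard transliteration): as far as I know there is no precomposed "m with tilde", but a base letter plus a combining mark works fine, and NFC simply leaves it decomposed:

    import unicodedata

    m_tilde = "m\u0303"  # "m" + COMBINING TILDE; no precomposed code point exists, to my knowledge
    print(m_tilde)                                           # m̃
    print(unicodedata.normalize("NFC", m_tilde) == m_tilde)  # True: nothing to compose it into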


> It is completely unrealistic to expect that Unicode will ever provide all the pre-composed characters that have ever been used in the past or which will ever be desired in the future.

I agree it's unlikely this will ever happen, but as far as I know there aren't really any serious technical barriers, and from purely a technical point of view it could be done if there was a desire to do so. There are plenty of rarely used codepoints in Unicode already, and while adding more is certainly an inconvenience, the status quo is also inconvenient, which is why we have one of those "wow, I just discovered Unicode normalisation!" (and variants thereof) posts on the front-page here every few months.

Your last paragraph can be summarized as "it makes it easier to innovate with new diacritics". This is actually an interesting point – in the past anyone could "just" write a new character and it may or may not get any uptake, just as anyone can "just" coin a new word. I've bemoaned this inability to innovate before. That is not inherent to Unicode but to computerized alphabets in general, and the fact that composing characters alleviate at least some of that is probably the best reason I've heard for favouring composing characters.

I'm actually also okay with just using composing characters and deprecating the pre-composed forms. Overall I feel that pre-composed is probably better, partly because that's what most text currently uses and partly because it's simpler, but that's the lesser issue – the more important one is that it would be nice to move towards "one obviously canonical" form that everything uses.


There is also another reason that makes the composing characters very convenient right now.

Many of the existing typefaces, even some that are quite expensive, do not contain all the pre-composed characters defined by Unicode, especially when those characters have been added in more recent Unicode versions or when they are used only in languages that are not Western European.

The missing characters can be synthesized with composing characters. The alternatives, which are to use a font editor to add characters to the typeface or to buy another more complete and more expensive version of the typeface, are not acceptable or even possible for most users.

Therefore the fact that Unicode has defined composing characters is quite useful in such cases.


Every avenue opens inconveniences for someone, but I'd rather choose the relatively rare inconvenience of font designers over the relatively common inconvenience of every piece of software ever written. Especially because this can be automated in font design tools, or even font formats itself.


For roundtripping e.g. https://en.wikipedia.org/wiki/VSCII you do need both composing characters and precomposed characters.


> I don't think Unicode after U+007F is compatible with any specific character set?

The ‘early’ Unicode alphabetic code blocks came from ISO 8859 encodings¹, e.g. the Unicode Cyrillic block follows ISO 8859-5, the Greek and Coptic block follows ISO 8859-7, etc.

¹ https://en.wikipedia.org/wiki/ISO/IEC_8859


> Unicode doesn't need composing characters

But it does, IIRC, for both Bengali and Telugu.


Only because they chose to do it like that. It doesn't need to.


Considering that Unicode did not invent combining diacritics, it follows that simple compatibility with existing encodings demanded it. Now that Unicode's goals have expanded beyond simply representing what already exists, precomposed characters would be too limiting.


It might not be ludicrous to suggest that the English letter "a" and the Russian letter "а" should be a single entity, if you don't think about it very hard.

But the English letter "c" and the Russian letter "с" are completely different characters, even if at a glance they look the same - they make completely different sounds, and are different letters. It would be ludicrous to suggest that they should share a single symbol.
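
And Unicode does keep them apart, which you can always verify when two "identical-looking" strings refuse to compare equal. A small Python check:

    import unicodedata

    latin = "c"          # U+0063
    cyrillic = "\u0441"  # renders the same in many fonts

    print(latin == cyrillic)           # False
    print(unicodedata.name(latin))     # LATIN SMALL LETTER C
    print(unicodedata.name(cyrillic))  # CYRILLIC SMALL LETTER ES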


If they're always supposed to look the same, then Unicode should encode them the same, even if they mean different things in different contexts.


Two counterpoints:

1. Unicode isn't a method of storing pixel or graphic representations of writing systems; it's meant to store text, regardless of how similar certain characters look.

2. What do you do about screen readers & the like? If it encounters something that looks like a little half-moon glyph that's in the middle of a sentence about foreign alphabets that reads "Por ejemplo, la letra 'c'", should it pronounce it as the English "see" or as Russian "ess"?


> 1. Unicode isn't a method of storing pixel or graphic representations of writing systems; it's meant to store text, regardless of how similar certain characters look.

I'm not sure that that is really possible without something way bigger or more complicated than Unicode. Consider the string "fart". In English that means to emit gas from the anus. In Swedish it means speed. Does that mean Unicode should have separate "f", "a", "r", and "t" for English and Swedish?

> 2. What do you do about screen readers & the like? If it encounters something that looks like a little half-moon glyph that's in the middle of a sentence about foreign alphabets that reads "Por ejemplo, la letra 'c'", should it pronounce it as the English "see" or as Russian "ess"?

What would a human do if that was in a book and they were reading it aloud for a blind friend?


For 8 minutes of this (among other translation mistakes), you've reminded me of Peggy Hill's understanding of Spanish in the cartoon King of the Hill - https://www.youtube.com/watch?v=g62A1vkSxB0

(IIRC, she learned the language entirely from books so has no idea of the correct pronunciation and thinks she's fluent)


1. "graphic representation of writing systems" and "text" mean the same thing to me. Do you mean text as spoken?

2. I think the pronunciation should not be encoded into the text representation on a general scale. You would need different encodings for "though" and "through" in English alone. Your example leaves the meaning open, even if being read as text. If I was the editor, and the distinction was important, I'd change it to "For example, the Cyrillic letter 'c'".

I understand that Unicode provides different code points for same-looking characters, mostly because of history, where these characters came from different code pages in language-specific encodings.


I mean text as in the platonic ideal of "c" and "с". Just because they look the same, does not make them the same character. If we're going to be encoding characters that happen to have pixel-identical renderings in certain fonts, the next logical step is to encode identical letters that look different in different fonts or writing styles as separate code points as well - for example, the English letter "g" is a fucking orthographic nightmare.


Imagine if, say, English people normally wrote an open ‘g’ and French normally wrote a looped ‘g’, and you have the essence of the Han Unification debates.


What about Latin "k" and Cyrillic "к"? Do they look the same in your font of choice? Should they?


Heh.

“Cyrillic” isn't the same everywhere. Bulgarian fonts differ from Russian fonts, some letters are “latinized”, some borrow from handwritten forms:

https://bg.wikipedia.org/wiki/Българска_кирилица

The colored example shows a third alternative for Serbian cursive.

So without some external lang metadata we don't even know how your message should look.

However, Russian “Кк” traditionally is different from Latin “Kk” in most recognized families. In the '90s, font designers regularly thrashed ad-hoc font localization attempts which ignored the legacy of the pre-digital era, and blindly copied the Latin capital into both the capital and minuscule forms.


Those look different, so I have no issue with them being different code points.


But they don't "fundamentally" look different, it's font dependent(there are fonts where they look the same), just like the same Latin k will look different depending on a font, so you need a better rule to make your own simple Unicode


He's probably the guy who decided to add fraktur/double-struck/sans-serif/small-caps/bold/script/etc variants of Latin letters to Unicode because, you know, they look different! so they should get their own special code points.

It was a joke, by the way.


What about Cyrillic T: Т? It looks the same in uppercase (but not lowercase, and in italic/cursive, which I believe is not encoded in Unicode, it looks sort of like an m).


The capitalized "K" and "К" look exactly the same though.


When I look at your post, in "K", the lower diagonal line branches off of the upper diagonal line, slightly breaking horizontal symmetry, but "К" is horizontally symmetrical.


The latter glyph has a little bend on the top diagonal part


Not in my font!


C vs С is so strange to me. They look the same in upper and lower case, in italic, in cursive, and are even at the same location on keyboards. It's not like W is a different character in Slavic languages that use Latin script, even though the sound is completely different from English.


I was thinking of Russian letter г and Ukrainian letter г.

Or the whole eh/ye flip across English/Ukrainian/Russian: Eh/е/э, Ye/є/е.

г/е are unified and that's probably as it should be but there are downsides.


Maybe, but then you can no longer round trip with other encodings, which seems worse to me.



