It might not be ludicrous to suggest that the English letter "a" and the Russian letter "а" should be a single entity, if you don't think about it very hard.

But the English letter "c" and the Russian letter "с" are completely different characters, even if at a glance they look the same - they make completely different sounds, and are different letters. It would be ludicrous to suggest that they should share a single symbol.
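For reference, the look-alike pairs mentioned here already sit at different code points with different names. A minimal Python sketch using only the standard-library `unicodedata` module (the `describe` helper and the `PAIRS` list are just illustrative names):

```python
import unicodedata

# Look-alike pairs from the comment above: (Latin, Cyrillic).
PAIRS = [("a", "\u0430"), ("c", "\u0441")]

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

for latin, cyrillic in PAIRS:
    # The glyphs may render identically, but the strings never compare equal.
    print(describe(latin), "|", describe(cyrillic), "| equal:", latin == cyrillic)
```

The Cyrillic "с" is named for its sound (ES), not for its shape.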



If they're always supposed to look the same, then Unicode should encode them the same, even if they mean different things in different contexts.


Two counterpoints:

1. Unicode isn't a method of storing pixel or graphic representations of writing systems; it's meant to store text, regardless of how similar certain characters look.

2. What do you do about screen readers & the like? If it encounters something that looks like a little half-moon glyph that's in the middle of a sentence about foreign alphabets that reads "Por ejemplo, la letra 'c'", should it pronounce it as the English "see" or as Russian "ess"?
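A rough sketch of how point 2 plays out when the two letters are separate code points: a reader can pick the spoken name from the character alone. This is only an illustration. `SPOKEN_NAME` is a made-up two-entry table, and taking the first word of the Unicode character name is a heuristic stand-in for the real Script property, which Python's standard library does not expose.

```python
import unicodedata

# Hypothetical letter-name table; a real screen reader would rely on
# full per-language pronunciation data.
SPOKEN_NAME = {
    ("LATIN", "c"): "see",
    ("CYRILLIC", "\u0441"): "ess",
}

def script_of(ch: str) -> str:
    """Heuristic script guess: the first word of the Unicode character name."""
    return unicodedata.name(ch).split()[0]

def speak(ch: str) -> str:
    """Look up how to read a single letter aloud, or '?' if unknown."""
    return SPOKEN_NAME.get((script_of(ch), ch), "?")

print(speak("c"), speak("\u0441"))
```

If both letters shared one code point, `speak` would have nothing to go on except the surrounding sentence.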


> 1. Unicode isn't a method of storing pixel or graphic representations of writing systems; it's meant to store text, regardless of how similar certain characters look.

I'm not sure that that is really possible without something way bigger or more complicated than Unicode. Consider the string "fart". In English that means to emit gas from the anus. In Swedish it means speed. Does that mean Unicode should have separate "f", "a", "r", and "t" for English and Swedish?

> 2. What do you do about screen readers & the like? If it encounters something that looks like a little half-moon glyph that's in the middle of a sentence about foreign alphabets that reads "Por ejemplo, la letra 'c'", should it pronounce it as the English "see" or as Russian "ess"?

What would a human do if that was in a book and they were reading it aloud for a blind friend?


For 8 minutes of this (among other translation mistakes), you've reminded me of Peggy Hill's understanding of Spanish in the cartoon King of the Hill - https://www.youtube.com/watch?v=g62A1vkSxB0

(IIRC, she learned the language entirely from books so has no idea of the correct pronunciation and thinks she's fluent)


1. "graphic representation of writing systems" and "text" mean the same thing to me. Do you mean text as spoken?

2. I think the pronunciation should not be encoded into the text representation on a general scale. You would need different encodings for "though" and "through" in English alone. Your example leaves the meaning open, even if being read as text. If I were the editor, and the distinction was important, I'd change it to "For example, the Cyrillic letter 'c'".

I understand that Unicode provides different code points for same-looking characters, mostly because of history, where these characters came from different code sheets in language-specific encodings.
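A small sketch of that history argument, assuming KOI8-R and Windows-1251 as examples of the language-specific encodings meant here (the comment does not name any). Both keep ASCII "c" and Cyrillic "с" at different byte values, so a lossless round trip through Unicode needs two code points:

```python
LATIN_C = "c"
CYRILLIC_ES = "\u0441"
CODECS = ("koi8_r", "cp1251")  # assumed examples of legacy code pages

for codec in CODECS:
    latin_byte = LATIN_C.encode(codec)
    cyrillic_byte = CYRILLIC_ES.encode(codec)
    # The legacy encoding already distinguishes the two letters ...
    assert latin_byte != cyrillic_byte
    # ... and decoding gives back the Cyrillic letter, not the Latin one.
    assert cyrillic_byte.decode(codec) == CYRILLIC_ES
    print(codec, latin_byte.hex(), cyrillic_byte.hex())
```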


I mean text as in the platonic ideal of "c" and "с". Just because they look the same, does not make them the same character. If we're going to be encoding characters that happen to have pixel-identical renderings in certain fonts, the next logical step is to encode identical letters that look different in different fonts or writing styles as separate code points as well - for example, the English letter "g" is a fucking orthographic nightmare.


Imagine if, say, English people normally wrote an open ‘g’ and French normally wrote a looped ‘g’, and you have the essence of the Han Unification debates.


What about Latin "k" and Cyrillic "к"? Do they look the same in your font of choice? Should they?


Heh.

“Cyrillic” isn't the same everywhere. Bulgarian fonts differ from Russian fonts, some letters are “latinized”, some borrow from handwritten forms:

https://bg.wikipedia.org/wiki/Българска_кирилица

Colored example has the third alternative for Serbian cursive.

So without some external lang metadata we don't even know how your message should look.

However, Russian “Кк” traditionally is different from Latin “Kk” in most recognized families. In the '90s, font designers regularly thrashed ad-hoc font localization attempts which ignored the legacy of the pre-digital era and blindly copied the Latin capital into both the capital and minuscule forms.


Those look different, so I have no issue with them being different code points.


But they don't "fundamentally" look different; it's font-dependent (there are fonts where they look the same), just as the same Latin k will look different depending on the font. So you need a better rule to make your own simple Unicode.


He's probably the guy who decided to add fraktur/double-struck/sans-serif/small-caps/bold/script/etc. variants of Latin letters to Unicode because, you know, they look different! So they should get their own special code points.

It was a joke, by the way.
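Joke aside, those styled variants do exist (the Mathematical Alphanumeric Symbols), and Unicode treats them differently from the Latin/Cyrillic look-alikes: they carry compatibility decompositions, so NFKC normalization folds them back to plain letters, while Cyrillic "с" is left alone. A quick Python check, with `fold` as an illustrative helper name:

```python
import unicodedata

# Look the characters up by name rather than hard-coding code points.
fraktur_f = unicodedata.lookup("MATHEMATICAL FRAKTUR SMALL F")
bold_a = unicodedata.lookup("MATHEMATICAL BOLD CAPITAL A")
cyrillic_es = "\u0441"

def fold(s: str) -> str:
    """Apply NFKC, which replaces compatibility variants with base letters."""
    return unicodedata.normalize("NFKC", s)

print(fold(fraktur_f), fold(bold_a))   # styled letters collapse to plain Latin
print(fold(cyrillic_es) == "c")        # Cyrillic es is not folded into Latin c
```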


What about Cyrillic T: Т? It looks the same in uppercase, but not in lowercase. And in italic/cursive, which I believe is not encoded in Unicode, it looks sort of like an m.


The capitalized "K" and "К" look exactly the same though.


When I look at your post, in "K", the lower diagonal line branches off of the upper diagonal line, slightly breaking horizontal symmetry, but "К" is horizontally symmetrical.


The latter glyph has a little bend on the top diagonal part


Not in my font!


C vs С is so strange to me. They look the same in upper and lower case, in italic, and in cursive, and are even at the same location on keyboards. It's not like W is a different character in Slavic languages that use the Latin script, even though the sound is completely different from the English one.


I was thinking of Russian letter г and Ukrainian letter г.

Or the whole eh/ye flip (En/Uk/Ru): Eh/е/э, Ye/є/е.

г/е are unified and that's probably as it should be but there are downsides.



