
Can we please talk about Unicode without the myth of Han Unification being bad somehow? The problem here is exactly the lack of unification in Roman alphabets!


> Can we please talk about Unicode without the myth of Han Unification being bad somehow?

It's not a myth, as anyone living in Japan knows, and the "just use Unicode, all you need is Unicode" dogma is really harmful; a lot of "international" software has become significantly worse for Japanese users since it took hold.

> The problem here is exactly the lack of unification in Roman alphabets!

Problems caused by failing to unify characters that look the same do not mean it was a good idea to unify characters that look different!


> "just use Unicode, all you need is Unicode" dogma is really harmful; a lot of "international" software has become significantly worse for Japanese users since it took hold.

The alternative would be that the software used Shift_JIS with a Japanese font. If the software used a Japanese font for Japanese it wouldn't need metadata anyway.

There really isn't a problem with Han unification as long as you always switch to a font appropriate for your language; you don't need to configure metadata. If you don't, you are always going to run into missing-codepoint problems.

In cases where the system or user configures the font, properly using Unicode is still easier than configuring alternate encodings for multiple languages.


> The alternative would be that the software used Shift_JIS with a Japanese font.

As far as I know all Shift_JIS fonts are Japanese; you would have to be wilfully perverse to make one that wasn't.

> If the software used a Japanese font for Japanese it wouldn't need metadata anyway.

If it just uses the system default font for that encoding, as almost all software does, then it will also behave correctly.

> There really isn't a problem with Han unification as long as you always switch to a font appropriate for your language

Right. But approximately no software does that, because if you don't do it then your software will work fine everywhere other than Japan, and even in Japan it will kind-of-sort-of work to the point that a non-native probably won't notice a problem.

> In cases where the system or user configures the font, properly using Unicode is still easier than configuring alternate encodings for multiple languages.

I'm not convinced it is. Configuring your software to use the right font on a Unicode system is, as far as I can see, at least as hard as configuring your software to use the right encoding on a non-Unicode system. It just fails less obviously when you don't, particularly outside Japan.


> Right. But approximately no software does that, because if you don't do it then your software will work fine everywhere other than Japan, and even in Japan it will kind-of-sort-of work to the point that a non-native probably won't notice a problem.

Most games that I know of that target CJK + English (and are either CJK-developed, or have a local publisher based in East Asia) do indeed switch fonts depending on language (and on Traditional vs. Simplified Chinese).

> I'm not convinced it is. Configuring your software to use the right font on a Unicode system is, as far as I can see, at least as hard as configuring your software to use the right encoding on a non-Unicode system. It just fails less obviously when you don't, particularly outside Japan.

I'm considering 3 scenarios:

1. You are configuring for the Japanese-speaking market. In which case, settle on a font, or a set of fonts.

2. You are localizing into multiple languages and care about localization quality. In which case, yes, you need to know that localization in Unicode is more than just replacing content strings, but this is comparable to dealing with multiple encodings.

3. You are localizing into multiple languages and do not care about localization quality, or Japanese is not a localization target. In which case Japanese (user input / replaced strings) in your app / website will appear childish and shoddy, but it is still a better experience than mojibake.

In any case, it seems to me that it is not a worse experience than pre-Unicode. It's just that people who have no experience in localization expect Unicode systems to do things they cannot do just by replacing strings. You indeed frequently run into issues even in European languages if you think it's just a matter of replacing strings.


Japanese programs aren't globalized and already rely on the system being fine-tuned for Japanese, so the default font is already correct.


> Japanese programs aren't globalized and already rely on the system being fine-tuned for Japanese

Right, because Unicode-based systems don't work well in Japan. E.g. a Unicode-based application framework that ships its own font and expects to use it will display fine everywhere that's not Japan. So Japan is increasingly cut off from the paradigms that the rest of the world is using.


Custom fonts are often a mistake for any language; Google Fonts in particular often look wrong. Because of this, browsers often have an option to force the use of system fonts and set a minimum font size to improve readability.


> Custom fonts are often a mistake for any language; Google Fonts in particular often look wrong.

Be that as it may, the overwhelming majority of Unicode fonts are dramatically wrong for Japanese and not dramatically wrong for other languages.

> Because of this, browsers often have an option to force the use of system fonts and set a minimum font size to improve readability.

Such options are shrinking, in my experience. E.g. Electron is built on browser internals, but does it offer that option?


Would it still be harmful if language tags were used?


If the tag mechanism were used consistently and handled by all software, no. But in practice, the only way that would happen is if the tag mechanism were required for many languages. Unicode is, in practice, a system that works the same way for ~every human language except Japanese, which makes it much worse than the previous "byte stream + encoding" system, where any program written to support anything more than US English would naturally work correctly for every other language, including Japanese.


> Unicode is, in practice, a system that works the same way for ~every human language except Japanese

This is simply not true. As I've pointed out in a sibling comment, Unicode has a lot of surprising and frustrating behaviors with many European languages as well if you use it without locale data. The characters will look right, but e.g. searching, sorting and case-insensitive comparisons will not work as expected if the application is not locale aware.
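
Just to illustrate the sorting part, here is a minimal Python sketch. It is purely illustrative and assumes the sv_SE.UTF-8 and de_DE.UTF-8 locales are installed on the system, so treat it as a sketch rather than portable code:

    import locale

    words = ["oas", "zon", "öl"]

    # Plain sorted() compares code points: "ö" (U+00F6) lands after "z",
    # with no regard for what language the text is in.
    print(sorted(words))                       # ['oas', 'zon', 'öl']

    # Locale-aware collation gives different, language-specific answers.
    locale.setlocale(locale.LC_COLLATE, "sv_SE.UTF-8")
    print(sorted(words, key=locale.strxfrm))   # Swedish: ö is the last letter, so öl sorts last

    locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
    print(sorted(words, key=locale.strxfrm))   # German: ö collates like o, so öl sorts before zon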


> The characters will look right, but e.g. searching, sorting and case-insensitive comparisons will not work as expected

This is quite a different situation from Japan. A lot of applications don't do searching, sorting, or case-insensitive comparisons, but virtually every application displays text.


> It's not a myth, as anyone living in Japan knows

I lived in Japan. It is a myth. :-¥


Both problems are missing the point: you cannot handle Unicode correctly without locale information (which needs to be carried alongside as metadata outside of the string itself).

To a Swede or a Finn, o and ö are different letters, as distinct as a and b (ö sorts at the very end of the alphabet). A search function that mixes them up would be very frustrating. On the other hand, to an American, a search function that doesn't find "coöperation" when you search for "cooperation" is also very frustrating. Back in Sweden, v and w are basically the same letter, especially when it comes to people's last names, and should probably be treated the same.

Further south, if you try to lowercase an I and the text is in Turkish (or in certain other Turkic languages), you want a dotless i (ı), not a regular lowercase i. This is extremely spooky if you try to do case-insensitive equality comparisons and aren't paying attention, because if you do it wrong and end up with a regular lowercase i, you've lost information and uppercasing again will not restore the original string.

There are tons and tons of problems like this in European languages. The root cause is exactly the same as the Han unification gripes: Unicode without locale information is not enough to handle natural languages in the way users expect.
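
To make the casing and search points concrete, a tiny Python sketch (purely illustrative; fold_marks is a made-up helper, not what any real search library does):

    import unicodedata

    # Locale-blind case mapping: fine for English, wrong for Turkish, where
    # "I" should lowercase to the dotless "ı" (U+0131).
    print("KIRMIZI".lower())                   # 'kirmizi', not the Turkish 'kırmızı'

    # Matching "coöperation" against "cooperation" takes normalization plus
    # stripping of combining marks, a deliberate, language-dependent choice.
    def fold_marks(s: str) -> str:
        decomposed = unicodedata.normalize("NFD", s)
        return "".join(c for c in decomposed if not unicodedata.combining(c))

    print(fold_marks("coöperation") == "cooperation")   # True
    # ...and exactly this folding is what the Swede above does not want for ö.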


> which needs to be carried alongside as metadata outside of the string itself

Why not as data tagged with the appropriate language?

https://www.unicode.org/faq/languagetagging.html


If you mean in-band language tagging inside the string itself, the page you're linking to points out that this is deprecated. The tag characters are now mostly used for emoji stuff. If you only need to be compatible with yourself you can of course do whatever you like, but otherwise, I agree with what the linked page says:

> Users who need to tag text with the language identity should be using standard markup mechanisms, such as those provided by HTML, XML, or other rich text mechanisms. In other contexts, such as databases or internet protocols, language should generally be indicated by appropriate data fields, rather than by embedded language tags or markup.


The interesting question is why you agree. The deprecation itself doesn't tell us much, and the quote doesn't explain anything either: the "appropriate data fields" might not exist for mixed-language content (a rather common case), and why resort to the full ugliness of XML just for this?

(And the fact that emoji have had a positive impact in forcing apps into better Unicode support would be a point in favor of using a tag.)


Most applications do not do anything useful with in-band language tags. They never had widespread adoption in the first place and have been deprecated since 2008, so this is unsurprising. If you're using them in your strings and those strings might end up displayed by any code you don't control, you'll probably want to strip out the language tags to avoid any potential problems or unexpected behaviors. Out-of-band metadata doesn't have this problem.
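
If it helps, a minimal sketch of that stripping in Python; the function name is made up, and it simply drops everything in the Plane 14 tag block (U+E0000..U+E007F) that the old in-band language tags were built from. Note that emoji tag sequences (e.g. subdivision flags) reuse these characters, so a real implementation would have to treat those separately:

    def strip_tag_characters(s: str) -> str:
        # Drop Plane 14 tag characters (U+E0000..U+E007F), the block the
        # deprecated language tags were built from.
        return "".join(c for c in s if not 0xE0000 <= ord(c) <= 0xE007F)

    # An "en" language tag spelled with U+E0001 plus tag letters:
    print(strip_tag_characters("hello\U000E0001\U000E0065\U000E006E world"))  # 'hello world'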

As I said though, if you're in full control and only need to be compatible with yourself, you can do whatever you want.


In 2008, UTF-8 was only ~20% of all web pages! Again, that deprecation fact is not meaningful: a quick search shows the RFC for tagging is dated 1999, so that's just 10 years before deprecation, a tiny timeframe for such things. So I agree, it's not surprising there was no widespread use.

Out-of-band metadata has plenty of other problems besides the fact that it doesn't exist in a lot of cases.


> a search function that doesn't find "coöperation" when you search for "cooperation" is also very frustrating.

Look, we can just disregard The New Yorker entirely and the UX will improve.


Exactly! Thank you for giving a good explanation of why this whole post is founded on a fundamental misunderstanding.



