> I do not want emoji on my computer, and I hate Unicode.
Ok well good for you but that’s akin to saying “I don’t want languages other than English on my computer and I hate the color Green”. It’s completely irrelevant what you prefer; the point of general computing is that it’s not for you, it’s for everybody. If you are writing software for everybody, then I’m sorry, but xenophobic assumptions just don’t hold.
Because really, the hidden meaning behind your post is “I don’t want people who speak languages other than English to be able to communicate as easily as me”. I’m sure that’s not your intent, but it is the direct result of your actions.
That is not true. I do want people who write in other languages to be able to communicate (better than me, not worse), using better encodings than Unicode. I do intend to support writing in other languages, including international text (text directions, multibyte encodings, etc.). However, Unicode is too messy and not good: it has problems such as Han unification, ambiguous widths, and changes between versions of Unicode that complicate the specification. Some people think that TRON code would be better, and it does have some advantages, especially for writing in Japanese, although TRON has its own problems (for example, its design means that character sets other than JIS do not fit properly). My opinion is that it is better not to use a single character set for everything; instead, different ones are useful for different purposes.
(And if needed, a program to convert encodings is possible, although this can sometimes produce incorrect characters, depending on the context. It is an approximation if you have no other choice, but it is better to use proper fonts and text handling for the appropriate code pages.)
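As a rough illustration of that kind of lossy conversion (a minimal sketch, not any particular existing tool): characters that do not exist in the target code page can be replaced with an approximation marker rather than failing outright.

```python
def convert(text_bytes: bytes, src: str, dst: str) -> bytes:
    """Re-encode text from one code page to another.

    Characters missing from the destination encoding become '?',
    so the result is an approximation, not a faithful copy.
    """
    decoded = text_bytes.decode(src)
    return decoded.encode(dst, errors="replace")

# Japanese text survives Shift_JIS -> EUC-JP,
# but is lost going to Latin-1.
jp = "日本語".encode("shift_jis")
print(convert(jp, "shift_jis", "euc_jp"))   # faithful round trip
print(convert(jp, "shift_jis", "latin-1"))  # b'???' — approximation only
```

Whether `'?'` is an acceptable stand-in, or whether the program should instead keep the original bytes tagged with their encoding, depends on the context, as noted above.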
Think really carefully about it, though. Going back to separate charsets for everything would mean losing the ability to mix multiple languages in the same text.
This might not matter to you, but it matters to… for example… anybody learning a language. Например, я провел последние месяцы, изучая русский язык. Как бы я написал это, если бы мой пост был в ASCII? (“For example, I have spent the last few months learning Russian. How would I have written this if my post had been in ASCII?”)
I have thought about it, and considered many things, and I have concluded that Unicode is not better. Different applications require different capabilities, and Unicode has problems with mixing languages (including Han unification, and others such as the Turkish letter "I"), ambiguous widths, character equivalences, homoglyphs, and more. (Which things are desired or undesired, and which are missing or incorrect for a specific use, depends on the application.)
You can still include, in some formats (such as document formats), codes for code-page switching. This has some advantages, including allowing Chinese and Japanese together in the same document, as well as being a cleaner way to implement mixed text directions, etc.
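In-band code-page switching of this kind already exists: ISO 2022 uses escape sequences to select a character set mid-stream. A small demonstration using Python's built-in `iso2022_jp` codec:

```python
# ISO-2022-JP switches between ASCII and JIS X 0208 with escape
# sequences embedded in the byte stream itself.
s = "abc 日本語 xyz"
encoded = s.encode("iso2022_jp")
print(encoded)
# ESC $ B (b"\x1b$B") switches to JIS X 0208;
# ESC ( B (b"\x1b(B") switches back to ASCII.
assert b"\x1b$B" in encoded and b"\x1b(B" in encoded
assert encoded.decode("iso2022_jp") == s
```

A document format could use the same principle with its own switching codes, rather than escape bytes in the text stream.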
Also, sometimes the character set does not have all of the Chinese characters. Cangjie can describe many more possible Chinese characters, and maybe an extended form with up to six parts would allow even more combinations; I do not know for sure.
And these considerations are not even close to everything. Different sets of encodings are useful for different purposes, I think. TRON code has its own advantages and disadvantages compared with Unicode, and switching encodings also has some advantages (and disadvantages) too. (There are also some things that I do not know about TRON, because the documentation is difficult to find.)
One idea that I have seen is having separate language and glyph fields, to allow displaying an approximation of characters that your computer does not have. Whether or not this is appropriate depends on what you are making, though. I have thought about how to do something similar with an "output-only encoding" that has several fields and is also linked back to the text in whatever encoding it was originally, which allows an approximated display with the fonts that your computer does have; but it is used only for display, and only as a fallback mechanism.
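A minimal sketch of what such a language-plus-glyph display record might look like. The field names and the fallback marker here are my own invention for illustration, not from any existing system:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class DisplayChar:
    """One output-only display record: glyph plus language tag,
    linked back to the source text in its original encoding."""
    language: str         # script/language tag, e.g. "ja", "zh-Hant"
    glyph: int            # code point or glyph id in some glyph registry
    source_encoding: str  # encoding of the original text, e.g. "shift_jis"
    source_bytes: bytes   # the original bytes this record was derived from

def fallback_render(c: DisplayChar, have_font_for: Callable[[str], bool]) -> str:
    """Render the glyph if a suitable font exists; otherwise show an
    approximation derived from the original bytes, clearly marked."""
    if have_font_for(c.language):
        return chr(c.glyph)
    approx = c.source_bytes.decode(c.source_encoding, errors="replace")
    return "[" + approx + "?]"  # marked as approximate, display-only
```

Because the record keeps `source_bytes` and `source_encoding`, the approximation never replaces the real text; it exists only at the display layer, matching the fallback-only intent described above.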
(My opinion is that when mixing text directions within a paragraph, a quotation that is not in the paragraph's main text direction should not be broken across lines; for text in the opposite direction, only short quotations should be inline. Use a block quotation instead if needed.)
Maybe I should make ICNU (International Components for Non-Unicode), to properly deal with international text. I have dealt with such things, so I know some of the considerations in designing an interface with the needed capabilities. (One capability I intend to include is the possibility of using TRON character codes, although you can also use other character sets, including Unicode in some cases (such as existing files and fonts that use Unicode). Usually, extra data files would be needed for many purposes, but you can specify how to find these files. Furthermore, the ability to disable specified encodings (especially Unicode) is also important.)
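A hypothetical sketch of what one corner of an ICNU-style interface could look like: a codec registry where individual encodings can be disabled outright. The class and method names are invented for illustration, not an existing library:

```python
class EncodingRegistry:
    """Registry of codecs in which specific encodings can be disabled."""

    def __init__(self):
        self._codecs = {}      # name -> (decode_fn, encode_fn)
        self._disabled = set()

    def register(self, name, decode_fn, encode_fn):
        self._codecs[name] = (decode_fn, encode_fn)

    def disable(self, name):
        """Disable an encoding entirely (e.g. Unicode transformation formats)."""
        self._disabled.add(name)

    def decode(self, name, data: bytes) -> str:
        if name in self._disabled:
            raise LookupError(f"encoding {name!r} is disabled")
        return self._codecs[name][0](data)

reg = EncodingRegistry()
reg.register("shift_jis", lambda b: b.decode("shift_jis"),
             lambda s: s.encode("shift_jis"))
reg.register("utf-8", lambda b: b.decode("utf-8"),
             lambda s: s.encode("utf-8"))
reg.disable("utf-8")

print(reg.decode("shift_jis", "日本語".encode("shift_jis")))  # works
# reg.decode("utf-8", b"abc") would raise LookupError
```

A real design would also need the external-data-file lookup mentioned above (e.g. a caller-supplied search path for conversion tables), which is omitted here.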
My own designs of programs, file formats, protocols, etc. avoid Unicode as much as possible, since I think the other way is better. However, in some cases you can specify any code page number anyway, so you can still specify code page 1209 for UTF-8 if wanted. In other cases, this does not work (since it is not appropriate for that specific use).