its[sic] 2024, and we are still grappling with Unicode character encoding problems

More like "because it's 2024." This wouldn't be a problem before the complexity of Unicode became prevalent.



You mean this wouldn't be a problem if we used the myriad different encodings like we did before Unicode, because we would probably not be able to even save the files anyway? So true.


Before Unicode, most systems were effectively "byte-transparent", and encoding was only a top-level concern. Those working in one language would use the appropriate encoding (likely CP1252 for most Latin-script languages), and there wouldn't be confusion about different bytes for the same-looking character.
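
As a rough illustration of that byte-transparency (a Python 3 sketch of my own; the byte values are just an example), the same bytes read back as different text depending on which legacy code page you assume, and nothing in the data says which one is right:

    # Illustrative only: one byte sequence, three legacy code pages.
    # Nothing in the data records which encoding was intended.
    data = bytes([0x47, 0xF6, 0x74, 0x7A])   # "Götz" as encoded in CP1252

    print(data.decode("cp1252"))   # Götz  (Western European reading)
    print(data.decode("cp1251"))   # Gцtz  (Cyrillic reading of the same bytes)
    print(data.decode("cp437"))    # G÷tz  (old DOS code page reading)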


A single user system, perhaps.

I've worked on a system that … well, didn't predate Unicode, but was sort of near the leading edge of it and was multi-system.

The database columns containing text were all byte arrays. And because the client (a Windows tool, but honestly Linux isn't any better off here) just took an LPCSTR or whatever, the bytes were just in whatever locale the client was running in. But that was recorded nowhere, and of course, all the rows were in different locales.

I think that would be far more common, today, if Unicode had never come along.


My understanding is that way back in the day, people would use an ASCII backspace to combine an ASCII letter with an ASCII accent character.


ASCII 1967 (and the equivalent ECMA-6) suggested this, and that the characters ,"'`~ could be shaped to look like a cedilla, diaeresis, acute accent, grave accent, and raised tilde respectively for that purpose. But I've never once seen or heard of that method used.

ASCII also allowed the characters @[\]^{|}~ to be replaced by others in ‘national character allocations’, and this was commonly used in the 7-bit ASCII era.

In the 8-bit days, for alphabetic scripts, typically the range 0xA0–0xFF would represent a block of characters (e.g. an ISO 8859¹ range) selected by convention or explicitly by ISO 2022². (There were also pre-standard similar methods like DEC NRCS and IBM's EBCDIC code pages.)

¹ https://en.wikipedia.org/wiki/ISO/IEC_8859

² https://en.wikipedia.org/wiki/ISO/IEC_2022
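
For what it's worth, here is a small Python 3 sketch (mine, illustrative only) of what interpreting that letter + BACKSPACE + mark convention might look like today: map the accent-shaped ASCII characters listed above to Unicode combining marks and normalize to NFC. Both the mapping table and the letter-first ordering are assumptions rather than anything standardized:

    # Sketch only: map "letter, BACKSPACE (0x08), accent-shaped mark"
    # to a Unicode combining character and normalize to a precomposed form.
    # The table follows the characters listed above; it is not a standard mapping.
    import re
    import unicodedata

    COMBINING = {
        ",": "\u0327",  # combining cedilla
        '"': "\u0308",  # combining diaeresis
        "'": "\u0301",  # combining acute accent
        "`": "\u0300",  # combining grave accent
        "~": "\u0303",  # combining tilde
    }

    def decode_overstrike(text: str) -> str:
        def repl(m):
            letter, mark = m.group(1), m.group(2)
            return unicodedata.normalize("NFC", letter + COMBINING[mark])
        return re.sub(r"(\w)\x08([,\"'`~])", repl, text)

    print(decode_overstrike("Go\x08\"tz"))   # Götz
    print(decode_overstrike("c\x08,a va"))   # ça va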


Googling, I saw people link to http://git.savannah.gnu.org/cgit/bash.git/tree/doc/bash.0 as an example of overstriking (albeit for bold, not accents). The Telnet RFC also makes reference to it. I also see lots of references in the context of APL.

I suppose the 60s/70s would have been the era of teletypewriters, where overstriking would more naturally be a thing.

I also found references to less supporting this sort of thing, but that seems to be about bold and underline, not accents.


nroff did do overstriking for underlining and bold. I don't remember if it did so for accents, but in any case it was for printer output and not plain text itself.

APL did use overstriking extensively, and there were video terminals that knew how to compose overstruck APL characters.
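
As a rough sketch of that convention (my own Python 3 example, not how less or nroff are actually implemented): pagers conventionally treat "X BACKSPACE X" as bold X and "_ BACKSPACE X" as underlined X, which can be translated to ANSI escape sequences like so:

    # Sketch: render nroff-style overstrikes the way pagers like less
    # conventionally do -- "X\bX" as bold X, "_\bX" as underlined X --
    # using ANSI escape sequences. Names and details are illustrative.
    import re

    BOLD = "\x1b[1m{}\x1b[0m"
    UNDERLINE = "\x1b[4m{}\x1b[0m"

    def render_overstrikes(line: str) -> str:
        def repl(m):
            first, second = m.group(1), m.group(2)
            if first == second:
                return BOLD.format(second)       # X BS X -> bold X
            if first == "_":
                return UNDERLINE.format(second)  # _ BS X -> underlined X
            return second                        # anything else: keep the last character
        return re.sub(r"(.)\x08(.)", repl, line)

    print(render_overstrikes("N\x08NA\x08AM\x08ME\x08E"))   # bold "NAME"
    print(render_overstrikes("_\x08b_\x08a_\x08s_\x08h"))   # underlined "bash"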


SHIFT-JIS and EUC would like a word.


You make it sound like non-English languages were invented in 2024


> This wouldn't be a problem before the complexity of Unicode became prevalent.

It was a problem even before then. It worked fine as long as you had countries composed of one dominant ethnicity that shat upon how minorities and immigrants lived (they were just forced to use a transliterated name, which could be one hell of a lot of fun for multi-national or adopted people) - and even that wasn't enough to prevent issues. In Germany, for example, someone had to go all the way up to the highest administrative court in the late 70s [1] to have his name changed from Götz to Goetz: he was pissed off that computers were unable to store the ö and would rather change his name than keep getting mis-named, but German bureaucracy does not like name changes outside of marriage and adoption.

[1] https://www.schweizer.eu//aktuelles/urteile/7304-bverwg-vom-...


Combining characters go back to the 90s. The Unicode normal forms were defined in the 90s, too. None of this is new at this point.
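
To make that concrete, a quick Python 3 illustration using the standard library's unicodedata (the example strings are mine): the precomposed and combining-character spellings look identical but only compare equal after normalization:

    # Precomposed vs. combining: visually identical, unequal until normalized.
    import unicodedata

    precomposed = "Z\u00fcrich"    # ü as one code point (U+00FC)
    combining   = "Zu\u0308rich"   # u + combining diaeresis (U+0308)

    print(precomposed == combining)                                 # False
    print(unicodedata.normalize("NFC", combining) == precomposed)   # True
    print(unicodedata.normalize("NFD", precomposed) == combining)   # True
    print([hex(ord(c)) for c in unicodedata.normalize("NFD", "\u00fc")])  # ['0x75', '0x308']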



