Boy it would sure be easier if the Turkish i was a different unicode character i...

elevatortrim · 2025-05-06T09:48:53 1746524933

Not sure about this. For this to work, one of these would need to happen:

1. Have two "i" characters on Turkish keyboards, one to use when writing in English, one in Turkish. Sounds difficult to get used to. Always need to be conscious about whether writing an "English i", or a "Turkish i".

2. "i" key is interpreted as English "i" when in English locale, as a special unicode character when in Turkish locale. This would be a nightmare as you would then always have to be conscious of your locale. Writing in English? Switch to English locale. Writing code? Switch to English locale. Writing a Turkish string literal in code? Switch to Turkish, then switch back. You would be constantly switching back and forth even though both are Latin alphabet.

JimDabell · 2025-05-06T10:52:29 1746528749

> "i" key is interpreted as English "i" when in English locale, as a special unicode character when in Turkish locale. This would be a nightmare as you would then always have to be conscious of your locale.

Isn’t this already the case with other languages? For instance, the same key on the keyboard produces a semicolon (;) in English and a Greek question mark (;) in Greek. These are distinct characters that are rendered the same (and also an easy way to troll a developer who uses an editor that doesn’t highlight non-ASCII confusables).
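For what it's worth, the two characters are easy to tell apart programmatically. A minimal Python sketch using only the standard `unicodedata` module (the variable names are just for illustration):

```python
import unicodedata

semicolon = "\u003b"        # ASCII semicolon
greek_question = "\u037e"   # Greek question mark, usually rendered the same

# Print code point and official name of each character.
for ch in (semicolon, greek_question):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# The code points differ, so a plain comparison fails.
same_code_point = semicolon == greek_question

# U+037E is canonically equivalent to U+003B, so NFC normalization folds it.
same_after_nfc = unicodedata.normalize("NFC", greek_question) == semicolon

print(same_code_point, same_after_nfc)
```

So a tool that normalizes input to NFC before comparing would treat the two as equal, while a raw byte or code point comparison would not.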

int_19h · 2025-05-07T00:27:30 1746577650

Not only that, but this is already the case with "i" specifically. Cyrillic "і" is different from Latin "i", and it's also located in a different place on the keyboard in the corresponding layouts.

alexey-salmin · 2025-05-06T10:00:56 1746525656

> 1. Have two "i" characters on Turkish keyboards, one to use when writing in English, one in Turkish. Sounds difficult to get used to. Always need to be conscious about whether writing an "English i", or a "Turkish i".

But you have to do that anyway to be able to produce the correct capitalized version: an "English I" or a "Turkish İ".

daveliepmann · 2025-05-06T10:10:12 1746526212

No: a Turkish keyboard has separate i/İ and ı/I keys, and Turkish-writing users with an American/international keyboard use a keyboard layout with modifier keys so that the i/I key can be altered to ı/İ. (I do the latter for idiosyncratic reasons.)

The person you're replying to is pointing out that differentiating English-i from Turkish-i requires some other unwieldy workaround. Would you expect manufacturers to add a third key for English i, or for people with Turkish keyboards to use a modifier key (or locale switching) to distinguish i from i? All workarounds seem extraordinarily unlikely.

elevatortrim · 2025-05-06T10:58:09 1746529089

Hmm, you are kind of right but not exactly:

Yes, there are two keys, but their function is not to write the character as a "Turkish i" and an "English i". These keys are necessary because there are 4 variations, that need 2 keys to write with caps lock on and off:

Key 1 - Big and small Turkish "I": Caps Lock On: I Caps Lock Off: ı

Key 2 - Big and small Turkish "İ": Caps Lock On: İ Caps Lock Off: i

For small "Turkish i" and "English i" to be different characters, there would need to be a third key.
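To make the four variations concrete, here is a small Python sketch that lists their code points and shows what Python's locale-independent `str.upper()` does with the two lowercase forms. The dictionary labels are mine, taken from the key description above:

```python
import unicodedata

# The four letters produced by the two keys described above.
letters = {
    "key 1, Caps Lock on": "I",
    "key 1, Caps Lock off": "\u0131",   # ı
    "key 2, Caps Lock on": "\u0130",    # İ
    "key 2, Caps Lock off": "i",
}

for label, ch in letters.items():
    print(f"{label}: U+{ord(ch):04X} {unicodedata.name(ch)}")

# Python's default (non-Turkish) uppercasing maps both small letters to "I",
# so the dotted/dotless distinction is lost without locale-aware casing.
default_upper = {ch: ch.upper() for ch in ("i", "\u0131")}
print(default_upper)
```

Only two of the four (U+0130 and U+0131) are outside ASCII; the other two are the plain ASCII "I" and "i", which is where the trouble starts.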

sebstefan · 2025-05-06T10:24:14 1746527054

Ah, that's because I thought Turks and Azerbaijanis just switched keyboard layouts to type in English and to type in their native language.

elevatortrim · 2025-05-06T11:03:41 1746529421

That's a sensible thought, but the Turkish QWERTY keyboard includes both the English-exclusive (Q, X, W) and Turkish-exclusive characters, so switching is rarely required.

lifthrasiir · 2025-05-06T09:06:33 1746522393

Impossible, because the decision was already made by the legacy Turkish encodings, which forced Unicode to pick only one option (round-trip compatibility with legacy encodings) out of the possible trade-offs.

alexey-salmin · 2025-05-06T09:53:16 1746525196

What were the other possible trade-offs? I don't really see how lack of round-trip compatibility is worse than what we have now. It's breaking the whole idea of Unicode code points, and for what?

thaumasiotes · 2025-05-06T12:14:11 1746533651

Actually it reflects the idea of Unicode code points correctly. They are meant to represent graphs, not semantics.

This isn't honored; we have many Unicode code points that look identical by definition and differ only in their secret semantics, but all of those points are in violation of the principles of Unicode. The Turkish 'i' is doing the right thing.

ubutler · 2025-05-06T14:00:37 1746540037

> Actually it reflects the idea of Unicode code points correctly. They are meant to represent graphs, not semantics.

Why do we then have lots of invisible characters that are intended essentially as semantic markers (eg, zero-width space)?

alexey-salmin · 2025-05-06T12:37:32 1746535052

How do you define "look identical" outside of fonts which from my understanding were excluded from Unicode consideration on purpose?

E.g. Cyrillic "а" looks the same as Latin "a" most of the time, they both are distant descendants of the Phoenician 𐤀, but they are two different letters now. I'm very glad they have different code points, it would be a nightmare otherwise.

anticensor · 2025-05-06T22:55:32 1746572132

And they would call it Greco-Roman unification, similar to Han unification.

thaumasiotes · 2025-05-06T23:43:47 1746575027

No, Han unification is a completely different thing. Unicode Han unification represents distinct glyphs as the same code point - the intent is that you choose the glyph you want by setting a font (!). This has been acknowledged as a mistake.

Having distinct code points for Latin capital letter A, Greek capital letter A, and Cyrillic capital letter A is the reverse, separate code points for glyphs that are identical by definition. That's also a mistake.

(Although it might be required by Unicode's other principle of being fully compatible with a wide variety of older encodings. There are many characters, like 囍, that don't qualify to have a code point, but that have one anyway because they're present in an encoding that Unicode commits to represent.)

gtbot2007 · 2025-05-06T15:01:11 1746543671

No that’s the opposite of how it’s supposed to work

zokier · 2025-05-06T11:00:55 1746529255

How would separate code point break round-tripping specifically?

dhosek · 2025-05-06T14:37:27 1746542247

The legacy Turkish encoding used ASCII i but a character in the 128–255 range for İ. Remember that not all documents are monolingual so you might have a document with, e.g., both English and Turkish text and in the legacy code page these would use i for both the English and Turkish letter.
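A quick way to see this layout is with Python's built-in codecs. This sketch assumes "the legacy Turkish encoding" means ISO 8859-9 or Windows-1254, which are the two Turkish 8-bit codecs Python ships (`iso8859_9` and `cp1254`):

```python
text = "i \u0130 \u0131 I"   # i, İ, ı, I

# Show the bytes each Turkish 8-bit codec produces for the four letters.
for codec in ("iso8859_9", "cp1254"):
    print(codec, text.encode(codec).hex(" "))

plain_i = "i".encode("iso8859_9")              # stays the ASCII byte
dotted_capital = "\u0130".encode("iso8859_9")  # İ lands in the 128-255 range
dotless_small = "\u0131".encode("iso8859_9")   # ı lands in the 128-255 range
print(plain_i, dotted_capital, dotless_small)
```

The dotted small i shares the ASCII byte 0x69 with the English i, so nothing in the legacy text records which language a given "i" belongs to.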

zokier · 2025-05-06T17:01:35 1746550895

That didn't answer the question, why would separate code point for Turkish lower-case dotted i break round-tripping?

dhosek · 2025-05-06T18:22:27 1746555747

Because text being in Turkish doesn’t mean it contains no non-Turkish text. You end up with weird edge cases when converting mixed text back and forth, since the letter would be a single code in legacy Turkish 8-bit text but two code points in Unicode. So Unicode text that contains “Kırgızistan (English: Kyrgyzstan)” would, under your scheme, come out of a Unicode-Legacy-Unicode round trip with the i in “English” encoded as the Turkish dotted i.
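The scenario can be sketched in Python. Everything here is hypothetical: Unicode has no separate "Turkish small i", so the private-use code point U+E000 stands in for it, and `to_legacy` / `from_legacy` are made-up converters that assume both i's share the legacy byte 0x69 and that the decoder always picks the Turkish one:

```python
# Stand-in for a hypothetical "Turkish small letter i": a private-use code
# point, since no such character actually exists in Unicode.
TURKISH_I = "\ue000"

def to_legacy(text):
    # In this imagined scheme both i's collapse to the single legacy byte 0x69.
    return text.replace(TURKISH_I, "i").encode("iso8859_9")

def from_legacy(data):
    # Assumption: a Turkish codepage decoder maps 0x69 to the Turkish i.
    return data.decode("iso8859_9").replace("i", TURKISH_I)

# Turkish i in the Turkish word, plain Latin i in "English".
original = f"K\u0131rg\u0131z{TURKISH_I}stan (English: Kyrgyzstan)"

legacy = to_legacy(original)
round_tripped = from_legacy(legacy)

unicode_round_trip_ok = round_tripped == original
legacy_round_trip_ok = to_legacy(from_legacy(legacy)) == legacy
print(unicode_round_trip_ok, legacy_round_trip_ok)
```

Under these assumptions the Unicode-Legacy-Unicode trip changes the text (the i in "English" comes back as the stand-in), while the Legacy-Unicode-Legacy trip is lossless.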

zokier · 2025-05-06T19:06:03 1746558363

If "turkish lower case dotted i" were a separate codepoint, that still wouldn't cause ambiguity like you describe. It would just mean that "U+0069 latin small letter i" would not be (directly) transcodable to the legacy Turkish character set. But that wouldn't really be any different from other similar homoglyph situations; for example, "U+0430 cyrillic small letter a" does not transcode to ASCII and that is business as usual. U+0069 not being transcodable to some legacy encoding is not really a round-tripping problem: the vast majority of Unicode codepoints are not transcodable to any single legacy encoding. Round-trip compatibility is really only a concern when going legacy-Unicode-legacy; it is naturally expected that not all strings will be able to round-trip Unicode-legacy-Unicode.

dhosek · 2025-05-06T21:14:15 1746566055

EXCEPT that the legacy Cyrillic codepages had separate codepoints for Latin a and Cyrillic а. You’re also making assumptions about the roundtrip preservation that are invalid. The idea is that if a string is encodable in the legacy codepage, you should be able to make the roundtrip. Yes, you can’t roundtrip ⨋ to most legacy codepages, but that’s not the brief.
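Both halves of this are easy to check with Python's built-in codecs. A small sketch, taking KOI8-R and Windows-1251 as examples of legacy Cyrillic codepages; `encodable` is a helper defined here for illustration, not a library function:

```python
latin_a = "a"
cyrillic_a = "\u0430"   # а

# Legacy Cyrillic codepages keep the two letters on different bytes.
for codec in ("koi8_r", "cp1251"):
    print(codec, latin_a.encode(codec), cyrillic_a.encode(codec))

def encodable(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True

# Cyrillic а has no ASCII or Turkish-codepage byte; neither does ⨋ (U+2A0B).
print(encodable(cyrillic_a, "ascii"))
print(encodable(cyrillic_a, "cp1254"))
print(encodable("\u2a0b", "cp1254"))
```

So for Cyrillic the legacy data already distinguished the homoglyphs, which is exactly what the legacy Turkish codepages did not do for i.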

zokier · 2025-05-08T14:10:54 1746713454

> The idea is that if a string is encodable in the legacy codepage, you should be able to make the roundtrip.

But which strings are encodable in a legacy codepage depends on what we define as encodable! If we had a separate codepoint for "turkish small letter i", then we could have simply defined that "latin small letter i" is not encodable in the legacy Turkish codepage, the same way that "cyrillic small letter a" is not encodable in the legacy Turkish codepage. "turkish small letter i" and "latin small letter i" would be just another normal homoglyph pair, same as "cyrillic small letter a" and "latin small letter a".

sebstefan · 2025-05-06T12:01:09 1746532869

I don't know about round-tripping of anything, but suddenly having an entire nation's keyboards outputting UTF-8 on outdated national systems probably designed for Latin1 seems like a tough sell to fix this issue

sebstefan · 2025-05-06T09:31:11 1746523871

Yep I'm aware

jeroenhd · 2025-05-06T09:35:01 1746524101

It does (U+0131 = Latin Small Letter Dotless I, U+0069 = Latin Small Letter I).

The problem is that uppercasing the dotted i outputs a different character depending on your current locale. Case-insensitive equality checks also break this way (I==i, except in a Turkish locale, so `QUIT ilike quit` is false).
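A rough Python sketch of that failure mode. Python's own `str.upper()` and `str.lower()` ignore the locale, so `turkish_upper` and `turkish_lower` below are hypothetical helpers that hard-code only the two Turkish i/ı rules; a real system would more likely lean on a library such as ICU:

```python
def turkish_upper(text):
    # Turkish rule: i -> İ (U+0130) and ı (U+0131) -> I; the rest as usual.
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()

def turkish_lower(text):
    # Turkish rule: İ (U+0130) -> i and I -> ı (U+0131); the rest as usual.
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()

# Locale-independent comparison: the usual case-insensitive match succeeds.
default_match = "QUIT".lower() == "quit"

# Turkish casing: "QUIT" lowercases with a dotless ı, so the match fails.
turkish_match = turkish_lower("QUIT") == "quit"

print(default_match, turkish_match)
print(turkish_upper("quit"), turkish_lower("QUIT"))
```

The same input bytes compare equal or unequal depending purely on which casing rules are in effect, which is why code that lowercases identifiers or commands under a Turkish locale tends to break.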

rob74 · 2025-05-06T09:44:02 1746524642

Yes - the problem is that "i" and "I" are standard ASCII characters, while the dotted I and the dotless i are not. Creating special "Turkish I" and "Turkish i" characters would have been an alternative, but would have had its own issues (e.g. documents where only some "i"s are Turkish and the rest "regular" because different people edited them with different software/settings).

mrspuratic · 2025-05-06T12:15:59 1746533759

Irish script traditionally used a dot-less "i", something that persists in current road signage (anecdotally to save confusion with "í", or with adjacent old-style dotted consonants, I can't find a definitive source to cite). It's only an orthographic/type thing, it's semantically an "i", though the Unicode dot-less "i" is sometimes used online to represent it.

tmtvl · 2025-05-06T09:45:55 1746524755

Is it? That's weird, I can't find the code for Latin Small Letter Dotted I. There is a Cyrillic dotted I, but that one doesn't have the dot in capitalised form.

What sebstefan is asking for is a Unicode character which is the non-capitalised form of Latin Capital Letter I With Dot Above (U+0130) which always gets capitalised to U+0130 and which U+0130 gets downcased to.
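For comparison, this is what happens today when U+0130 is lowercased without Turkish rules. A short Python sketch (standard library only, variable names are illustrative): the default mapping produces a two-code-point sequence rather than a single dedicated character.

```python
import unicodedata

capital_dotted = "\u0130"          # İ

# Default (non-Turkish) lowercasing yields "i" plus a combining dot above.
lowered = capital_dotted.lower()
print([f"U+{ord(c):04X}" for c in lowered])

# Uppercasing that sequence gives "I" plus the combining dot...
raised = lowered.upper()

# ...which NFC normalization composes back into the single code point U+0130.
recomposed = unicodedata.normalize("NFC", raised)
print(recomposed == capital_dotted)

# A plain "i", by contrast, never uppercases to U+0130 by default.
plain_upper_is_dotted = "i".upper() == capital_dotted
print(plain_upper_is_dotted)
```

So the sequence "i" + U+0307 behaves a bit like the character being asked for, but it is two code points and it is not what a Turkish keyboard types.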

anticensor · 2025-05-06T23:00:41 1746572441

And DELETE DOT ABOVE would end that locale dependency.

makeitdouble · 2025-05-06T09:32:10 1746523930

I'm imagining coding with some random "i" being a different completely undistinguishable character from the English "i". Or people writing your name and not matching in their DB because their local "i" is not your "i".

It's a potential issue already depending on your script, and CJK also has this funny full English alphabet but all in double-width characters that makes it PITA for people who can't distinguish the two. But having it on a character as common as "i" would feel specially hellish to me.
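For the full-width case at least, compatibility normalization helps with the name-matching problem. A minimal Python sketch; `normalize_name` is a hypothetical helper, not something from a particular database or library:

```python
import unicodedata

def normalize_name(name):
    # NFKC folds compatibility forms (such as full-width letters) into their
    # ordinary equivalents; casefold() then removes case differences.
    return unicodedata.normalize("NFKC", name).casefold()

fullwidth = "\uff2a\uff49\uff4d"   # "Jim" typed in full-width letters
ascii_name = "Jim"

raw_equal = fullwidth == ascii_name
normalized_equal = normalize_name(fullwidth) == normalize_name(ascii_name)
print(raw_equal, normalized_equal)

# Normalization does not merge genuinely different letters such as ı and i.
dotless_equal = normalize_name("\u0131") == normalize_name("i")
print(dotless_equal)
```

This works because full-width letters carry compatibility mappings to their ASCII counterparts; a separate "Turkish i" would only be foldable this way if it were defined with such a mapping.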

sebstefan · 2025-05-06T09:38:51 1746524331

It wouldn't matter

There's already this problem for cyrillic 'e' and latin 'e' and hundreds of other characters

People use it to create lookalike URLs and phish people

https://www.pcmag.com/news/chrome-blocks-crafty-url-phishing...
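A crude way to flag such lookalikes is to check whether a label mixes scripts. This Python sketch is only a heuristic: it takes the first word of each character's Unicode name as the "script", which is an assumption of mine and not the real Unicode Script property, and `scripts_used` is a made-up helper:

```python
import unicodedata

def scripts_used(text):
    """Rough guess at the scripts in text, based on Unicode character names."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # e.g. "LATIN SMALL LETTER A" -> "LATIN"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

genuine = "apple"
spoofed = "\u0430pple"   # first letter is Cyrillic а, not Latin a

print(scripts_used(genuine))
print(scripts_used(spoofed))

looks_suspicious = len(scripts_used(spoofed)) > 1
print(looks_suspicious)
```

A check like this catches a Cyrillic letter inside a Latin word, but it would not help with a lookalike that lives in the same script, which is the situation a separate Turkish i would create.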

makeitdouble · 2025-05-06T11:35:49 1746531349

Cyrillic 'e' is isolated in that you switch script when writing it. I'd compare it to the Greek X.

Turkish isn't on a fully separate script: most letters are standard ASCII and only a few are special (it's closer to French or German with their accented characters), so you don't have the explicit switch, it's always mixed.

sebstefan · 2025-05-06T12:09:00 1746533340

Then you have the greek question mark ;

alexey-salmin · 2025-05-06T10:02:13 1746525733

> But having it on a character as common as "i" would feel specially hellish to me.

https://en.wikipedia.org/wiki/Dotted_I_(Cyrillic)