By the same reasoning, the 7-eyed O has now been used more than once, so it deserves a glyph! So the right way to do this is to introduce a new character for the correct glyph and leave the current one in place (perhaps changing its name). Otherwise these tweets won't make sense when read by someone who has updated to Unicode 15.0.
Honestly it probably deserves the Pluto treatment: decertification as a character. One historical use in the 1400s doesn't merit a character and never did.
Unicode's mission is to make every document "roundtrip-able". Even if a character is only used once, it should be possible to save a plaintext version of the containing document without losing any information. Roughly, I should be able to put a transcription of that one translation from the 1400s on Wikisource without using images.
You may disagree with me, and that's fine, but it doesn't change Unicode's mission. Besides, there's room for 1,112,064 codepoints[a], and only 149,146 are in use. It's predicted we'll never use it up, so what harm is there in one codepoint no one will ever need?
[a]: U+10'FFFF max; it used to be U+FFFF'FFFF, but UTF-16 and surrogates ruined that
If that was once its mission, it was clearly abandoned long ago. They rejected Klingon characters on the grounds that it has low usage for communication, and that many of the people who do communicate in Klingon use a latinized form.
ꙮ seems to just be a fancy way of writing О. I haven't seen anything that says it has a different meaning. The arguments for excluding Klingon seem to apply even more so to ꙮ.
If you look through the old mailing list postings, the oft-left-implicit problem with Klingon (as well as Tengwar, Everson’s [EDIT: misspelling] pet project) is that it may get people into legal trouble (even though in a reasonable world it shouldn’t be able to). So in the unofficial CSUR / UCSUR they remain.
A weird solitary character from the 1400s isn’t subject to that, and even if it’s a mistake it’s probably not worth breaking compatibility at this point (I think the last such break with code points genuinely changing meanings was to repair a mistaken CJK unification some time in the 00s, and the Consortium may even have tied its own hands in that regard with the ever-more-strict stability policies).
Similarly, for example, old ISO keyboard symbols (the ⌫ for erase backwards, but also a ton of virtually unused ones) were thrown in indiscriminately at the beginning of the project when attempting to cover every existing encoding, but when the ISO decided to extend the repertoire they were told to kindly provide examples of running-text (not iconic) usage in a non-member-body-controlled publication. (Crickets. The ISO keyboard input model itself only vaguely corresponds to how input methods for QWERTY-adjacent keyboards work in existing systems—as an attempt at rationalization, it seems to mostly be a failed one.)
[EDIT: Removed a section about the now-fixed typo]
> I think the last such break with code points genuinely changing meanings was to repair a mistaken CJK unification some time in the 00s, and the Consortium may even have tied its own hands in that regard with the ever-more-strict stability policies[.]
Not exactly: the last break happened between Unicode 1.1 and 2.0, and the new CJK Unified Ideographs Extension A block still contains unified characters. The main reason for the break was that both Hangul and CJK(V) ideographs required tons of additional code points, and it became clear that the 16-bit code space was dangerously insufficient; by 1.1 there was only a single big block of unassigned code points from U+A000 to U+E7FF (18,432 total), while 2.0 added 4,516 new Hangul and 6,582 new CJK(V) ideographs (11,098 total).
Unless it's legitimately someone's native tongue, conlangs shouldn't be in Unicode. If there are kids out there who are native Klingon speakers, then you can make the argument it should be included.
I think it makes way more sense to put a conlang in Unicode than it does a peculiar stylistic flourish only ever applied once to a single letter in a single document. If that belongs in Unicode, why not every bit of marginalia ever doodled and every uniquely adorned drop cap / initial letter?
“A linguist has revealed he talked only in Klingon to his son for the first three years of his life to find out if he could learn to speak the 'language'.
[…]
Now 13, Speers' son does not speak Klingon at all.”
I see that near it, there is an ef (Ф) with a very tall stem.
Why should that not be included as a standard unicode character? Surely it is used more often than the multiocular o.
You may say "it's a decorative flourish", which is of course true, but so is the multiocular o. Should we allow every conceivable decorative flourish into unicode? What is the standard for where flourishes become distinct characters?
Today, I wrote a document by hand containing a new symbol that only looks like genitalia if you squint really hard. Where do I apply to have it included in unicode so that it can be digitized properly?
Rule-lawyering wise-asses try to mess with many policies. It's rarely a sensible indictment of a policy, nor is it very effective. Anyone dealing with such people just ignores them.
For as inclusive as that mission is, it seems weird to me how limited in certain areas unicode is. For instance, people use peach emoji since there isn't one for butt, eggplant since there's no penis, etc.
This doesn't contradict the stated goal exactly, but it seems against the spirit of it at least.
One could argue that emoji should have never been added to Unicode in the first place. Peaches and butts are images, pictures, illustrations, whatever - but they are not characters. There's no writing system which has a colored drawing of a peach as a character.
Yes there is - a widely used character set (when Unicode talks about "writing systems" it explicitly includes all the computer character sets used in practice pre-Unicode) used by Japanese 'featurephones' had emoji characters, so in order to be able to include that character set in Unicode, Unicode had to add emoji.
They're sort of neither. The peach emoji will render differently on iOS, Android, Windows. And I'm sure emoji-replacement packs are possible on Windows and Android (even though it's also guaranteed to be a virus).
So a peach emoji is not the same thing as the iOS peach-emoji-image. Similar to how changing my font doesn't change the actual characters.
I don't think including emojis was a great idea, but now that it's happened and people everywhere use them, emoji have become characters. I agree with your point, but it's already happened and so now there's not really any going back.
But that doesn't change the fact that most people use them and like them, and there is not much technical disruption. They just chose practicality over purity.
Not only that - people use them in textual communication the way letters traditionally are used. There is probably a better argument for emoji than for a lot of other things in Unicode (but it is a slippery slope).
That wouldn't be practical. It would make fonts too big, and videos aren't a thing that goes inline in text.
However, I could totally see some kind of open source GIF library of a few hundred meme videos and pictures, to standardize the "Reply with a GIF" thing in some P2P chat ecosystem, and maybe it could have a new URL scheme for referring to OpenMemes images.
I tried to reply with just a unicode penis but that got flagged immediately, so I'll be more substantial and leave out the actual penis. It appears in Egyptian hieroglyphs, so actually there is a penis included in unicode.
That's true, good call. I feel like there should be one without the Egyptian-hieroglyph context, though I'm not exactly sure how that kind of thing works in Unicode.
> For instance, people use peach emoji since there isn't one for butt, eggplant since there's no penis, etc.
Personally I think there should be, actually. There's all these other body parts but these are left out. Emoji is almost becoming a language and the good thing is that everyone can understand them, regardless of language. For example I could imagine these could be very useful in an international medical setting. Or for sexting, obviously, we can pretend that's not a thing but that's a bit too Victorian for me.
Of course they're not appropriate in some settings, but neither are many words.
I REALLY don't like that emojis are beholden to companies. For example, when the emoji for a gun was changed from a pistol to a squirtgun on many platforms, it changed the meaning of its use by a lot. You could argue that it is a good thing, but I see it as a pretty bad direction to go into.
Unicode doesn't have a character for every illuminated initial, nor should it. I'm not clear on why this character should be considered any differently.
Wow, this is probably the most actually useful and interesting comment in this whole discussion, thanks! For anyone interested, the most relevant quotes from the document are in particular:
"This document requests the addition of a number of Cyrillic characters to be added to the UCS. It also requests clarification in the Unicode Standard of four existing characters. This is a large proposal. While all of the characters are either Cyrillic characters (plus a couple which are used with the Cyrillic script), they are used by different communities. Some are used for non-Slavic minority languages and others are used for early Slavic philology and linguistics, while others are used in more recent ecclesiastical contexts. We considered the possibility of dividing the proposal into several proposals, but since this proposal involves changes to glyphs in the main Cyrillic block, adds a character to the main Cyrillic block, adds 16 characters to the Cyrillic Supplement block, adds 10 characters to the new Cyrillic Extended-A block currently under ballot, creates two entirely new Cyrillic blocks with 55 and 26 characters respectively, as well as adding two characters to the Supplementary Punctuation block, it seemed best for reviewers to keep everything together in one document.
(...)
MONOCULAR O Ꙩꙩ, BINOCULAR O Ꙫꙫ, DOUBLE MONOCULAR O Ꙭꙭ, and MULTIOCULAR O ꙮ are used in words which are based on the root for ‘eye’. The first is used when the wordform is singular, as ꙩкꙩ; the second and third are used in the root for ‘eye’ when the wordform is dual, as ꙫчи, ꙭчи; and the last in the epithet ‘many-eyed’ as in серафими многоꙮчитїй ‘many-eyed seraphim’. It has no upper-case form. See Figures 34, 41, 42, 55."
Because it's already been added to unicode. Now it's not a question of whether or not to add, rather to remove, and unicode almost by definition does not remove.
Meanwhile one still can't roundtrip regular Japanese without some kind of funky out-of-band signalling. By itself this kind of thing is harmless, but it speaks to poor prioritization from Unicode.
This is incorrect. I think you defined round-trip as something else, but character set A providing round-trip compatibility with another set B means that B can be converted to A and back to B without loss. And it is one of Unicode's explicit goals to provide round-trip compatibility with major encodings, including Japanese ones.
Han unification only means that when you convert Japanese encodings (B) to Unicode (A), it is not distinguishable from non-Japanese encodings converted to Unicode. This means that the Unicode text doesn't always follow domestic conventions without out-of-band signaling or IVD or so. But if you know that the text was converted from a particular encoding, you can perfectly recover the original text encoded in that encoding.
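To make that definition concrete, here is a minimal Python sketch; any Shift_JIS-encodable text works the same way:

```python
# "Round-trip" in the character-set sense: legacy bytes -> Unicode ->
# legacy bytes recovers the original exactly, even though the Unicode
# string itself no longer records which encoding it came from.
original = "骨".encode("shift_jis")        # legacy Shift_JIS bytes
as_unicode = original.decode("shift_jis")  # convert into Unicode
back = as_unicode.encode("shift_jis")      # convert back out
assert back == original                    # lossless round trip
```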
By that logic any 8-bit encoding is round-trip compatible with all encodings, since however bad the mojibake is, if you know what the original encoding was then you can always just convert back to that.
To be fair, they wanted to keep everything representable in 16 bits, and that wasn't going to happen without Han unification. The mess when everything still had to move to a 32-bit representation has been far reaching: many programming languages went from exposing code points atomically as "char" to exposing a half-encoded nonsense value that just happens to also be a valid standalone value in UTF-16 most of the time, and a source of bugs when you least expect it.
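The "half-encoded nonsense value" being described is a UTF-16 code unit. A quick Python illustration (Python's strings are whole code points, so it can show the surrogate pair from the outside):

```python
# Characters above U+FFFF occupy two 16-bit code units (a surrogate
# pair) in UTF-16. Languages whose "char" is a UTF-16 code unit (Java,
# JavaScript, C#) therefore see two chars for one code point.
clef = "\U0001D11E"                      # U+1D11E MUSICAL SYMBOL G CLEF
units = clef.encode("utf-16-be")
assert len(clef) == 1                    # one code point...
assert len(units) == 4                   # ...but two 16-bit units
hi = int.from_bytes(units[:2], "big")
lo = int.from_bytes(units[2:], "big")
assert 0xD800 <= hi <= 0xDBFF            # high surrogate
assert 0xDC00 <= lo <= 0xDFFF            # low surrogate
```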
"Han Unification" - in Unicode many Japanese characters are represented as Chinese characters that look different (and subjectively ugly). The Unicode consortium's answer is that you're supposed to use a different font or something when displaying Japanese, which is pretty unsatisfying (e.g. if you want to have a block of text that contains both Japanese and Chinese, you can't represent that as just a Unicode string, it has to be some kind of rope of segments with their own fonts, at which point frankly you might as well just go back to bytes-with-encoding which at least breaks very clearly and visibly if you get it wrong).
The thing is, this is just a decorative way to write “o”. It’s not a specific letter by any definition.
I can’t speak for the other letters that were added in the same batch in 2007. Some of them seem meaningful, I dunno, I don’t speak Old Church Slavonic (although I am told it sounds like Croatian, which I understand a little).
> so what harm is there in one codepoint no one will ever need?
Font bloat (do you want a font with 1 million characters in it? I don't. Do you want to have to install 1,000 fonts of 1,000 characters each to be sure to cover the whole Unicode table? I don't).
Lots of issues for everyday programmers (how do you handle weird Unicode characters in your validation code?), potentially leading to security issues (bypassing validation rules with close-but-different characters, phishing…).
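A minimal, hypothetical sketch of the validation point in Python (`naive_ok` is invented for the example): a naive "letters only" filter happily accepts a Cyrillic lookalike, which is exactly the phishing vector described above.

```python
import unicodedata

def naive_ok(s):
    # str.isalpha() accepts ANY Unicode letter, not just ASCII,
    # so it is no defense against lookalike characters
    return s.isalpha()

spoof = "payp\u0430l"                    # U+0430, renders just like 'a'
assert spoof != "paypal"                 # different code points...
assert naive_ok(spoof)                   # ...but the naive filter passes it
assert unicodedata.name(spoof[4]) == "CYRILLIC SMALL LETTER A"
```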
The artist Prince changed his stage name to an unpronounceable symbol for a few years. It appears in more than one document. Should it be added to Unicode?
Isn’t there an entire Unicode block for the symbols on the Phaistos disc? Yes: https://en.wikipedia.org/wiki/Phaistos_Disc_(Unicode_block) . I suppose those occur in quite a few documents about the disc, even though the disc itself is the only known document written in those symbols.
One historical use in the 1400s doesn't merit a character and never did
One known and surviving use. It is possible that it exists in other places, since the vast majority of the planet's written work has not been digitized. It may also have been used other places that have not survived.
Just because it's not important to you does not mean it is not important.
The fact that it survived for 600 years makes it interesting and worth saving. It is infinitely unlikely that anything you do, write, or say will last that long.
Sure it's possible, but there should be a higher bar than "it's possible it's used more than once" for meriting inclusion in the standard keyboard of billions of devices worldwide.
The thing is, looking at the page, there are many other characters that were not added - the large red С-looking characters, for example. But for some "bizarre" reason, those were not included in Unicode...
Of course, the simple answer is that Unicode actually includes any character that someone cares enough to ask to be added, with rare exceptions.
While the origin of 彁 will never be certain, there is a good chance that it came from a misinterpretation of 彊 [1]. Why isn't this an accepted theory, though? Because it is still possible that 彁 really did appear in some reference source used during standardization, and neither that source nor a source where 彊 looks like 彁 has been found.
idk. When the word Planet was redefined such that Pluto was no longer a planet, it kind of ruined the word Planet. It suddenly wasn’t nearly as useful a word as it used to be (even though now it has a precise meaning). For most people who use the word, it won’t matter (and is actually rather exciting) that they keep discovering new planets in our solar system.
If they’d treat the word characters the same way, it would only serve to confuse and do no favors to the remaining glyphs.
This is temporary though; soon people will look at you funny if you say that Pluto is a planet, and/or they might not even have heard of it (though of course it is still worth learning about in a History of Science context).
We do NOT keep discovering new planets, rather minor planets (I agree that the term is confusing); more than a million of them have been discovered in the Solar System now, like 9007 James Bond.
It could go either way, it is not always that the scientific meaning wins out, especially not when even scientists don’t find the new definition useful.
When I think of a planet, I think of a world that has active geology that isn’t a moon (I know excluding moons is arbitrary, and perhaps I shouldn’t do that; but hey, that’s language for you). I honestly don’t care about the orbit, and I bet that when most people think about planets they aren’t thinking about the orbit either, let alone whether the planet has cleared the orbit or not. I doubt that will change.
Not just that, but whether or not Mars is still geologically active is an open question. If you admit planets on the basis of a history of geological activity, then Ceres is a planet too.
I don’t think anybody considers geological activity as particularly useful for classifying things as ‘planet’ or ‘not planet’.
Why shouldn’t Ceres be a planet? If Pluto gets to be a planet then Ceres is definitely a planet.
But there is still active geology on Mars. There is still moisture, winds and ice-caps that are shaping the environment. I consider that to be geologically active.
EDIT: And there are actual experts which consider active geology (or something similar) to be a planet, including Anton Petrov (https://www.youtube.com/watch?v=8-2HxrgqUnM)
Okay, but then you have to go and figure out which other asteroid and kuiper belt objects are planets.
The 'dwarf planet' distinction helps solve this! There are planets - distinctive in that they have clear orbits - and there are dwarf planets, which can be part of belt systems. This is a useful distinction.
Sure it is, but the distinction between terrestrial planets and gas giants are also useful, that doesn’t mean the latter aren’t planets.
I think it is fine that there are more planets than we can meaningfully count. Loads of things in our language act like that. E.g. a bug can be any number of things, and you know what a bug is just by talking about it. If some insect society then comes up with a meaningful definition of bugs which excludes spiders, that definition isn’t really doing the average user of that word any favors.
Yeah, probably, strictly speaking... But I’m not a planetary scientist. I’m merely a user of language, and I don’t need to be rigorous in my definitions. And to me the weather patterns on Jupiter are interesting enough a feature to count as geology (even though strictly it probably isn’t geology).
Theoretically, UTF-8 can encode up to 31 bits (U+7FFF'FFFF)[0], but for compatibility with UTF-16's surrogates it's officially capped to 21 bits, with the max being U+10'FFFF[1]. That decision was made in November 2003, so there are two decades of software written with a hard cap of U+10'FFFF.
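You can watch the cap bite from Python, which enforces the post-2003 limit itself:

```python
# Code points stop at U+10FFFF; chr() refuses anything past it.
assert chr(0x10FFFF) == "\U0010FFFF"       # highest valid code point
try:
    chr(0x110000)                          # one past the cap
    raise AssertionError("should have been rejected")
except ValueError:
    pass                                   # rejected, as expected

# Within the cap, UTF-8 never needs more than 4 bytes per code point;
# the original 31-bit scheme would have gone up to 6.
for ch, nbytes in [("\x7f", 1), ("\u0080", 2), ("\u0800", 3), ("\U00010000", 4)]:
    assert len(ch.encode("utf-8")) == nbytes
```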
Yes, but this is a change either way, because that codepoint's definition referred to that character. Either the reference or the description of the appearance has to change.
Make a new character. Updating the existing character ruins the meaning of all previous usages.
It's like trying to change an API. Don't disrespect your existing users. Make a new version.
(ꙮ ͜ʖꙮ)
Think of all the ASCII art this botches. That has to have some historical importance to the Unicode standards body.
(⌐ꙮ_ꙮ)
For scholarly digital (unprinted) documents where the correct character rendering matters, erroneous past usages can be trivially found with grep and a date search, and easily corrected. The domain experts will familiarize themselves with this issue and fix the problem. Don't take a shotgun to it!
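A rough sketch of what that curation pass could look like, in Python; everything here (the directory layout, the `*.txt` glob, the function name) is hypothetical:

```python
# Hypothetical sketch: walk a corpus and flag files containing the
# multiocular O, so a domain expert can review each occurrence by hand.
from pathlib import Path

MULTIOCULAR_O = "\ua66e"  # U+A66E CYRILLIC LETTER MULTIOCULAR O

def files_containing(root, pattern="*.txt"):
    """Yield paths under `root` whose text contains the character."""
    for path in Path(root).rglob(pattern):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if MULTIOCULAR_O in text:
            yield path
```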
This message wꙮn't have the ꙮriginally intended meaning if the characters are updated from underneath.