By the same reasoning, the 7-eyed O has now been used more than once, so it deserves a glyph! So the right way to do this is to introduce a new character for the correct glyph and leave the current one in place (perhaps changing its name). Otherwise these tweets won't make sense when read by someone who has updated to Unicode 15.0.
Honestly it probably deserves the Pluto treatment: decertification as a character. One historical use in the 1400s doesn't merit a character and never did.
Unicode's mission is to make every document "roundtrip-able". Even if a character is only used once, it should be possible to save a plaintext version of the containing document without losing any information. Roughly, I should be able to put a transcription of that one translation from the 1400s on Wikisource without using images.
You may disagree with me, and that's fine, but it doesn't change Unicode's mission. Besides, there's room for 1,112,064 codepoints[a], and only 149,146 are in use. It's predicted we'll never use it up, so what harm is there in one codepoint no one will ever need?
[a]: U+10'FFFF max; it used to be U+FFFF'FFFF, but UTF-16 and surrogates ruined that
If that was once its mission, it was clearly abandoned long ago. They rejected Klingon characters on the grounds that it has low usage for communication, and that many of the people who do communicate in Klingon use a latinized form.
ꙮ seems to just be a fancy way of writing О. I haven't seen anything that says it has a different meaning. The arguments for excluding Klingon seem to apply even more so to ꙮ.
If you look through the old mailing list postings, the oft-left-implicit problem with Klingon (as well as Tengwar, Everson’s [EDIT: misspelling] pet project) is that it may get people into legal trouble (even though in a reasonable world it shouldn’t be able to). So in the unofficial CSUR / UCSUR they remain.
A weird solitary character from the 1400s isn’t subject to that, and even if it’s a mistake it’s probably not worth breaking compatibility at this point (I think the last such break with code points genuinely changing meanings was to repair a mistaken CJK unification some time in the 00s, and the Consortium may even have tied its own hands in that regard with the ever-more-strict stability policies).
Similarly, for example, old ISO keyboard symbols (the ⌫ for erase backwards, but also a ton of virtually unused ones) were thrown in indiscriminately at the beginning of the project when attempting to cover every existing encoding, but when the ISO decided to extend the repertoire they were told to kindly provide examples of running-text (not iconic) usage in a non-member-body-controlled publication. (Crickets. The ISO keyboard input model itself only vaguely corresponds to how input methods for QWERTY-adjacent keyboards work in existing systems—as an attempt at rationalization, it seems to mostly be a failed one.)
[EDIT: Removed a section about the now-fixed typo]
> I think the last such break with code points genuinely changing meanings was to repair a mistaken CJK unification some time in the 00s, and the Consortium may even have tied its own hands in that regard with the ever-more-strict stability policies[.]
Not exactly: the last break happened between Unicode 1.1 and 2.0, and the new CJK Unified Ideographs Extension A block still contains unified characters. The main reason for the break was that both Hangul and CJK(V) ideographs required tons of additional code points, and it became clear that the 16-bit code space was dangerously insufficient; by 1.1 there was only a single big block of unassigned code points from U+A000 to U+E7FF (18,432 total), while 2.0 added 4,516 new Hangul and 6,582 new CJK(V) ideographs (11,098 total).
Unless it's legitimately someone's native tongue, conlangs shouldn't be in Unicode. If there are kids out there who are native Klingon speakers, then you can make the argument it should be included.
I think it makes way more sense to put a conlang in Unicode than it does a peculiar stylistic flourish only ever applied once to a single letter in a single document. If that belongs in Unicode, why not every bit of marginalia ever doodled and every uniquely adorned drop cap / initial letter?
“A linguist has revealed he talked only in Klingon to his son for the first three years of his life to find out if he could learn to speak the 'language'.
[…]
Now 13, Speers' son does not speak Klingon at all.”
I see that near it, there is an ef (Ф) with a very tall stem.
Why should that not be included as a standard unicode character? Surely it is used more often than the multiocular o.
You may say "it's a decorative flourish", which is of course true, but so is the multiocular o. Should we allow every conceivable decorative flourish into unicode? What is the standard for where flourishes become distinct characters?
Today, I wrote a document by hand containing a new symbol that only looks like genitalia if you squint really hard. Where do I apply to have it included in unicode so that it can be digitized properly?
Rule-lawyering wise-asses try to mess with many policies. It's rarely a sensible indictment of a policy, nor is it very effective. Anyone dealing with such people just ignores them.
For as inclusive as that mission is, it seems weird to me how limited in certain areas unicode is. For instance, people use peach emoji since there isn't one for butt, eggplant since there's no penis, etc.
This doesn't contradict the stated goal exactly, but it seems against the spirit of it at least.
One could argue that emoji should have never been added to Unicode in the first place. Peaches and butts are images, pictures, illustrations, whatever - but they are not characters. There's no writing system which has a colored drawing of a peach as a character.
Yes there is - a widely used character set (when Unicode talks about "writing systems" it explicitly includes all the computer character sets used in practice pre-Unicode) used by Japanese 'featurephones' had emoji characters, so in order to be able to include that character set in Unicode, Unicode had to add emoji.
They're sort of neither. The peach emoji will render differently on iOS, Android, Windows. And I'm sure emoji-replacement packs are possible on Windows and Android (even though it's also guaranteed to be a virus).
So a peach emoji is not the same thing as the iOS peach-emoji-image. Similar to how changing my font doesn't change the actual characters.
I don't think including emojis was a great idea, but now that it's happened and people everywhere use them, emoji have become characters. I agree with your point, but it's already happened and so now there's not really any going back.
But that doesn't change the fact that most people use them and like them, and there is not much technical disruption. They just chose practicality over purity.
Not only that - people use them in textual communication the way letters traditionally are used. There is probably a better argument for emoji than for a lot of other things in Unicode (but it is a slippery slope).
That wouldn't be practical. It would make fonts too big, and videos aren't a thing that goes inline in text.
However, I could totally see some kind of open source GIF library of a few hundred meme videos and pictures, to standardize the "Reply with a GIF" thing in some P2P chat ecosystem, and maybe it could have a new URL scheme for referring to OpenMemes images.
I tried to reply with just a unicode penis but that got flagged immediately, so I'll be more substantial and leave out the actual penis. It appears in Egyptian hieroglyphs, so actually there is a penis included in unicode.
That's true, good call. I feel like there should be one without the Egyptian-hieroglyph context, though I'm not exactly sure how that kind of thing works in Unicode.
> For instance, people use peach emoji since there isn't one for butt, eggplant since there's no penis, etc.
Personally I think there should be, actually. There's all these other body parts but these are left out. Emoji is almost becoming a language and the good thing is that everyone can understand them, regardless of language. For example I could imagine these could be very useful in an international medical setting. Or for sexting, obviously, we can pretend that's not a thing but that's a bit too Victorian for me.
Of course they're not appropriate in some settings, but neither are many words.
I REALLY don't like that emojis are beholden to companies. For example, when the emoji for a gun was changed from a pistol to a squirtgun on many platforms, it changed the meaning of its use by a lot. You could argue that it is a good thing, but I see it as a pretty bad direction to go into.
Unicode doesn't have a character for every illuminated initial, nor should it. I'm not clear on why this character should be considered any differently.
Wow, this is probably the most actually useful and interesting comment in this whole discussion, thanks! For anyone interested, the most relevant quotes from the document are in particular:
"This document requests the addition of a number of Cyrillic characters to be added to the UCS. It also requests clarification in the Unicode Standard of four existing characters. This is a large proposal. While all of the characters are either Cyrillic characters (plus a couple which are used with the Cyrillic script), they are used by different communities. Some are used for non-Slavic minority languages and others are used for early Slavic philology and linguistics, while others are used in more recent ecclesiastical contexts. We considered the possibility of dividing the proposal into several proposals, but since this proposal involves changes to glyphs in the main Cyrillic block, adds a character to the main Cyrillic block, adds 16 characters to the Cyrillic Supplement block, adds 10 characters to the new Cyrillic Extended-A block currently under ballot, creates two entirely new Cyrillic blocks with 55 and 26 characters respectively, as well as adding two characters to the Supplementary Punctuation block, it seemed best for reviewers to keep everything together in one document.
(...)
MONOCULAR O Ꙩꙩ, BINOCULAR O Ꙫꙫ, DOUBLE MONOCULAR O Ꙭꙭ, and MULTIOCULAR O ꙮ are used in words which are based on the root for ‘eye’. The first is used when the wordform is singular, as ꙩкꙩ; the second and third are used in the root for ‘eye’ when the wordform is dual, as ꙫчи, ꙭчи; and the last in the epithet ‘many-eyed’ as in серафими многоꙮчитїй ‘many-eyed seraphim’. It has no upper-case form. See Figures 34, 41, 42, 55."
Because it's already been added to unicode. Now it's not a question of whether or not to add, rather to remove, and unicode almost by definition does not remove.
Meanwhile one still can't roundtrip regular Japanese without some kind of funky out-of-band signalling. By itself this kind of thing is harmless, but it speaks to poor prioritization from Unicode.
This is incorrect. I think you defined round-trip as something else, but character set A providing round-trip compatibility with another set B means that B can be converted to A and back to B without loss. And it is one of Unicode's explicit goals to provide round-trip compatibility with major encodings, including Japanese ones.
Han unification only means that when you convert Japanese encodings (B) to Unicode (A), it is not distinguishable from non-Japanese encodings converted to Unicode. This means that the Unicode text doesn't always follow domestic conventions without out-of-band signaling or IVD or so. But if you know that the text was converted from a particular encoding, you can perfectly recover the original text encoded in that encoding.
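To make that definition concrete, here is a minimal Python sketch; any Shift_JIS-encodable text works the same way:

```python
# "Round-trip" in the character-set sense: legacy bytes -> Unicode ->
# legacy bytes recovers the original exactly, even though the Unicode
# string itself no longer records which encoding it came from.
original = "骨".encode("shift_jis")        # legacy Shift_JIS bytes
as_unicode = original.decode("shift_jis")  # convert into Unicode
back = as_unicode.encode("shift_jis")      # convert back out
assert back == original                    # lossless round trip
```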
By that logic any 8-bit encoding is round-trip compatible with all encodings, since however bad the mojibake is, if you know what the original encoding was then you can always just convert back to that.
To be fair, they wanted to keep everything representable in 16 bits, and that wasn't going to happen without Han unification. The mess when everything still had to move to a 32-bit representation has been far reaching: many programming languages went from exposing code points atomically as "char" to exposing a half-encoded nonsense value that just happens to also be a valid standalone value in UTF-16 most of the time, and a source of bugs when you least expect it.
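The "half-encoded nonsense value" being described is a UTF-16 code unit. A quick Python illustration (Python's strings are whole code points, so it can show the surrogate pair from the outside):

```python
# Characters above U+FFFF occupy two 16-bit code units (a surrogate
# pair) in UTF-16. Languages whose "char" is a UTF-16 code unit (Java,
# JavaScript, C#) therefore see two chars for one code point.
clef = "\U0001D11E"                      # U+1D11E MUSICAL SYMBOL G CLEF
units = clef.encode("utf-16-be")
assert len(clef) == 1                    # one code point...
assert len(units) == 4                   # ...but two 16-bit units
hi = int.from_bytes(units[:2], "big")
lo = int.from_bytes(units[2:], "big")
assert 0xD800 <= hi <= 0xDBFF            # high surrogate
assert 0xDC00 <= lo <= 0xDFFF            # low surrogate
```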
"Han Unification" - in Unicode many Japanese characters are represented as Chinese characters that look different (and subjectively ugly). The Unicode consortium's answer is that you're supposed to use a different font or something when displaying Japanese, which is pretty unsatisfying (e.g. if you want to have a block of text that contains both Japanese and Chinese, you can't represent that as just a Unicode string, it has to be some kind of rope of segments with their own fonts, at which point frankly you might as well just go back to bytes-with-encoding which at least breaks very clearly and visibly if you get it wrong).
The thing is, this is just a decorative way to write “o”. It’s not a specific letter by any definition.
I can’t speak for the other letters that were added in the same batch in 2007. Some of them seem meaningful, I dunno, I don’t speak Old Church Slavonic (although I am told it sounds like Croatian, which I understand a little).
> so what harm is there in one codepoint no one will ever need?
Font bloat (do you want a font with 1 million characters in it? I don't. Do you want to have to install 1,000 fonts of 1,000 characters each to be sure to cover the whole Unicode table? I don't).
Lots of issues for everyday programmers (how do you handle weird Unicode characters in your validation code?), potentially leading to security issues (bypassing validation rules with close-but-different characters, phishing…).
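A minimal, hypothetical sketch of the validation point in Python (`naive_ok` is invented for the example): a naive "letters only" filter happily accepts a Cyrillic lookalike, which is exactly the phishing vector described above.

```python
import unicodedata

def naive_ok(s):
    # str.isalpha() accepts ANY Unicode letter, not just ASCII,
    # so it is no defense against lookalike characters
    return s.isalpha()

spoof = "payp\u0430l"                    # U+0430, renders just like 'a'
assert spoof != "paypal"                 # different code points...
assert naive_ok(spoof)                   # ...but the naive filter passes it
assert unicodedata.name(spoof[4]) == "CYRILLIC SMALL LETTER A"
```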
The artist Prince changed his stage name to an unpronounceable symbol for a few years. It appears in more than one document. Should it be added to Unicode?
Isn’t there an entire Unicode block for the symbols on the Phaistos disc? Yes: https://en.wikipedia.org/wiki/Phaistos_Disc_(Unicode_block) . I suppose those occur in quite a few documents about the disc, even though the disc itself is the only known document written in those symbols.
One historical use in the 1400s doesn't merit a character and never did
One known and surviving use. It is possible that it exists in other places, since the vast majority of the planet's written work has not been digitized. It may also have been used other places that have not survived.
Just because it's not important to you does not mean it is not important.
The fact that it survived for 600 years makes it interesting and worth saving. It is infinitely unlikely that anything you do, write, or say will last that long.
Sure it's possible, but there should be a higher bar than "it's possible it's used more than once" for meriting inclusion in the standard keyboard of billions of devices worldwide.
The thing is, looking at the page, there are many other characters that were not added - the large red С-looking characters, for example. But for some "bizarre" reason, those were not included in Unicode...
Of course, the simple answer is that Unicode actually includes any character that someone cares enough to ask to be added, with rare exceptions.
While the origin of 彁 will never be certain, there is a good chance that it came from a misinterpretation of 彊 [1]. Why isn't this an accepted theory, though? Because it is still possible that 彁 really did appear in some reference source used during standardization, and neither that source nor a source where 彊 looks like 彁 has been found.
idk. When the word Planet was redefined such that Pluto was no longer a planet, it kind of ruined the word Planet. It suddenly wasn’t nearly as useful a word as it used to be (even though now it has a precise meaning). For most people who use the word, it won’t matter (and is actually rather exciting) that they keep discovering new planets in our solar system.
If they’d treat the word characters the same way, it would only serve to confuse and do no favors to the remaining glyphs.
This is temporary though; soon people will look at you funny if you say that Pluto is a planet, and/or they might not even have heard of it (though of course it is still worth learning about in a History of Science context).
We do NOT keep discovering new planets, rather minor planets (I agree that the term is confusing); more than a million of them have been discovered in the Solar System now, like 9007 James Bond.
It could go either way, it is not always that the scientific meaning wins out, especially not when even scientists don’t find the new definition useful.
When I think of a planet, I think of a world that has active geology that isn’t a moon (I know excluding moons is arbitrary, and perhaps I shouldn’t do that; but hey, that’s language for you). I honestly don’t care about the orbit, and I bet that when most people think about planets they aren’t thinking about the orbit either, let alone whether the planet has cleared the orbit or not. I doubt that will change.
Not just that, but whether or not Mars is still geologically active is an open question. If you admit planets on the basis of a history of geological activity, then Ceres is a planet too.
I don’t think anybody considers geological activity as particularly useful for classifying things as ‘planet’ or ‘not planet’.
Why shouldn’t Ceres be a planet? If Pluto gets to be a planet then Ceres is definitely a planet.
But there is still active geology on Mars. There is still moisture, winds and ice-caps that are shaping the environment. I consider that to be geologically active.
EDIT: And there are actual experts which consider active geology (or something similar) to be a planet, including Anton Petrov (https://www.youtube.com/watch?v=8-2HxrgqUnM)
Okay, but then you have to go and figure out which other asteroid and kuiper belt objects are planets.
The 'dwarf planet' distinction helps solve this! There are planets - distinctive in that they have clear orbits - and there are dwarf planets, which can be part of belt systems. This is a useful distinction.
Sure it is, but the distinction between terrestrial planets and gas giants are also useful, that doesn’t mean the latter aren’t planets.
I think it is fine that there are more planets than we can meaningfully count. Loads of things in our language act like that. E.g. a bug can be any number of things, and you know what a bug is just by talking about it. If some insect society then comes up with a meaningful definition of bugs which excludes spiders, that definition isn’t really doing the average user of that word any favors.
Yeah, probably, strictly speaking... But I’m not a planetary scientist. I’m merely a user of language, and I don’t need to be rigorous in my definitions. And to me the weather patterns on Jupiter are interesting enough a feature to count as geology (even though strictly it probably isn’t geology).
Theoretically, UTF-8 can encode up to 31 bits (U+7FFF'FFFF)[0], but for compatibility with UTF-16's surrogates it's officially capped to 21 bits, with the max being U+10'FFFF[1]. That decision was made in November 2003, so there are two decades of software written with a hard cap of U+10'FFFF.
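You can watch the cap bite from Python, which enforces the post-2003 limit itself:

```python
# Code points stop at U+10FFFF; chr() refuses anything past it.
assert chr(0x10FFFF) == "\U0010FFFF"       # highest valid code point
try:
    chr(0x110000)                          # one past the cap
    raise AssertionError("should have been rejected")
except ValueError:
    pass                                   # rejected, as expected

# Within the cap, UTF-8 never needs more than 4 bytes per code point;
# the original 31-bit scheme would have gone up to 6.
for ch, nbytes in [("\x7f", 1), ("\u0080", 2), ("\u0800", 3), ("\U00010000", 4)]:
    assert len(ch.encode("utf-8")) == nbytes
```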
Yes, but this is a change either way, because that codepoint's definition referred to that character. Either the reference or the description of the appearance has to change.
Make a new character. Updating the existing character ruins the meaning of all previous usages.
It's like trying to change an API. Don't disrespect your existing users. Make a new version.
(ꙮ ͜ʖꙮ)
Think of all the ASCII art this botches. That has to have some historical importance to the Unicode standards body.
(⌐ꙮ_ꙮ)
For scholarly digital (unprinted) documents where the correct character rendering matters, erroneous past usages can be trivially found with grep and a date search, and easily corrected. The domain experts will familiarize themselves with this issue and fix the problem. Don't take a shotgun to it!
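A rough sketch of what that curation pass could look like, in Python; everything here (the directory layout, the `*.txt` glob, the function name) is hypothetical:

```python
# Hypothetical sketch: walk a corpus and flag files containing the
# multiocular O, so a domain expert can review each occurrence by hand.
from pathlib import Path

MULTIOCULAR_O = "\ua66e"  # U+A66E CYRILLIC LETTER MULTIOCULAR O

def files_containing(root, pattern="*.txt"):
    """Yield paths under `root` whose text contains the character."""
    for path in Path(root).rglob(pattern):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if MULTIOCULAR_O in text:
            yield path
```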
This message wꙮn't have the ꙮriginally intended meaning if the characters are updated from underneath.