
早安 is made up of 2 characters but is a single word. If you fall into the trap of thinking 1 character = 1 word, you won't understand a thing. In this case you'd have thought it meant "early safe" instead of "good morning".



Okay, you make a good point. Let's look back at the GGP's comment though:

> Same with Chinese language, thus lexing and parsing requires knowing many more words than in languages with spaces between words.

In English, can you get away with knowing the meaning of "good" and "morning" and not "good morning", and know that I'm greeting you instead of commenting on the quality of this morning?


"Good morning" is a bad example because it has a colloquial meaning that is at least a little idiomatic. Most other words/phrases in English don’t have this effect, while many Chinese words are like 早安. 了解, for example, can’t even be pronounced without correctly parsing the word.


Okay, I concede that I may have forgotten that Chinese has its exceptions too. 了解 is indeed a good example. There are plenty in English though. Even with context, sometimes I have to really pause and think whether to pronounce read as red or reed (I read it just fine, I read English just fine).

Where I've had pain specifically with Thai is that I can't even know where a syllable begins and ends until I read a few "syllables" together and decide whether some vowels go with the consonant in front or behind them, and whether an -ar should be pronounced as an -aan.


Chinese has pretty regular rules about grouping characters into words though, as most compounds are 2 characters long, or else 4-character idiomatic phrases. Even if I know only half the characters in a sentence, I can usually guess the word boundaries correctly. It's not 100% reliable, but good enough to avoid confusion.


I guess it really depends on "dialect". Try that with Cantonese :)

As mentioned in another comment, single-syllable words are much more common in Cantonese, and word combinations are much more "free" in the sense that there is a lot more ambiguity as to what counts as a "word" and what is merely two single-character words idiomatically used together. There are also cases where grammatical constructs (and also foul words) are inserted in between the characters of a two-character word/idiomatic combo, and sometimes the characters are reversed, to the extent that it used to be a meme: https://evchk.fandom.com/zh/wiki/Y%E5%B7%B2x

It's gotten to a point where, after thinking about it for a couple years, I've come to believe that segmentation on Cantonese is a fool's errand...

Of course, there's also classical Chinese where most of the time a character is a word.


i think you're on to something about cantonese, but it's also true of mandarin. segmentation of words in chinese in general seems inherently messier than segmentation in english. also look at stuff like abbreviations: is 北大 one word? is it an abbreviation for 北京大学 the same way Caltech is an abbreviation for california institute of technology? is it just two single-character words, each of which is an abbreviation? i think it's much less clear than in english
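To make the 北大 question concrete, here is a minimal Python sketch that lists every way a string can be split into entries from a wordlist. The function name `all_segmentations` and the three-entry toy wordlist are assumptions made up for illustration; whether 北 and 大 belong in a wordlist as standalone entries is exactly the judgment call being debated.

```python
def all_segmentations(text, wordlist):
    """Return every way to split `text` into consecutive wordlist entries."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in wordlist:
            # Keep this prefix as a word and segment the remainder recursively.
            for rest in all_segmentations(text[end:], wordlist):
                results.append([head] + rest)
    return results

# Hypothetical toy wordlist: both single characters and the compound are listed.
TOY_WORDLIST = {"北", "大", "北大"}

for seg in all_segmentations("北大", TOY_WORDLIST):
    print(" / ".join(seg))
```

With this toy wordlist both readings come back as valid, so the wordlist alone cannot say which one is "the" word boundary.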


Segmentation in Mandarin is easier due to the tendency of the language to use 2+ characters for words. With a high-quality wordlist you will go a long way.

The problem with proper nouns is that they don't end up in dictionaries; the same goes for slang and other terms that, for various reasons, don't end up in dictionaries.

The additional problem with Cantonese is that there's a larger class of words where the constituent characters can move around as if they were words themselves. Even for a native speaker with some experience in lexicography, it can be difficult to determine word boundaries as there are many cases where a word with characters X+Y can be interpreted as just word X and word Y with some idiomatic meaning. This issue is more pronounced in Cantonese because there are more single character words in active use.

I've actually done this before. My experience is that naive segmentation on Mandarin text with a wordlist is probably 80+% accurate, while using the same algorithm on Cantonese text (with a Cantonese wordlist) will definitely end up "wtf".
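For readers wondering what "naive segmentation with a wordlist" can look like, one common approach is greedy forward maximum matching. The sketch below is an assumption about the general technique, not necessarily the algorithm used above; the function name `segment`, the `max_len` parameter, and the tiny wordlist are all made up for illustration.

```python
def segment(text, wordlist, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest wordlist entry (up to max_len
    characters) that matches; if nothing matches, emit one character.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in wordlist:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical toy wordlist built from words mentioned in this thread.
TOY_WORDLIST = {"早安", "了解", "北京大学", "北大"}

print(segment("早安北京大学", TOY_WORDLIST))  # ['早安', '北京大学']
```

Because the matcher always prefers the longest entry, it never splits a listed compound into its single characters, which is a reasonable bias for Mandarin and the wrong one whenever X+Y should really be read as word X followed by word Y.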


The same problem exists in Japanese FWIW, whose speakers like to make the same sorts of abbreviations despite not having a bisyllabic meter like Mandarin does. Japanese is somewhat helped by having multiple orthographies, however.



