
Hi!

Here is some information about romanisation of Cantonese if you are interested:

The romanisation of Cantonese has an interesting history! The Yale romanisation system [1] is (IMO) the most readable. It was later refined into Jyutping [2], a system used in more academic contexts, which IMO is less readable (both are available in GBoard as Cantonese input methods). However, most personal and place names in HK use an older system [3] developed in the 1880s by Christian missionaries.

When people romanise Cantonese in casual text chats on instant messaging or social media platforms, it's usually a mix of two systems [1, 3], rarely [2], and without the tone information (so lots of many-to-one mappings). It's also mixed in with bits of English, which makes it hard to understand (even for a local Hong Kong person) without good prior context of the entire conversation.
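A minimal Python sketch of the many-to-one problem, assuming Jyutping-style spellings where the tone is a trailing digit from 1 to 6. The six characters below are a common textbook set that differ only in tone; the helper names `strip_tone` and `group_by_toneless` are hypothetical, made up for this illustration.

```python
import re
from collections import defaultdict

# Jyutping readings for six characters that differ only in tone.
# (Illustrative sample, not a complete dictionary.)
JYUTPING = {
    "詩": "si1",
    "史": "si2",
    "試": "si3",
    "時": "si4",
    "市": "si5",
    "事": "si6",
}


def strip_tone(syllable: str) -> str:
    """Drop the trailing tone digit, as casual chat romanisation tends to do."""
    return re.sub(r"[1-6]$", "", syllable)


def group_by_toneless(readings: dict) -> dict:
    """Group characters by their toneless spelling to expose collisions."""
    groups = defaultdict(list)
    for char, syllable in readings.items():
        groups[strip_tone(syllable)].append(char)
    return dict(groups)


collisions = group_by_toneless(JYUTPING)
for spelling, chars in collisions.items():
    print(spelling, "->", " ".join(chars))
```

With tones removed, all six characters collapse onto the single spelling "si", so a reader has to rely on context to pick the intended one.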

[1] https://en.wikipedia.org/wiki/Yale_romanization_of_Cantonese

[2] https://en.wikipedia.org/wiki/Jyutping

[3] https://en.wikipedia.org/wiki/Standard_Romanization_(Cantone...




Somewhat off topic: any chance you know why Google doesn't have an explicitly Cantonese model for translation?


Not a Googler, so I can only guess. It seems Google did try to treat Cantonese as a Chinese variant in the past, but eventually dropped that approach, probably because they realised the two are too different.

I know Google is actively working on the Cantonese version of Google Assistant, though not sure when it'll be officially released.


It is a variant of Chinese though. Chinese is a language family, not a single language -- it includes Mandarin, Cantonese, Hakka, et al.


Whatever it is, Cantonese has different pronunciation, vocabulary and even grammar from Mandarin, which means it takes a non-trivial amount of work to adapt a language model designed for one to the other.

Source: I'm a native speaker of one and fully fluent in the other.


AFAIK Google needs a multilingual corpus, so if Cantonese is mostly written using Chinese characters, the corpus will be in Chinese characters.

And if written Cantonese is mostly informal (conversation, shop signs), it will not often be multilingual, so the approach that has worked for most languages wouldn't work here.

And it surely wouldn't work for a completely different, lossy orthography without independent training.



