Hacker News

I suspect that the holy grail here is figuring out how to break the input into a sequence of morphemes and non-morpheme lexical units.


What do you mean by non-morpheme lexical units? Syntactic particles, units too small to be morphemes? Lexical items that contain multiple morphemes?

In any case, isn't this something we already do well?


Punctuation, for example.

And no: at least for the languages with which I'm familiar, SOTA tokenizers tend to capture only the easy cases.

For example, the GPT-4 tokenizer breaks the first sentence of your post like so:

  What/ do/ you/ mean/ by/ non/-m/orp/heme/ lexical/ units/?
Notice how "morpheme" gets broken into three tokens, and none of them matches "morpheme"'s two morphemes. "Lexical" and "units" are each a single token, when they have three and two morphemes respectively.

Or in French, the word "cafetière" gets chopped willy-nilly into "c/afet/ière". The canonical breakdown is "cafe/t/ière".
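The mismatch above falls out of how subword tokenizers work: they greedily match the longest entries in a frequency-derived vocabulary, with no notion of morpheme boundaries. A minimal sketch, with invented vocabularies (this is not the real GPT-4 tokenizer, whose vocabulary comes from BPE training on a large corpus):

```python
def segment(word, vocab):
    """Greedily take the longest vocab entry at each position."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

# Hypothetical subword vocab: frequent character chunks, not
# linguistic units.
subword_vocab = {"non", "m", "orp", "heme", "lex", "ical", "units"}

# Hypothetical morpheme lexicon a linguist might write instead.
morpheme_vocab = {"morph", "eme", "unit", "s"}

print(segment("morpheme", subword_vocab))   # ['m', 'orp', 'heme']
print(segment("morpheme", morpheme_vocab))  # ['morph', 'eme']
```

Both runs use the same greedy algorithm; only the vocabulary differs. That's the point of the thread: the "holy grail" is a vocabulary (or segmentation objective) aligned with morphemes, not a different matching procedure.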



