Hacker News

I suspect that the holy grail here is figuring out how to break the input into a sequence of morphemes and non-morpheme lexical units.


What do you mean by non-morpheme lexical units? Syntactic particles, units too small to be morphemes? Lexical items that contain multiple morphemes?

In any case, isn't this something we already do well?


Punctuation, for example.

And no: at least for the languages with which I'm familiar, SOTA tokenizers tend to capture only the easy cases.

For example, the GPT-4 tokenizer breaks the first sentence of your post like so:

  What/ do/ you/ mean/ by/ non/-m/orp/heme/ lexical/ units/?
Notice how "morpheme" gets broken into three tokens, and none of them matches "morpheme"'s two morphemes. "Lexical" and "units" are each a single token, when they have three and two morphemes respectively.

Or in French, the word "cafetière" gets chopped willy-nilly into "c/afet/ière". The canonical breakdown is "cafe/t/ière".
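The mismatch above falls out of how subword tokenizers work: they greedily match the longest entries in a frequency-derived vocabulary, with no notion of morpheme boundaries. A minimal sketch, with invented vocabularies (this is not the real GPT-4 tokenizer, whose vocabulary comes from BPE training on a large corpus):

```python
def segment(word, vocab):
    """Greedily take the longest vocab entry at each position."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

# Hypothetical subword vocab: frequent character chunks, not
# linguistic units.
subword_vocab = {"non", "m", "orp", "heme", "lex", "ical", "units"}

# Hypothetical morpheme lexicon a linguist might write instead.
morpheme_vocab = {"morph", "eme", "unit", "s"}

print(segment("morpheme", subword_vocab))   # ['m', 'orp', 'heme']
print(segment("morpheme", morpheme_vocab))  # ['morph', 'eme']
```

Both runs use the same greedy algorithm; only the vocabulary differs. That's the point of the thread: the "holy grail" is a vocabulary (or segmentation objective) aligned with morphemes, not a different matching procedure.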



