
Co-author of the PathPiece paper here.

Regarding weighting the n-grams by length * frequency, I'm not sure that would actually be better. The SentencePiece unigram model does it that way (as I mentioned in another comment), and as a result unigram produces longer tokens on average. That is generally considered a bit of an issue with unigram, though, as with many things in tokenization, there isn't strong evidence either way.
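To make the tradeoff concrete, here is a minimal, purely illustrative sketch (not code from PathPiece or SentencePiece; the toy corpus and function names are my own) showing how ranking candidate n-grams by length * frequency instead of raw frequency pushes longer substrings up the list, which is the mechanism behind unigram's longer average tokens:

    from collections import Counter

    def rank_candidates(corpus: str, max_len: int = 8):
        # Count every byte n-gram up to max_len in the (toy) corpus.
        counts = Counter()
        data = corpus.encode("utf-8")
        for n in range(1, max_len + 1):
            for i in range(len(data) - n + 1):
                counts[data[i:i + n]] += 1
        # Two rankings: raw frequency vs. length-weighted frequency.
        by_freq = sorted(counts, key=lambda g: counts[g], reverse=True)
        by_len_freq = sorted(counts, key=lambda g: len(g) * counts[g], reverse=True)
        return by_freq, by_len_freq

    if __name__ == "__main__":
        by_freq, by_len_freq = rank_candidates("the cat sat on the mat and the cat sat again")
        print(by_freq[:5])      # short, very common substrings dominate
        print(by_len_freq[:5])  # longer substrings are pulled toward the top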

Why do you think 2^18 initial n-grams is too few? That's 5.3 times more than the largest vocab we train.



I think the ideal number of initial n-grams is one large enough that adding more has no effect on the output, because I don't expect to be very good at tuning two different knobs.
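One way to read "large enough" operationally: keep doubling the initial n-gram count until the trained vocabulary stops changing. A hedged sketch, where train_fn is a hypothetical stand-in for whatever training entry point is actually used:

    from typing import Callable, Iterable

    def initial_ngrams_converged(train_fn: Callable[[int], Iterable[bytes]],
                                 counts=(2**16, 2**17, 2**18, 2**19)) -> bool:
        # train_fn(num_initial_ngrams) is assumed to return the trained
        # vocabulary for a fixed corpus and fixed target vocab size.
        vocabs = [frozenset(train_fn(n_init)) for n_init in counts]
        # "Large enough" here means the last doubling left the output unchanged.
        return vocabs[-1] == vocabs[-2]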



