Automatically Detecting Under-Trained Tokens in Large Language Models (arxiv.org)
182 points by veryluckyxyz on May 12, 2024 | 26 comments



Good Computerphile video on glitch tokens a year ago:

https://www.youtube.com/watch?v=WO2X3oZEJOA


This video somehow looks even more interesting than the pre-print of the article


It describes the problem in a nicer way tbh. I haven't yet read the preprint but the video is neat.


We shouldn't just be looking for under trained tokens. Tokens are effectively the first layer of the network, but we should also be looking for training data imbalances at every weight at every other layer of the network.

When we find them, it might be best to delete weights with hardly any data flowing through them (which might make the model smaller or help generalisation).
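A minimal sketch of that idea, assuming PyTorch and a toy two-layer MLP (the layer sizes, threshold, and random data are placeholders, and real structured-pruning tools are far more careful than this): track mean absolute activation per hidden unit over a data sample, then zero the weights of units that hardly ever fire.

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
    acts = torch.zeros(128)  # accumulated |activation| per hidden unit

    def hook(_module, _inp, out):
        acts.add_(out.abs().mean(dim=0))

    handle = model[1].register_forward_hook(hook)  # watch post-ReLU output
    with torch.no_grad():
        for _ in range(100):                # stand-in for a real data loader
            model(torch.randn(32, 64))
    handle.remove()

    dead = acts < 0.01 * acts.mean()        # units with hardly any data flow
    with torch.no_grad():
        model[0].weight[dead] = 0.0         # rows that produce the dead units
        model[0].bias[dead] = 0.0
        model[2].weight[:, dead] = 0.0      # columns that read from them
    print(f"zeroed {int(dead.sum())} of 128 hidden units")

(With random weights and random inputs almost nothing will look dead; on a trained model with real data the distribution of `acts` is typically much more skewed.)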


I believe model distillation does this. SparseGPT was a big one, managing to remove 50% of parameters without losing much accuracy IIRC. I saw a more recent paper citing the SparseGPT one that managed around 70-80% sparsity, pretty impressive stuff.


> delete weights with hardly any data flowing through them

Isn't that the idea behind sparse networks?


We can already compress and/or merge holomorphic models.


I find it hard to believe that a Canadian company's model contained an undertrained token related to hockey (albeit in German). In all seriousness, this is pretty cool and I'm excited to see understanding of tokenization impacts on models improve. One notable finding is that a lot of the earlier open source models have issues with carriage returns, which are fairly commonly introduced depending on where the data is coming from.


There is a random matrix theory derived diagnostic of training that relies on the spectral density of the correlation matrix of the weights. Each layer's spectral density is fit to a truncated power law, and deemed properly trained if the power law exponent alpha is just above two.

https://jmlr.org/beta/papers/v22/20-410.html
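A rough sketch of that diagnostic, assuming NumPy (the fixed tail fraction and the simple Hill-style estimator are my simplifications; the paper fits a truncated power law much more carefully, e.g. with the `powerlaw` package):

    import numpy as np

    def layer_alpha(W, tail_frac=0.5):
        # eigenvalue spectrum of the correlation matrix X = W^T W / N
        N = W.shape[0]
        evals = np.linalg.eigvalsh(W.T @ W / N)
        evals = np.sort(evals[evals > 1e-12])
        # crude tail fit: Hill/MLE estimator over the largest eigenvalues
        k = max(2, int(len(evals) * tail_frac))
        tail = evals[-k:]
        xmin = tail[0]
        return 1.0 + k / np.sum(np.log(tail / xmin))

    rng = np.random.default_rng(0)
    W = rng.standard_normal((1024, 512))   # stand-in for a real weight matrix
    print(layer_alpha(W))

Run per layer on a real model and compare the fitted alpha against the "just above two" rule of thumb described above.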


Isn't the solution to just train the tokeniser on the same corpus as the LLM? I'm not sure why reusing tokenisers is so common. Anybody know?


On top of what everyone else has said, even if you are able to train your tokenizer on exactly your training dataset it wouldn't remove all these issues.

The way BPE works you can end up with very rare tokens if they get merged with another token. Imagine you have tokens X and Y, and it happens that almost every X is followed by Y. Then the BPE process would make a new token XY but wouldn't remove the old token which would now be undertrained.

I guess to solve this we'd need to use a more sophisticated merging algorithm than the greedy one.
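Here's a toy illustration of that effect in plain Python (not a real BPE trainer; the corpus and the single learned merge are made up): after the merge "q" + "u" -> "qu", the standalone "q" token stays in the vocabulary but almost never fires.

    from collections import Counter

    corpus = ["queen", "quick", "quote", "iraqi"]  # "q" without "u" is rare
    merges = [("q", "u")]                          # learned merge; "q" stays in vocab

    def bpe_encode(word, merges):
        toks = list(word)
        for a, b in merges:
            i = 0
            while i < len(toks) - 1:
                if toks[i] == a and toks[i + 1] == b:
                    toks[i:i + 2] = [a + b]
                else:
                    i += 1
        return toks

    counts = Counter(t for w in corpus for t in bpe_encode(w, merges))
    print(counts)  # "qu" appears 3 times, standalone "q" only once

Scale that up to a full vocabulary and you get exactly the undertrained intermediate tokens being discussed.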


There are two reasons I can think of why someone might reuse a tokeniser:

1. They want to continue pretraining a model instead of starting from scratch. But actually people might not know that you can pretty easily reuse model weights even when training with a new tokeniser (I’ve got a blog post on how to do that: https://umarbutler.com/how-to-reuse-model-weights-when-train... ).

2. Because it’s convenient for end users. Tokenising and chunking really large corpora can take a long time and it’s nice that I can use the GPT2 tokeniser and then train a bunch of different models on that data without having to retokenise everything.
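A minimal sketch of point 2, assuming the Hugging Face transformers library and an illustrative corpus.txt: tokenise once with the GPT-2 tokenizer, cache the ids, and reuse them for every model that shares that vocabulary.

    import numpy as np
    from transformers import GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")

    with open("corpus.txt") as f:
        ids = tok(f.read())["input_ids"]

    # GPT-2's vocab fits in uint16, so the cached ids stay compact
    np.save("corpus_gpt2_ids.npy", np.array(ids, dtype=np.uint16))

    # later, for any model trained with the GPT-2 vocabulary:
    ids = np.load("corpus_gpt2_ids.npy")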


From the abstract I get the feeling these techniques are useful when you don’t have access to the corpus, as e.g. in the case where you download some open source weights but the corpus is secret. Otherwise I don’t understand why you wouldn’t just compute a histogram over the tokens in (a statistical sample of) the corpus.
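For the with-corpus case, the histogram really is only a few lines (a hedged sketch; `tokenizer` and `sample_docs` are placeholders for whatever tokeniser and corpus sample you have):

    from collections import Counter

    def undertrained_candidates(tokenizer, sample_docs, vocab_size, min_count=10):
        counts = Counter()
        for doc in sample_docs:
            counts.update(tokenizer.encode(doc))
        # ids that never (or barely) occur in the sample are candidates
        return [i for i in range(vocab_size) if counts[i] < min_count]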


The paper mentions some reasons why these quick-fix ideas are not as simple as they sound. For example, many rare tokens are "intermediate" merges inside the BPE algorithm, shorter fragments of longer words. The long word is common, but its earlier, intermediate merge rarely appears by itself.


Are there any specific reasons for using BPE, not Unigram, in LLMs? I've been trying to understand the impact of the tokenization algorithm, and Unigram was reported to be a better alternative (e.g., Byte Pair Encoding is Suboptimal for Language Model Pretraining: https://arxiv.org/abs/2004.03720). I understand that the unigram training process should eliminate under-trained tokens if trained on the same data as the LLM itself.


> open source weights but the corpus is secret

This is oxymoronic; the corpus is the "source". Yet this usage of "open source" is widespread. Maybe we should start calling such models by their rightful name, "freeware".


Freeware versus open source is a good point. But freeware typically can't be modified by the recipient, whereas downloadable models and open source code can. So I think there's still a need for a different term, neither open source nor freeware...


I would argue that the kind of modification you can do to a big blob of weights is more akin to fiddling with a binary in a hex editor than modifying source code. It is not the "preferred form" for the source, and you cannot cleanly and easily do things like modify its "alignment" - that is why people speak of "jailbreaking" these models. So I still think "freeware" works as a term.


No, the corpus is not the source. It's data. So we can have concepts of open models, open source, and open data. Any combination of these can be chosen independently.

(Open data and open model but not open source is a bit weird, but not unthinkable: there may be unreleased training tricks or specialized infrastructure such that the source code release is hard or undesirable.)


I think people usually start out wanting to use the same corpus for their tokenizer and for the LLM, but after training the tokenizer and while testing the LLM they discover that parts of the corpus are useless garbage (no offense to SolidGoldMagikarp's efforts on the counting subreddit), so those get excluded from further training. By that point, though, the tokenizer has become part of the API, and replacing it with a new version would break other things, so the superfluous tokens stay in the tokenizer vocabulary.


Sure, but if your corpus is very large, that's not feasible.


Tokenizer training doesn't scale as well as model training, so general practice is to train on a subset of the full corpus.
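For example, with the Hugging Face tokenizers library you might train a 32k BPE vocabulary on a few sampled shards rather than the whole corpus (the file names, vocab size, and whitespace pre-tokenizer here are just placeholders):

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()

    trainer = BpeTrainer(vocab_size=32_000, special_tokens=["[UNK]"])
    # a handful of shards sampled uniformly from the full training corpus
    tokenizer.train(files=["shard_000.txt", "shard_017.txt"], trainer=trainer)
    tokenizer.save("bpe_32k.json")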


Why is that an issue? Training the tokenizer seems much more straightforward than training the model, since it is based on the statistics of the input data. I guess it may take a while for massive datasets, but is counting the token frequencies really infeasible at a larger scale?


I’ve trained tokenizers on medium-sized datasets (5+ GB of text, although that could be considered small or large depending on who you ask) and have always found training quite fast. As in, it takes a couple of minutes.

Maybe if we’re talking terabytes it might not scale as well but so far in my experience training tokenizers has never been an issue. It’s training models that takes ages.


Amazing name for the paper


Full title is: "Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models"



