I think people usually start out wanting to use the same corpus for the tokenizer and for the LLM. But after training the tokenizer, while testing the LLM, they discover that parts of the corpus are useless garbage (no offense to SolidGoldMagikarp's efforts on the counting subreddit), so those parts get excluded from further training. By that point, though, the tokenizer has become part of the API, and replacing it with a new version would break other things, so the superfluous tokens stay in the tokenizer vocabulary even though the model now gets essentially no training signal for them.
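
A minimal sketch of what that mismatch looks like in practice, assuming the Hugging Face transformers package and the public "gpt2" tokenizer as a stand-in for a frozen tokenizer, and a toy list standing in for the filtered training corpus: encode the filtered data with the frozen tokenizer and flag vocabulary ids that never show up.

```python
from collections import Counter

from transformers import AutoTokenizer

# Stand-in for a tokenizer trained on the original, unfiltered corpus and now frozen.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Stand-in for the corpus *after* the garbage subsets were dropped.
filtered_corpus = [
    "A few documents that survived the cleanup.",
    "They no longer contain the strings the removed subreddit was full of.",
]

# Count token ids as they occur in the data the LLM will actually be trained on.
id_counts = Counter()
for text in filtered_corpus:
    id_counts.update(tokenizer.encode(text))

# Any vocabulary entry that never appears is a candidate "glitch" token:
# it is still part of the API surface, but it gets no gradient signal in training.
unseen = [tok for tok, idx in tokenizer.get_vocab().items() if id_counts[idx] == 0]
print(f"{len(unseen)} of {len(tokenizer.get_vocab())} tokens never occur in the filtered corpus")
```

With a real filtered corpus instead of the toy list, the `unseen` set is roughly the population those leftover tokens come from.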