I think there is an issue with vocabulary size, though.
If I'm reading the OP correctly, it is proposing what Llama already does explicitly ("Notably, we split all numbers into individual digits"): splitting numbers into digits lets the model treat a number as a sequence of digits, so it can learn how arithmetic actually works.
This is especially important with very rare numbers. Take a number the GPT has never (or hardly ever) seen in its training data:
832408345770928764
The GPT-3 tokenizer tokenizes[1] that into:
83,24,08,345,7,709,287,64
To some degree this is forced to occur by the use of raw BPE encoding together with the vocabulary size (~50K tokens in the case of GPT-3).
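You can check this kind of thing yourself. Here's a minimal sketch using OpenAI's tiktoken library, assuming r50k_base is a reasonable stand-in for the GPT-3 tokenizer (the web tokenizer linked below may use a slightly different variant):

    # Inspect how a BPE tokenizer chops up a rare number,
    # versus Llama-style splitting into individual digits.
    # Assumes `pip install tiktoken`.
    import tiktoken

    enc = tiktoken.get_encoding("r50k_base")
    number = "832408345770928764"

    token_ids = enc.encode(number)
    pieces = [enc.decode([tid]) for tid in token_ids]
    print(pieces)        # multi-digit chunks rather than single digits
    print(list(number))  # Llama-style: one token per digit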
Now consider the string:
832408345770928764 + 37
The model presumably has learnt something like: "if all the tokens sit in this region of token-space (the 'numbers' region) and they are followed by a + sign, then don't just append the string; instead swap the last token for another one."
But of course that is insufficient here: it also needs to learn carrying rules so it can increment the next token along. As is speculated in https://arxiv.org/abs/2212.10559, there may be a relationship between the depth of the model and the length of the chained rules it can learn, and because of the number of multi-digit tokens involved, the rules it has learnt are unnecessarily complex and incomplete.
If these tokens were instead single digits, the rules would be much simpler, and the model might actually learn the real rules of arithmetic instead of the collection of semi-memorized cases it has at the moment.
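To make the contrast concrete, digit-wise addition with carrying is a tiny, purely local rule. A sketch (the helper name is mine, and this is just ordinary Python, not a claim about what the model does internally):

    def add_digitwise(a: str, b: str) -> str:
        # Walk from the least-significant digit; the only state passed
        # between steps is a single 0/1 carry flag.
        a, b = a[::-1], b[::-1]
        digits, carry = [], 0
        for i in range(max(len(a), len(b))):
            s = carry
            s += int(a[i]) if i < len(a) else 0
            s += int(b[i]) if i < len(b) else 0
            digits.append(str(s % 10))
            carry = s // 10
        if carry:
            digits.append(str(carry))
        return "".join(reversed(digits))

    print(add_digitwise("832408345770928764", "37"))  # 832408345770928801

With multi-digit tokens, the equivalent rule has to relate arbitrary chunks like "764" and "37" to "801", across token boundaries that don't line up, which is exactly the extra complexity described above.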
[1] https://platform.openai.com/tokenizer