I skimmed the paper but I couldn't figure out what they're doing to make concepts fundamentally different from tokens.
I would think that the purpose of concepts is to capture information at a higher density than tokens, so you can remember a longer conversation or better produce long-form output.
Given that, I would have expected that during training, the concept model would be scored partly on how few concepts it emits before emitting a stop.
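To be concrete about the kind of objective I mean (this is my own sketch, not anything from the paper — the function name, the `lam` weight, and the shape of the inputs are all invented for illustration): score each candidate encoding by its reconstruction error plus a penalty proportional to how many concepts it emitted before the stop.

```python
def length_penalized_loss(recon_errors, lam=0.1):
    """Hypothetical training objective: average per-concept reconstruction
    error plus a penalty on the number of concepts emitted before <stop>.
    Lower is better, so denser encodings win when quality is comparable."""
    n = len(recon_errors)                 # concepts emitted before the stop
    recon = sum(recon_errors) / max(n, 1) # mean reconstruction error
    return recon + lam * n                # trade off quality vs. density

# Two hypothetical encodings of the same passage, equal per-concept quality:
dense = length_penalized_loss([0.2, 0.2])            # 2 concepts
sparse = length_penalized_loss([0.2, 0.2, 0.2, 0.2]) # 4 concepts
assert dense < sparse  # fewer concepts is rewarded, as I'd have expected
```

Under an objective like this, a model that packs the same content into fewer concepts scores strictly better, which is what would make concepts genuinely denser than tokens rather than just tokens by another name.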