Has anyone experimented with mixing outputs from LLMs on a per-token basis?
I.e., easy tokens can be provided by a cheap-to-run model, and hard tokens by an expensive-to-run model?
A model could be used to decide when it is worth running the expensive model, based on the inputs, the output so far, and the probability distribution of the cheap model's output.
For example, "Q: If I have 3 bananas and eat none, then how many bananas do I have?"
"A: You would have 3 bananas left, since you started with 3 and didn't eat any"
The "3" would come from the big model, while the rest all came from a small model.
Clever idea. I think you would have to recompute the context (i.e., re-encode the prior tokens) every time you swapped models, because the weight distributions would be different for each model. Going from big->small might make this overhead worth it, but going back from small->big would assuredly be very costly.
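A back-of-envelope cost model makes the swap overhead concrete. The per-token costs below are made-up numbers, and the key assumption is the one stated above: each swap forces re-encoding the whole prefix with the new model, since the two models' internal states aren't interchangeable:

```python
C_SMALL = 1.0   # hypothetical cost per token, small model
C_BIG = 10.0    # hypothetical cost per token, big model

def cost_big_only(n):
    """Baseline: decode all n tokens with the big model."""
    return n * C_BIG

def cost_mixed(n, hard_positions):
    """Decode n tokens, using the big model only at hard_positions.
    Assumption: switching models re-encodes the prefix at the new
    model's per-token rate (no shared KV cache)."""
    total = 0.0
    current = "small"
    for i in range(n):
        wanted = "big" if i in hard_positions else "small"
        rate = C_BIG if wanted == "big" else C_SMALL
        if wanted != current:
            total += i * rate  # re-encode prefix of length i
            current = wanted
        total += rate          # decode this token
    return total

if __name__ == "__main__":
    # One hard token at position 5 in a 20-token answer.
    print(cost_mixed(20, {5}), "vs big-only:", cost_big_only(20))
```

Under these toy numbers, a single small->big swap early in the sequence still beats running the big model throughout, but the re-encoding term grows linearly with the prefix, so frequent or late swaps erase the savings.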