Has anyone experimented with mixing outputs from LLMs on a per-token basis?
I.e., easy tokens can be provided by a cheap-to-run model, and hard tokens by an expensive-to-run model?
A model could be used to decide when it is worth running the expensive model, based on the inputs, the output so far, and the probability distribution of the cheap model's output.
For example, "Q: If I have 3 bananas and eat none, then how many bananas do I have?"
"A: You would have 3 bananas left, since you started with 3 and didn't eat any"
The "3" would come from the big model, while the rest all came from a small model.
Clever idea. I think you would have to recompute the context (i.e., re-encode the prior tokens) every time you swapped models, because the weight distributions would be different for each model. Going from big->small might make this overhead worth it, but going back from small->big would assuredly be very costly.
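A back-of-envelope cost model makes the swap overhead concrete. The per-token costs below are made-up numbers, and the key assumption is the one stated above: each swap forces re-encoding the whole prefix with the new model, since the two models' internal states aren't interchangeable:

```python
C_SMALL = 1.0   # hypothetical cost per token, small model
C_BIG = 10.0    # hypothetical cost per token, big model

def cost_big_only(n):
    """Baseline: decode all n tokens with the big model."""
    return n * C_BIG

def cost_mixed(n, hard_positions):
    """Decode n tokens, using the big model only at hard_positions.
    Assumption: switching models re-encodes the prefix at the new
    model's per-token rate (no shared KV cache)."""
    total = 0.0
    current = "small"
    for i in range(n):
        wanted = "big" if i in hard_positions else "small"
        rate = C_BIG if wanted == "big" else C_SMALL
        if wanted != current:
            total += i * rate  # re-encode prefix of length i
            current = wanted
        total += rate          # decode this token
    return total

if __name__ == "__main__":
    # One hard token at position 5 in a 20-token answer.
    print(cost_mixed(20, {5}), "vs big-only:", cost_big_only(20))
```

Under these toy numbers, a single small->big swap early in the sequence still beats running the big model throughout, but the re-encoding term grows linearly with the prefix, so frequent or late swaps erase the savings.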