With such a high throughput because of sparsity, I'm particularly interested in distilling it into other architectures. I'd like to try a recurrent transformer when I have the time.
I think what I want to do is have a dodgy local LLM that detects when the user is speaking to the main LLM, and then enables it for 20 minutes or so.
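A minimal sketch of that gating idea, assuming a small local classifier (the `is_addressed_to_assistant` function here is a hypothetical stand-in for it, reduced to a keyword check):

```python
import time

ENABLE_WINDOW_S = 20 * 60  # stay enabled for ~20 minutes after a trigger


def is_addressed_to_assistant(utterance: str) -> bool:
    # Placeholder for the small local model; a trivial keyword
    # check stands in for a real classifier here.
    return "assistant" in utterance.lower()


class AttentionGate:
    """Decides whether the main LLM should be active, based on a cheap
    detector that flags utterances addressed to it."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock          # injectable for testing
        self.enabled_until = 0.0    # timestamp when the window closes

    def observe(self, utterance: str) -> bool:
        now = self.clock()
        if is_addressed_to_assistant(utterance):
            # Reset the window on every directed utterance.
            self.enabled_until = now + ENABLE_WINDOW_S
        return now < self.enabled_until
```

Once triggered, the gate stays open for the full window even for unrelated chatter, so follow-up turns don't each need to re-address the model.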