That's a good point. It may be possible to directly upload predicate-type info into an LLM. This could be especially useful if you need to encode tabular data.
Somewhere, someone probably read this and is thinking about how to export Excel or databases to an LLM.
It's encouraging to see people looking inside the black box successfully.
The other big result in this area was that paper which found a representation of a game board inside an LLM after it had been trained to play the game. Any other good results in that area?
The authors point out that LLMs are doing more than encoding predicate-type info. That's just part of what they are doing.
The opposite is also exciting: build a loss function that punishes models for storing knowledge. One of the issues with current models is that they seem to favor lookup over reasoning. If we could punish models during training for remembering, that might push them toward becoming better at inference and logic instead.
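To make that a bit more concrete, here is a minimal, hypothetical PyTorch sketch of what such a penalty could look like: the usual language-modeling cross-entropy plus an auxiliary term that discourages over-confidence on tokens flagged as pure factual recall. The `recall_mask` labeling and the weight `lam` are assumptions made purely for illustration, not anything from this thread or a known training recipe.

```python
import torch.nn.functional as F

def lm_loss_with_recall_penalty(logits, targets, recall_mask, lam=0.1):
    """Hypothetical 'punish memorization' loss sketch.

    logits:      [batch, seq, vocab] model outputs
    targets:     [batch, seq] next-token ids
    recall_mask: [batch, seq] bool mask marking tokens assumed to be pure
                 factual recall (e.g. entity spans) -- an assumed label.
    """
    vocab = logits.size(-1)
    # Standard next-token cross-entropy.
    ce = F.cross_entropy(logits.view(-1, vocab), targets.view(-1))

    # Probability the model assigns to the correct token at each position.
    probs = F.softmax(logits, dim=-1)
    target_prob = probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Penalize high confidence on recall-only tokens.
    mask = recall_mask.float()
    penalty = (target_prob * mask).sum() / mask.sum().clamp(min=1)
    return ce + lam * penalty
```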
I believe it would add some spice to the model, but you shouldn't go too far in that direction. Any social system has a rule set, which has to be learnt and remembered, not inferred.
Two examples. (1) Grammar in natural languages. You can see how another commenter here uses "a local maxima", and then how people react to that. I didn't even notice, because English grammar has never been native to me. (2) Prepositions mostly don't have a direct mapping between two languages, no matter how close the languages are. The learner just has to remember them.
Interesting. Reminds me of a sci-fi short I read years ago where AIs "went insane" when they had too much knowledge: they'd spent too much time looking through data and got a buffer overflow.
I know some of the smaller models like Phi-2 are trained for reasoning specifically by training on question-answer sets, though that seems like the opposite approach to me.
It indeed is. An attention mechanism's key and value matrices grow linearly with context length. With PagedAttention[1], we could imagine an external service providing context. The hard part is the how, of course. We can't load our entire database into every conversation, and I suspect there are also challenges around training (perhaps addressed via LandmarkAttention[2]) and around building a service that efficiently retrieves additional key-value matrices.
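As a back-of-the-envelope illustration of that linear growth, here's a rough sketch using assumed Llama-7B-like dimensions (32 layers, 32 heads, head dimension 128, fp16); nothing here is specific to PagedAttention or the paper:

```python
# KV cache grows linearly with the number of tokens in context.
# Model dimensions below are assumptions (Llama-7B-like).
N_LAYERS, N_HEADS, HEAD_DIM = 32, 32, 128
BYTES_PER_ELEM = 2  # fp16

def kv_cache_bytes(context_len: int) -> int:
    # Two tensors per layer (K and V), each [n_heads, context_len, head_dim].
    return 2 * N_LAYERS * N_HEADS * context_len * HEAD_DIM * BYTES_PER_ELEM

for ctx in (1_000, 10_000, 100_000):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / 1e9:.2f} GB of KV cache")
```

That works out to roughly half a megabyte per token, which is why you can't just load an entire database in as key-value context and why paging blocks in and out starts to look attractive.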
The external vector-database service would need tight timing to avoid stalling the LLM: at 20 tokens/sec the per-token budget is about 50 ms, and at 50 tokens/sec it shrinks to about 20 ms, so answers must arrive within that window.
And we can't do this synchronously in real time: pausing the transformer when a layer produces a query vector stalls the whole batch. So we need a way to predict queries (or embeddings) several tokens ahead of where they'd be useful, inject the retrieved context when it's needed, and know when to page it out.
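Here's a hypothetical sketch of that "predict ahead, then inject" loop. `predict_future_query`, `kv_store.lookup`, and `decode_next` are made-up interfaces standing in for whatever the model, retrieval service, and decoder would actually expose; the only point is that lookups are issued LOOKAHEAD tokens early so retrieval latency hides behind generation.

```python
import asyncio

LOOKAHEAD = 8  # tokens of lead time given to the retrieval service (assumed)

async def generate_with_prefetch(model, kv_store, prompt, max_tokens=128):
    tokens = list(prompt)
    pending = {}  # step -> in-flight lookup task
    for step in range(max_tokens):
        # 1. Speculatively predict what context we'll want LOOKAHEAD tokens
        #    from now and start the lookup without blocking generation.
        query = model.predict_future_query(tokens, lookahead=LOOKAHEAD)
        pending[step + LOOKAHEAD] = asyncio.create_task(kv_store.lookup(query))

        # 2. If a prefetch issued earlier has matured for this step, inject it
        #    (e.g., as extra key/value blocks) before decoding the next token.
        task = pending.pop(step, None)
        extra_kv = await task if task is not None else None

        tokens.append(model.decode_next(tokens, extra_kv=extra_kv))

    # Cancel prefetches we never consumed (a crude "page it out").
    for task in pending.values():
        task.cancel()
    return tokens
```

The lookahead is the knob that trades prediction accuracy against the 20-50 ms per-token budget mentioned above: the further ahead you predict, the more latency you can hide, but the less likely the prefetched context is what you actually need.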