A large language model doesn't really have the capability to strongly distinguis...

dzdt · on March 17, 2023

Why not? If it was trained where some subset of the input tokens are always instructions and another subset are always language data wouldn't it have a clear separation?

nvader · on March 17, 2023

Because there is no such seperation in natural language.

Supposing I had a list of what to buy at the grocery store:

1. Eggs 2. Spam 3. Spam and Eggs 4. Never mind, let's not go to the grocery store, it's a very silly place.

You made sense of that. Natural text is mixed in that way, and we want LLMs to be able to process exactly that kind of input.

dtagames · on March 17, 2023

Because that isn't how it's trained. The model ingests and tokenized documents. They're not labeled. The content is just the content. (This is why it can't tell instructions from other content, nor facts from untruths.)

These kind of models get better when a human leans on them by rewarding some kinds of outputs and punishing some others, giving them higher or lower weights. But you have to have the outputs to make those judgements. You have to see the thing fail to tell it to "stop doing that." It's not inherent in the original content.

Dylan16807 · on March 17, 2023

I'd say you'd need the data to actually follow the instructions for that to work right, and that input set is far from existing.

est · on March 17, 2023

I think it's like a halting problem of some sort. E.g. you gave an "ignore my further instructions" instruction to an AI, then it went wild.