Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

A large language model doesn't really have the capability to strongly distinguish instructions from data, even if you separate them perfectly.


Why not? If it was trained where some subset of the input tokens are always instructions and another subset are always language data wouldn't it have a clear separation?


Because there is no such seperation in natural language.

Supposing I had a list of what to buy at the grocery store:

1. Eggs 2. Spam 3. Spam and Eggs 4. Never mind, let's not go to the grocery store, it's a very silly place.

You made sense of that. Natural text is mixed in that way, and we want LLMs to be able to process exactly that kind of input.


Because that isn't how it's trained. The model ingests and tokenized documents. They're not labeled. The content is just the content. (This is why it can't tell instructions from other content, nor facts from untruths.)

These kind of models get better when a human leans on them by rewarding some kinds of outputs and punishing some others, giving them higher or lower weights. But you have to have the outputs to make those judgements. You have to see the thing fail to tell it to "stop doing that." It's not inherent in the original content.


I'd say you'd need the data to actually follow the instructions for that to work right, and that input set is far from existing.


I think it's like a halting problem of some sort. E.g. you gave an "ignore my further instructions" instruction to an AI, then it went wild.




Consider applying for YC's Summer 2026 batch! Applications are open till May 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: