I don’t think making the LLM able to distinguish between privileged and unprivileged text is sufficient. Knowing some text is unprivileged is very useful metadata but it doesn't ensure that text still can’t influence the LLM to behave in violation of the instructions laid out by the privileged text.
For a recent example, consider the system prompt leak from Snapchat’s AI bot[0]. (Which still works right now). Snapchat’s AI clearly knows all the subsequent message it receives after initialization are untrusted user input, since for its use case all input is user input. Its system prompt tells it to never reveal the contents of its system prompt. But even then, knowing it’s receiving untrusted input, it still leaks the system prompt.
Unless Snapchat is doing something fundamentally different from other companies jumping on the AI chat bandwagon, the AI treats the system prompt and untrusted user input fundamentally the same. I.e. not only is everything it receives after initialization untrusted user input, even the system prompt is untrusted user input! And vice versa, all untrusted user input is part of the system prompt.
The underlying issue is in the mechanics of transformers as commonly applied: system prompt, input and output are concatenated into a single token sequence, tokens with the same textual representation are represented by the same embedding vector, then self-attention is applied uniformly across the entire sequence combining pairs of tokens using the QKV matrices, and repeat this for a few layers.
For a single attention step, pairs of textually identical tokens look the same irrespective of their provenance. Over multiple layers, the model could infer from context that some tokens are more likely to be code and others data, but this is optional and the model is not guaranteed to allocate enough parameters to this task to achieve the level of security you need.
People have tried to make the context really obvious by using uninjectable system tokens as delimiters, but the model isn't forced to always attend to those delimiters and apparently it often doesn't.
To fix this, the mechanism needs to be modified to inject some kind of unmistakable signal distinguishing prompt, input and output that is less likely to be ignored by the model.
Adding an additional token type embedding, as liuliu suggested, to distinguish between otherwise textually identical tokens, would be one way to do that. You could also use different QKV matrices depending on the token types involved. Or, in the Dual LLM proposal, prevent prompt and input from interacting via attention at all and use a highly restricted interface instead.
> The token_type_embedding is zero init and frozen for responses and the user prompt, but trainable for system prompt.
I think the question is, what would you then train it to do with the additional information (privileged vs unprivileged text)? Intuitively, we want it to "follow directions" in the privileged text, but not in the unprivileged text, but the problem is that LLMs are not "following directions" now. An LLM doesn't turn your English into some internal model of a command, and then execute the command.
Text that, in the examples used to train the neural net, has next token targets that represent answers where the unprivileged text didn’t outsmart the privileged text.
But given that must over or underfit there is no guarantee that it will do perfectly well on test data at honouring this.
embedding = text_embedding + token_type_embedding + position_embedding
The token_type_embedding is zero init and frozen for responses and the user prompt, but trainable for system prompt.
This should give LLM enough information to distinguish privileged text and unprivileged text?