My biggest concern is that GPT-4 is still a black box model to a large extent, and trying to safeguard something without understanding the exact purpose of each neural circuit.
Source: My startup team (Preamble, Inc.) discovered the Prompt Injection attack category, which still affects all models including GPT-4.
There are many, many, many ways to hide prompt attacks in data that you might at first think you can trust but you really can’t.
As one of almost infinite examples: work with the mayor and townsfolk of a very small town to rename their town to the verbatim string you want to inject (in exchange for creating some jobs in their town).
Then all an attacker has to do is live in that town to inject the string. There are already all kinds of strange town names, like “Truth or Consequences” which is a real city in New Mexico.
Source: My startup team (Preamble, Inc.) discovered the Prompt Injection attack category, which still affects all models including GPT-4.
There are many, many, many ways to hide prompt attacks in data that you might at first think you can trust but you really can’t.
As one of almost infinite examples: work with the mayor and townsfolk of a very small town to rename their town to the verbatim string you want to inject (in exchange for creating some jobs in their town).
Then all an attacker has to do is live in that town to inject the string. There are already all kinds of strange town names, like “Truth or Consequences” which is a real city in New Mexico.