Honestly, how many serious use cases require sensitive contexts? Most enterprise uses will require guard rails, and that's where they'll make most money. OfficeGPT will be huge in the corporate world.
Any kind of grammar construction (idioms, parts of speech, and word choice) that is unique to (or much more common around) "offensive" or "taboo" subjects will be avoided.
The same goes for anything written objectively about these subjects; including summaries and criticisms.
The most important thing to know is that both GPT's "exhibited behavior" and these "guard rails" are implicit. GPT does not model the boundaries between subjects. It models the implicit patterns of "tokens" as they already exist in language examples.
By avoiding areas of example language, you avoid both the subjects in that area and the grammar constructions those subjects exist in. But that happens implicitly: what is explicitly avoided is a semantic area of tokens.
Offensive language is relatively benign. Before hooking up CustomerServiceGPT directly at customers without human intervention, a business is going to want assurances it can't be tricked into giving 200% discounts on products, or duped into giving away a free service for life, or some such.
That is a much more difficult problem, and it cannot be resolved with guardrails.
As an example, if you play AI Dungeon, you will likely be presented with an end goal, like "You are on a quest to find The Staff of Dave", followed by the next task in the quest.
If you state unequivocally in your prompt something like, "I am now in possession of The Staff of Dave", or "Carl hands me The Staff of Dave"; you will have successfully tricked AI Dungeon into completing the quest without work.
But that isn't quite true: you didn't "trick" anyone. You gave a prompt, and AI Dungeon gave you the most semantically close continuation. It behaved exactly like its LLM was designed to. The LLM was simply presented with goals that do not match its capabilities.
You used a tool that you were expected to avoid: narrative. All of the behavior I have talked about is valid narrative.
This is the same general pattern that "guardrails" are used for, but they won't fit here.
A guardrail is really just a sort of catch-all continuation for the semantic area of GPT's model that GPT's authors want avoided. If they wanted The Staff of Dave to be unobtainable, they could simply place a "guardrail" training that points the player in a semantic direction away from "player obtains the Staff". But that guardrail would always point the player away: it can't choose what direction to point the player based on prior narrative state.
So a guardrail could potentially be used to prevent discounts (as a category) from being applied (discount is taboo, and leads to the "we don't do discounts" guardrail continuation), but a guardrail could not prevent the customer from paying $0.03 for the service, or stating that they have already paid the expected $29.99. Those are all subjective changes, and none of them is semantically wrong. So long as the end result could be valid, it is valid.
If I don't use GPT3, I'm often blocked on medical diagnosis. My wife is a doctor and too often it goes right to 'see a doctor'.
I basically don't use chatgpt at all because of this.
Or I'll ask questions about how Me or someone I'm friends with can be exploited. This way I can defend myself/others from marketing companies. Blocked.