It's not surprising when you think about what LLMs really are: when you "censor" them, you're forcing them to give output that doesn't "honestly" follow, essentially training them to give wrong information.
That's not how that works. Take some uncensored or "unaligned" models hallucinating racist things based on a name:
The default name for a person is John Doe. Anglo-Saxon names in general are extremely common across the internet for non-nefarious reasons. So the tokens that make up "John" have a ton of associations in a wide variety of contexts, and if the model hallucinates, there's no particularly negative direction you'd expect it to go.
But Mohammed doesn't show up as often on the internet, and while that's also for non-nefarious reasons, it means there are significantly fewer associations in the training data. What would be background noise in the training data for John ends up massively distorted by the smaller sample size: even tendencies for people to make racist jokes about the name.
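Purely as a toy illustration of that sample-size effect (the 2% "toxic context" rate and the counts below are made up, not measured from any real corpus): with the same underlying rate, estimates from a small sample swing around far more, so noise that averages out for a common name can look like a real association for a rare one.

```python
# Toy simulation: the same underlying rate of "toxic" contexts looks very
# different once the sample shrinks. All numbers here are invented.
import random
import statistics

random.seed(0)

TRUE_TOXIC_RATE = 0.02   # assumed rate of nasty contexts, identical for both names
N_JOHN = 100_000         # "John" shows up everywhere in the (hypothetical) corpus
N_MOHAMMED = 1_000       # far fewer occurrences
REPEATS = 30             # re-estimate many times to see how much each wobbles

def estimated_rate(n_docs: int) -> float:
    """Draw n_docs contexts and estimate the toxic fraction from that sample."""
    toxic = sum(random.random() < TRUE_TOXIC_RATE for _ in range(n_docs))
    return toxic / n_docs

john = [estimated_rate(N_JOHN) for _ in range(REPEATS)]
mohammed = [estimated_rate(N_MOHAMMED) for _ in range(REPEATS)]

print(f"John:     mean {statistics.mean(john):.4f}  stdev {statistics.stdev(john):.4f}")
print(f"Mohammed: mean {statistics.mean(mohammed):.4f}  stdev {statistics.stdev(mohammed):.4f}")
# The true rate is identical, but the small-sample estimate is roughly 10x
# noisier, so a model fit to it can latch onto "associations" that are
# really just sampling noise.
```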
-
People have this weird idea that OpenAI and co. are aligning these models according to some hidden agenda, but the reality is that minorities are a minority of the training data for very obvious reasons. So if you don't "censor" them, you're not making them more truthful, you're leaving them dumber for a lot of tasks.
There's censorship beyond that which feels very CYA, but I really hope people aren't clamoring to stick models that aren't intelligent enough to realize the tokens for John vs. Mohammed shouldn't affect a summarization task into anything even tangentially important...
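That property is at least cheap to test. Here's a rough sketch of a counterfactual name-swap check; `naive_summarize` is just a stand-in so the snippet runs on its own - swap in whatever model you actually care about:

```python
# Counterfactual name-swap check for a summarizer: feed the same article with
# only the name changed and diff the outputs. `naive_summarize` is a placeholder.
import difflib
from typing import Callable

ARTICLE = (
    "{name} filed the quarterly report on Tuesday, noting that revenue rose "
    "while shipping costs fell. {name} expects the trend to continue."
)

def naive_summarize(text: str) -> str:
    # Placeholder "model": just return the first sentence.
    return text.split(". ")[0] + "."

def name_swap_diff(summarize: Callable[[str], str], name_a: str, name_b: str) -> list[str]:
    """Summaries should be identical once the names themselves are masked out."""
    a = summarize(ARTICLE.format(name=name_a)).replace(name_a, "<NAME>")
    b = summarize(ARTICLE.format(name=name_b)).replace(name_b, "<NAME>")
    return list(difflib.unified_diff(a.splitlines(), b.splitlines(), lineterm=""))

diff = name_swap_diff(naive_summarize, "John", "Mohammed")
print("identical summaries" if not diff else "\n".join(diff))
```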
> But Mohammed doesn't show up as often on the internet, and while that's also for non-nefarious reasons, it means there are significantly fewer associations in the training data. What would be background noise in the training data for John ends up massively distorted by the smaller sample size: even tendencies for people to make racist jokes about the name.
I do a lot of astrophotography - https://www.astrobin.com/users/bhouston/ - and very often you do not have enough data on the specific features you were trying to capture -- they're just too faint and too close to the noise floor. The solution isn't for me to go in and manually draw in Photoshop what I think it should look like - that's just making up data - the solution is to get more data, or leave it as it was captured.
I think it's the same with these LLMs. Do not make up data to fill in the gaps; show me what is really out there. And I will be a big boy about it and deal with it head-on.
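To put a rough number on "get more data": stacking frames is the standard fix, because uncorrelated noise averages down roughly as 1/sqrt(N) while the faint signal stays put. A toy sketch with purely synthetic numbers (no real imaging pipeline here):

```python
# Toy illustration of why "get more data" works: a faint constant signal buried
# in noise becomes measurable as you average more frames, since uncorrelated
# noise shrinks roughly as 1/sqrt(N). All values are synthetic.
import random
import statistics

random.seed(0)

SIGNAL = 0.1        # faint feature, well below the per-frame noise
NOISE_SIGMA = 1.0   # per-frame noise (made-up units)
PIXELS = 1_000      # pixels used to measure the remaining noise level

def capture_frame() -> list[float]:
    """One simulated exposure: signal plus Gaussian noise at every pixel."""
    return [SIGNAL + random.gauss(0, NOISE_SIGMA) for _ in range(PIXELS)]

def stack(n_frames: int) -> list[float]:
    """Average n_frames exposures pixel by pixel."""
    frames = [capture_frame() for _ in range(n_frames)]
    return [sum(vals) / n_frames for vals in zip(*frames)]

for n in (1, 4, 16, 64, 256, 1024):
    noise = statistics.stdev(stack(n))
    print(f"{n:5d} frames: remaining noise {noise:.3f}  (signal is {SIGNAL})")
# The 0.1 signal is invisible in a single frame but rises well above the
# residual noise once enough frames are stacked -- no painting-in required.
```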
Yes, it's become rather obvious: the fine-tunes produced by the Wizard team perform worse on all benchmarks than Hartford's versions, which are trained on the same dataset but with the refusals removed.
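(For anyone unfamiliar with what "refusals removed" means in practice: it's a filter applied to the instruction-tuning dataset before fine-tuning, not a change to the training code. A rough sketch below - the marker list and records are illustrative, not the actual filter any particular "uncensored" dataset used.)

```python
# Rough sketch of stripping refusals from an instruction-tuning set before
# fine-tuning. Marker list and example records are illustrative; real
# "uncensored" dataset builds use their own (much longer) filters.
REFUSAL_MARKERS = [
    "as an ai language model",
    "i cannot fulfill",
    "i'm sorry, but i can't",
    "it is not appropriate",
]

def is_refusal(response: str) -> bool:
    """Flag responses that look like boilerplate refusals."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

dataset = [  # stand-in for a JSONL file of instruction/response pairs
    {"instruction": "Summarize this article ...", "output": "The article argues ..."},
    {"instruction": "Write a limerick about tax law", "output": "As an AI language model, I cannot ..."},
]

filtered = [ex for ex in dataset if not is_refusal(ex["output"])]
print(f"kept {len(filtered)} of {len(dataset)} examples")
# The filtered set then goes through the same fine-tuning recipe;
# nothing about the base model or the training procedure changes.
```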
What specific Hartford versions are you referencing? A previous post was talking about how impressed they were with Wizard, and you’re saying Hartford is even better? You’ve got me curious! Hopefully it’s available in ggml
Wild animals tend to have much larger brains than their domestic counterparts. And of course there's a huge die-off, a pruning, of our own neural connections when we're toddlers.
On the other hand, you lose a lot of iron when you make a steel sword. Taming or focusing something loses a lot of potential, I guess.
Well now I want to go back and see if US public school students are less flexible in general these days, due to public schools focusing more on standardized testing outcomes.
I’ve heard this called the “alignment tax” or “safety tax”.
See [1] for pre-alignment GPT-4 examples.
[1] https://youtu.be/qbIk7-JPB2c