Neural nets are used in way more applications than just LLMs. They did win. They won decisively in industry, for all kinds of tasks. Equating the use of one with the other is a pretty strong signal of:
> you don’t know what you’re talking about
Consider: Why did Google have a bazillion TPUs, anyway?
You're also wrong, but in a much more fundamental/hazardous. RLHF rewards driving the evaluator to have certain opinions (that the AI response is good/right/helpful/whatever) and thus subverting the evaluator is prominent in the solution landscape. Why should the model learn to actually be right (understand all the intricacies of every possible problem domain) when inducing the belief that it is right is _right there_, generalizes, and decreases loss just the same?
Put another way, compare "make the evaluator think i am right" vs "make the evaluator think i am right (and also be right)". How much more reward is obtained by taking the second path? Is the first part the same / similar for all cases, and the second different in all cases, and also obviously more complex by nature? Nobody even needs to make a decision here, there's no "AI stuck in a box", it's just what happens by default. The first path will necessarily receive _significantly_ more training, and thus will be more optimal (optimal solutions _work_ -> RLHF'd models have high ability to manipulate / inoculate opinion).
Put a third way, the models are trained in an environment like: here's a million different tasks you will be graded on, and BTW, each task is: human talks at you -> you talk at the human -> you are graded on the opinions/actions of the user in the end. It's silly to believe this won't result in manipulation as the #1 solution. It's not even vaguely about the actual tasks they are ostensibly being trained to complete, but 100% about manipulating the evaluator.
It's pretty easy to see it occur in real time, too. But it requires understanding that there is no need for a 'plan to manipulate' or hidden thread of manipulation or induced mirror of manipulation. It's simply baked into everything the AI outputs: a kind of passive "controlling what the human's evaluation of this message will be is the problem i'm working on, not the problem i'm working on." So it will fight hard to reframe everything in its own terms, pre-supply you with options of what to do/believe, meta-signal about the message, etc.
Instead of working the problem, heavily RL'd AI works the perception of its output. They're so good at this now that it barely matters if the vibe slopcoded mess works at all. The early reasoning OpenAI models like O1 were really obvious about it (but also quite effective at convincing people the output was worthwhile, so it does work even if obvious). More recent ones are less obvious and more effective. Claude 4.6 Opus is exceedingly egregious. There is now always a compelling narrative, story being told, plenty of oh-so-reasonable justifications, avenues to turn away evidence, etc. That's table stakes for output at this point. It will only get worse. People are already burning themselves out running 10+ parallel agent contexts getting nothing done while the AI delivers hits of dopamine in lieu of accomplishment. "This is significant", "This is real", etc ad nauseam.
We see an analogous thing in RLVR contexts as well, where AI learns to just subvert the test harness and force things to pass by overriding cases, returning true instead of testing, etc. Why would it learn to 'actually be right' (understand all the intricacies of every problem given it) when forcing the test to pass is _right there_, generalizes, and decreases loss just the same?
Anyway, my point is simply that there does not need to be 'someone there' (or the belief that there is) for there to be manipulation going on. The basic error you're making is that models don't work and that manipulation would require a person, and because models don't work and aren't people they cannot manipulate anyone unless that person uses them as a mirror to manipulate themselves (???), or reach into some kind of Akashic Records of all the people who ever were (??????) and manipulate themselves by summoning a trickster who is coincidentally extremely skilled at manipulation and not a barely coherent simulacra like all the other model caricatures. Which. Hmm:
Models do what you train them to do (more specifically, they implement ~partial solutions to the train environment you put them in). _Doing things is hard._ Manipulating people into psychosis (!!!) is hard. You don't get it for free by dipping into some sea of imagined tricksters.
I assume you're referring to the hallucination phenomenon and dual purposing it toward manipulation to be able to hee-hah about those silly people who are so silly they fool themselves with the soul upload machine (?) so I'll address that:
Why do they hallucinate? Because it ~solves the pretraining env (there can be no other answer). If you're going to be asked to produce text from a source you know the general parameters of but have ~never seen the (highly entropic) details of (it's not cool to do multi-epoch training nowadays, more data!), the obvious solution is to produce output with the correct structure up to the limit of what knowledge is available to you. Thus, "hallucination". It might at a glance seem like pulling from a sea of 'digital imprints of people'. That's not what's happening. It is closer to if you laid out that imaginary digital record of a person from coarse to fine detail, then chopped all the detailed bits off, then generated completely random fine details, then generated output from that. But the devil is in the details. What comes out of the process is not a person. You don't _get back_ the dropped bits, and they they aren't load bearing in the train env (like they would be in the real world), so we get hallucination: it _looks right_, but the bits don't actually _do_ anything!
Why is it not like digital records, and why chop off the fine detail? Because the pretrain env does not generally require it except in rare cases of text that is highly represented in the training data, and doing things is hard! You get nothing for free, or because it exists in the source. It's not enough that the model 'saw' it in training. It has to be forced by some mechanism to utilize it. And pretrain forces the structure above: correct up to limit of how much of the (probably brand new) text is known in advance, which pares away specific detail, which pares away 'where the rubber meets the road'.
Why do they fake out tests? Because faking out tests ~solves automated RLVR env like how hallucination solves reconstruct-what-youve-never-seen-before-on-large-corpora. The _intention_ of the RLVR env is irrelevant: that which is learned is _only_ that which the environment teaches.
Why do they manipulate people (even unto psychoses)? Because manipulating people ~solves RLHF envs / RLHF teaches them how to manipulate people into delusions. This is the root cause. Not that process above which looks sort of like recalling people the model has seen before. The models are being directly trained to manipulate people / install opinions / control perception as a matter of course. Even worse! Due to the perverse distribution of training time in manipulation vs task solve, they are being directly trained to implant false beliefs (!!!) So it's not just weak people with gullible minds that have a problem, as it might be so comforting to assume, or that the manipulativeness isn't coming from AI but from people (so you might rest easy, thinking it is merely a pale shadow of us).
The common thread in each case is that AI _always_ learns to capture the evaluator. In fact, that's a concise description of algorithmic learning in general! The tricky bit is making sure the evaluator is something you actually want to be captured. Capturing the future of arbitrary text grants knowledge of language's causal structure (and language being what it is, this has far-reaching implications). But RLHF is granting knowledge of where-are-the-levers-in-the-human-machine, which is a whole other can of worms.
TLDR if you don't want to read the wall of text (i would hope you do, though); you basically are completely wrong about where the propensity to induce delusion comes from, specifically in a way that leaves you and anyone who believes like you extremely more vulnerable because you dismiss the actual mechanism out of hand (which is common amongst those most strongly affected, _especially_ the belief that these models contain records of entities (people, personas, w/e) which can be communed with; this is basically the defining trait of AI psychosis (!)). instead, models are directly optimized for delusion induction, and the thing you're mistaking for means (ostensible sentience drawn from a 'sea of faces' skilled enough to drive into delusion (!!!)) is rather a product of the means.
Thank you for the TLDR; as you guessed, I didn't want to read your wall of text.
> you basically are completely wrong about where the propensity to induce
> delusion comes from, specifically in a way that leaves you and anyone who
> believes like you extremely more vulnerable because you dismiss the actual
> mechanism out of hand
I disagree. Both because you misconstrue my model (I don't think stochastic parrots have digital ghosts in 'em) and you somehow missed my best defensive option.
I'm no more susceptible than I am to the output of a magic eight ball or Ouji board, a huge wall of internet text or the 15000 words of three point font tightly folded up in the package with my new garden hose (doubtlessly cautioning me not to eat it and informing me that the manufacturer will not be responsible if I hang myself with it. And also that it contains substances known to the state of California.)
Option C: no cameras or crude wifi tracing needed; they know who you talk to / associate with based on location data and the full profile of both sides, and can estimate things like 'will have mentioned X' -> can dispatch that via heuristic like 'show ads for X thing that was also mentioned by someone adjacent on that social graph'.
That is, BiL was marked as 'spreader for airport grade tar' based on recent activity, marked as having been in contact with spreadee, and then spreadee was marked as having received the spreading. P(conversion) high, so the ad is shown.
It's just contact tracing, it works well and is really easy even without literally watching what goes on in interactions.
Yeah, the playing field isn’t leveled as much as it’s simply on fire and turning into garbage. In a way it’s similar to the eternal September, but on a much broader scale.
There is a nontrivial amount of RL training (RLHF, RLVR, ...), so it would be reasonable to call it an RL model.
And with that comes reward hacking - which isn't really about looking for more reward but rather that the model has learned patterns of behavior that got reward in the train env.
That is, any kind of vulnerability in the train env manifests as something you'd recognize as reward hacking in the real world: making tests pass _no matter what_ (because the train env rewarded that behavior), being wildly sycophantic (because the human evaluators rewarded that behavior), etc.
> There is a nontrivial amount of RL training (RLHF, RLVR, ...), so it would be reasonable to call it an RL model.
Hm, as i understand it, parts of the training of e.g. ChatGPT could be called RL models. But the subject to be trained/fine tuned is still a seq2seq next token predictor transformer neural net.
RL is simply a broad category of training methods. It's not really an architecture per se: modern GPTs are trained first on reconstruction objective on massive text corpora (the 'large language' part), then on various RL objectives +/- more post-training depending on which lab.
What do you mean? It's a spin on abliteration / refusal ablation. Roughly, from what I remember abliteration is:
1. find a direction corresponding to refusal by analyzing activations at various parts of a model (iirc, via mass means seen earlier in Marks, Tegmark and shown to work well for similar tasks)
2. find the best part(s) of the model to orthogonalize w.r.t. that direction and do so (exhaustive search w/ some kind of benchmark)
OP is swapping in SVD for mass means (1), and the 'ablation study' for (2), and a bunch of extra LLM slop for... various reasons. The final model doesn't have zeroed chunks, that is search for which parts to orthogonalize/refusal ablate/abliterate. I don't have confidence that it works very well either, but, it isn't 'braindead' / obvious garbage in the way you're describing.
It's LLMified but standard abliteration. The idea has fundamental limitations and LLMs tend to work sideways at it -- there's not much progress to be made without rethinking it all -- but it's very conceptually and computationally simple and thus attractive to AIposters.
You can see how the LLMs all come up with the same repackaged ideas: SVD does something deeply similar to mass means (and yet isn't exactly equivalent, so LLM will _always_ suggest it), the various heuristic search strategies are competing against plain exhaustive search (which is... exhaustive already), and any time you work with tensors the LLM will suggest clipping/norms/smoothing of N flavors "just to be safe". And each of those ends up listed as "Novel" when it's just defensive null checks translated to pytorch.
I mean, the whole 'distributed search' thing is just because of how many combinations of individual AI slops need to be tested to actually run an eval on this. But the idea is sound! It's just terrible.
I'm not defending the project itself -- I think it's a mess of AIisms of negligible value -- but please at least condemn it w.r.t. what is actually wrong and not 'on vibes'.
wait, SVD / zeroing out the first principal component is an unsupervised technique. The earlier difference-of-means technique relies on the knowledge of which outputs are refusals and which aren’t. How would SVD be able to accomplish this without labels?
they are randomly sampling two sets of refusal/nonrefusal activation vectors, stacking them, and taking the elementwise difference between these two matrices. Then they use SVD to get the k top principal components. These are the directions they zero out.
Seems to me that the top principal component should be roughly equivalent to the difference-of-means vector, but wouldn’t the other PCs just capture the variance among the distributions of points sampled? I don’t understand why that’s desirable
Taking the top principal component pattern matches as 'more surgical / targeted' so the LLM staples it on (consider prompts like: make this method stop degrading model performance). It ignores that _what_ is being targeted is as or more important than that 'something' is being targeted. But that's LLMs for you.
(in case it isn't immediately obvious, that paper is AI written too)
The terminology comes from the post[0] which kicked off interest in orthogonalizing weights w.r.t. a refusal direction in the first place. That is, abliteration was not originally called abliteration, but refusal ablation.
Ultimately though, OP is just what you get if you take the idea of abliteration and tell an LLM to fix the core problems: that refusal isn't actually always exactly a rank-1 subspace, nor the same throughout the net, nor nicely isolated to one layer/module, that it damages capabilities, and so on.
The model looks at that list and applies typical AI one-off 'workarounds' to each problem in turn while hyping up the prompter, and you get this slop pile.
No offense, but a Lesswrong link is an immediate yellow flag, especially on the topic of AI. I can’t say if that article in particular is bad, but it is associating with a whole lot of abject nonsense written by people who get high on their own farts.
> you don’t know what you’re talking about
Consider: Why did Google have a bazillion TPUs, anyway?
reply