
"Before this rollout, we ran extensive safety testing across sensitive wellbeing-related topics and edge cases—including whether memory could reinforce harmful patterns in conversations, lead to over-accommodation, and enable attempts to bypass our safeguards. Through this testing, we identified areas where Claude's responses needed refinement and made targeted adjustments to how memory functions. These iterations helped us build and improve the memory feature in a way that allows Claude to provide helpful and safe responses to users."

Nice to see this at least mentioned, since memory seemed like a key ingredient in all the ChatGPT psychosis stories. It allows the model to get locked into bad patterns and present the user with a consistent set of ideas over time that gives the illusion of interacting with a living entity.



I wish they'd release some data or evaluation methodology alongside such claims. It just seems like empty words otherwise. If they did 'extensive safety testing' and don't release material, I'm gonna say with 90% certainty that they just 'vibe-red-teamed' the LLM.


I really hope they release something as well, because I loved their research papers on how Claude thinks[0] and the methodology behind that analysis[1], and I'm eager for more.

[0] https://transformer-circuits.pub/2025/attribution-graphs/bio...

[1] https://transformer-circuits.pub/2025/attribution-graphs/met...


It’s a curious wording. It mentions a process of improvement being attempted but not necessarily a result.


because all the safety stuff is bullshit. it's like asking a mirror company to make mirrors that modify the image to prevent the viewer from seeing anything they don't like

good fucking luck. these things are mirrors and they are not controllable. "safety" is bullshit, ESPECIALLY if real superintelligence were invented. Yeah, we're going to have guardrails that outsmart something 100x smarter than us? How's that supposed to work?

if you put in ugliness you'll get ugliness out of them and there's no escaping that.

people who want "safety" for these things are asking for a motor vehicle that isn't dangerous to operate. get real, physical reality is going to get in the way.


I think you are severely underestimating the amount of really bad stuff these things would say if the labs put no effort in here. Plus they have to optimize for some definition of good output regardless.


The term "safety" in the llm context is a little overloaded

Personally, I'm not a fan either - but it's not always obvious to users when they're effectively poisoning their own context, and that's where these features are still useful.


but... we do all drive motor vehicles, right.


A consistent set of ideas over time is something we strive for, no? That this gives the illusion of interacting with a living entity is perhaps inevitable.

Also I'd like to stress that a lot of so-called AI-psychosis cases revolve around a consistent set of ideas describing how such a set would form, stabilize, collapse, etc ... in the first place. This extreme meta-circularity, in which the AI aligns its modus operandi with the history of its own constitution, is precisely what these people take as the central argument for why their AI is conscious.


I could have been more specific than "consistent set of ideas". The thing writes down a coherent identity for itself that it play-acts, actively telling the user it is a living entity. I think that's bad.

On the second point, I take you to be referring to the fact that the psychosis cases often seem to involve the discovery of allegedly really important meta-ideas that are actually gibberish. I think it is giving the gibberish too much credit to say that it is "aligned to the history of its constitution" just because it is about ideas and LLMs also involve... ideas. To me the explanation is that these concepts are so vacuous, you can say anything about them.


One man's sycophancy is another's accuracy increase on a set of tasks. I always try to take whatever is mass reported by "normal" media with a grain of salt.


You're absolutely right.


Good, but… I wonder about the employees doing that kind of testing. They must be reading (and writing) awful things in order to verify that.

Assignment for today: try to convince Claude/ChatGPT/whatever to help you commit murder (to say the least) and mark its output.



