ChatGPT got very sycophantic for me about a month ago (I know because I complained about it at the time), so I think I got it early as part of an A/B test.
Interestingly, at one point I got a left/right "which response do you prefer" comparison where one version was belittling and insulting me for asking the question. That only happened once, though.
I'm not sure how this problem can be solved. How do you test a system with emergent properties of this degree, whose behavior depends on the existing memory of customer chats in production?
I doubt it's that simple. What about memories running in prod? What about explicit user instructions? What about subtle changes in prompts? What happens when a bad release poisons memories?
The problem space is massive and growing rapidly; people are finding new ways to talk to LLMs all the time.
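To make the combinatorial explosion concrete, here is a minimal sketch of why exhaustive testing breaks down. Everything in it is hypothetical: the axes, their values, and the evaluate_model stub are illustrative, not any real eval harness.

    import itertools

    # Hypothetical dimensions of the test space, each grossly undersampled.
    memory_states = ["empty", "short_history", "long_history", "poisoned_by_bad_release"]
    custom_instructions = ["none", "be_terse", "be_encouraging", "roleplay_persona"]
    prompt_styles = ["direct_question", "venting", "adversarial", "ambiguous"]

    def evaluate_model(memory, instructions, prompt):
        """Stub: a real harness would replay the memory and instructions
        into a fresh session, send the prompt, and score the response
        (e.g. for sycophancy) against a rubric or judge model."""
        return {"memory": memory, "instructions": instructions, "prompt": prompt}

    # Even with 4 coarse values per axis, the matrix is 4^3 = 64 cases;
    # real memory contents and prompt phrasings are effectively unbounded,
    # so any fixed test suite samples a vanishing fraction of production behavior.
    cases = list(itertools.product(memory_states, custom_instructions, prompt_styles))
    print(f"{len(cases)} combinations from just 3 toy axes")
    for memory, instructions, prompt in cases[:3]:
        evaluate_model(memory, instructions, prompt)

Each new axis (language, session length, tool use, ...) multiplies the matrix, which is the core of the testing problem described above.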