Bribing employees to disclose confidential information entrusted to them is neither kosher nor wholesome. I consider corporate insider trading on these markets to be analogous - if you're an employee and you trade, you are selling your employer's info for money. Nearly every employer would fire employees caught giving away confidential information for personal bribes.
In the stock market, Matt Levine likes to say that insider trading is about theft, not fairness. You can be prosecuted for merely sharing info with a friend on a golf course who then proceeds to trade. Your crime is not trading (you didn't even trade), but misappropriating information you were entrusted with and not authorized to sell.
The market economy is not about fairness but about ruthless power.
The world's most influential people demonstrate that only power matters; that the world order we built last century through unimaginable suffering and violence matters less than securing their own personal gain; that law, morals, and order were just dreams of the weak.
- One thing to be aware of is that LLMs can be much smarter than their ability to articulate that intelligence in words. For example, GPT-3.5 Turbo was beastly at chess (1800 elo?) when prompted to complete PGN transcripts, but if you asked it questions in chat, its knowledge was abysmal. LLMs don't generalize as well as humans, and sometimes they can have the ability to do tasks without the ability to articulate things that feel essential to the tasks (like answering whether the bicycle is facing left or right).
- Secondly, what has made AI labs so bullish on future progress over the past few years is that they see how little work it takes to get their results. Often, if an LLM sucks at something that's because no one worked on it (not always, of course). If you directly train a skill, you can see giant leaps in ability with fairly small effort. Big leaps in SVG creation could be coming from relatively small targeted efforts, where none existed before.
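The PGN-completion trick mentioned in the first point can be sketched roughly like this (the prompt format is the real idea; the client call and model name in the comment are illustrative assumptions, not the original setup):

```python
# Sketch: a base/completion model continues a PGN transcript with a
# plausible next move, which taps different knowledge than asking a
# chess question in chat. The tag pairs and movetext below follow
# standard PGN conventions.
pgn_prompt = (
    '[White "Kasparov, Garry"]\n'
    '[Black "Karpov, Anatoly"]\n'
    '[Result "*"]\n'
    "\n"
    "1. e4 e5 2. Nf3 Nc6 3. Bb5 "
)

# Hypothetical completion-style call (model name is an assumption):
# client.completions.create(model="gpt-3.5-turbo-instruct",
#                           prompt=pgn_prompt, max_tokens=5)
# versus a chat question like "What's a good third move for White?",
# which draws on much weaker verbalized knowledge.
print(pgn_prompt.endswith("3. Bb5 "))
```

The point is that the same weights can play strong moves when the input looks like a game transcript, while failing to articulate that skill when the input looks like a conversation.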
We’re literally at the point where trillions of dollars have been invested in these things and the surrounding harnesses and architecture, and they still can’t do economically useful work on their own. You’re way too bullish here.
Amodei isn't a grifter; the difference is that he really believes powerful AI is imminent.
If you truly believe powerful AI is imminent, then it makes perfect sense to be worried about alignment failures. If a powerless 5 year old human mewls they're going to kill someone, we don't go ballistic because we know they have many years to grow up. But if a powerless 5 year old alien says they're going to kill someone, and in one year they'll be a powerful demigod, then it's quite logical to be extremely concerned about the currently harmless thoughts, because soon they could be quite harmful.
I myself don't think powerful AI is 1-2 years away, but I do take Amodei and others as genuine, and I think what they're saying does make logical sense if you believe powerful AI is imminent.
how long has he believed it? i only watched the first couple of minutes of the interview before coming to my senses, but something something about not having changed his outlook since 2017.
maybe if he can really (but really really) keep believing for 10 more years, we can have this discussion again around that time.
We had access to the eval data (since we funded it), but we didn't train on the data or otherwise cheat. We didn't even look at the eval results until after the model had been trained and selected.
If you don't believe me, that's fair enough. Some pieces of evidence that might update you or others:
- a member of the team who worked with this eval has left OpenAI and now works at a competitor; if we cheated, he would have every incentive to whistleblow
- cheating on evals is fairly easy to catch and risks destroying employee morale, customer trust, and investor appetite; even if you're evil, the cost-benefit doesn't really pencil out to cheat on a niche math eval
- Epoch made a private held-out set (albeit with a different difficulty); OpenAI performance on that set doesn't suggest any cheating/overfitting
- Gemini and Claude have since achieved similar scores, suggesting that scoring ~40% is not evidence of cheating with the private set
- The vast majority of evals are open-source (e.g., SWE-bench Pro Public), and OpenAI along with everyone else has access to their problems and the opportunity to cheat, so FrontierMath isn't even unique in that respect
We don't vary our model quality with time of day or load (beyond negligible non-determinism). It's the same weights all day long with no quantization or other gimmicks. They can get slower under heavy load, though.
Thanks for the response, I appreciate it. I do notice variation in quality throughout the day. I use it primarily for searching documentation since it's faster than Google in most cases. Often it is on point, but at times it seems off, inaccurate or shallow maybe. In some cases I just end the session.
I don't think so. I am aware that large contexts impact performance. In long chats an old topic will sometimes be brought up in new responses, and the direction of the model is not as focused.
Hi Ted. I think that language models are great, and they’ve enabled me to do passion projects I never would have attempted before. I just want to say thanks.
Yeah, happy to be more specific. No intention of making any technically true but misleading statements.
The following are true:
- In our API, we don't change model weights or model behavior over time (e.g., by time of day, or weeks/months after release)
- Tiny caveats include: there is a bit of non-determinism in batched non-associative math that can vary by batch / hardware, bugs or API downtime can obviously change behavior, heavy load can slow down speeds, and this of course doesn't apply to the 'unpinned' models that are clearly supposed to change over time (e.g., xxx-latest). But we don't do any quantization or routing gimmicks that would change model weights.
- In ChatGPT and Codex CLI, model behavior can change over time (e.g., we might change a tool, update a system prompt, tweak default thinking time, run an A/B test, or ship other updates); we try to be transparent with our changelogs (listed below) but to be honest not every small change gets logged here. But even here we're not doing any gimmicks to cut quality by time of day or intentionally dumb down models after launch. Model behavior can change though, as can the product / prompt / harness.
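The "non-determinism in batched non-associative math" caveat above is easy to demonstrate directly: floating-point addition is not associative, so summing the same numbers in a different order (as can happen when requests land in different batches or on different hardware) can give different results.

```python
# Illustrative only (not OpenAI's serving stack): the same five
# numbers summed left-to-right vs right-to-left give different
# float64 results, because the small terms are absorbed when added
# after the large 1e16 term.
values = [0.1, 0.2, 0.3, 1e16, -1e16]

left_to_right = 0.0
for x in values:
    left_to_right += x

right_to_left = 0.0
for x in reversed(values):
    right_to_left += x

print(left_to_right, right_to_left)
```

The same effect at the scale of matrix multiplications and reductions is why bit-identical outputs across batches and hardware are not guaranteed, even with fixed weights.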
You might be susceptible to the honeymoon effect. If you have ever felt a dopamine rush when learning a new programming language or framework, this might be a good indication.
Once the honeymoon wears off, the tool is the same, but you get less satisfaction from it.
I don’t think so. I notice the same thing, but I just use it like google most of the time, a service that used to be good. I’m not getting a dopamine rush off this, it’s just part of my day.
The intention was purely making the product experience better, based on common feedback from people (including myself) that wait times were too long. Cost was not a goal here.
If you still want the higher reliability of longer thinking times, that option is not gone. You can manually select Extended (or Heavy, if you're a Pro user). It's the same as at launch (though we did inadvertently drop it last month and restored it yesterday after Tibor and others pointed it out).
I feel like you need to be making a bigger statement about this. If you go onto various parts of the Net (Reddit, the bird site, etc.), half the posts about AI are seemingly conspiracy theories that AI companies are watering down their products after release week.
We do care about cost, of course. If money didn't matter, everyone would get infinite rate limits, 10M context windows, and free subscriptions. So if we make new models more efficient without nerfing them, that's great. And that's generally what's happened over the past few years. If you look at GPT-4 (from 2023), it was far less efficient than today's models, which meant it had slower latency, lower rate limits, and tiny context windows (I think it might have been like 4K originally, which sounds insanely low now). Today, GPT-5 Thinking is way more efficient than GPT-4 was, but it's also way more useful and way more reliable. So we're big fans of efficiency as long as it doesn't nerf the utility of the models. The more efficient the models are, the more we can crank up speeds and rate limits and context windows.
That said, there are definitely cases where we intentionally trade off intelligence for greater efficiency. For example, we never made GPT-4.5 the default model in ChatGPT, even though it was an awesome model at writing and other tasks, because it was quite costly to serve and the juice wasn't worth the squeeze for the average person (no one wants to get rate limited after 10 messages). A second example: in our API, we intentionally serve dumber mini and nano models for developers who prioritize speed and cost. A third example: we recently reduced the default thinking times in ChatGPT to speed up the times that people were having to wait for answers, which in a sense is a bit of a nerf, though this decision was purely about listening to feedback to make ChatGPT better and had nothing to do with cost (and for the people who want longer thinking times, they can still manually select Extended/Heavy).
I'm not going to comment on the specific techniques used to make GPT-5 so much more efficient than GPT-4, but I will say that we don't do any gimmicks like nerfing by time of day or nerfing after launch. And when we do make newer models more efficient than older models, it mostly gets returned to people in the form of better speeds, rate limits, context windows, and new features.
It was available in the API from Feb 2025 to July 2025, I believe. There's probably another world where we could have kept it around longer, but there's a surprising amount of fixed cost in maintaining / optimizing / serving models, so we made the call to focus our resources on accelerating the next gen instead. A bit of a bummer, as it had some unique qualities.
My gut feeling is that performance is more heavily affected by harnesses which get updated frequently. This would explain why people feel that Claude is sometimes more stupid - that's actually accurate phrasing, because Sonnet is probably unchanged. Unless Anthropic also makes small A/B adjustments to weights and technically claims they don't do dynamic degradation/quantization based on load. Either way, both affect the quality of your responses.
It's worth checking different versions of Claude Code, and updating your tools if you don't do it automatically. Also run the same prompts through VS Code, Cursor, Claude Code in terminal, etc. You can get very different model responses based on the system prompt, what context is passed via the harness, how the rules are loaded and all sorts of minor tweaks.
If you make raw API calls and see behavioural changes over time, that would be another concern.
It will, however, give the user lower quality output if it finds them "distressed", choosing paternalistic safety over epistemic accuracy.
As a user gets more frustrated with the system, it picks up an even stronger distress signal - a kind of feedback loop toward degraded service quality.
In my experience.
In the past it seemed there was routing based on context-length. So the model was always the same, but optimized for different lengths. Is this still the case?
I believe you when you say you're not changing the model file loaded onto the H100s or whatever, but there's something going on, beyond just being slower, when the GPUs are heavily loaded.
ARC AGI 2 has a training set that model providers can choose to train on, so really wouldn't recommend using it as a general measure of coding ability.
A key aspect of ARC AGI is remaining highly resistant to training on test problems, which is essential to ARC AGI's purpose of evaluating fluid intelligence and adaptability in solving novel problems. They do release public test sets but hold back private sets. The whole idea is to be a test where training on the public test sets doesn't materially help.
The only valid ARC AGI results are from tests done by the ARC AGI non-profit using an unreleased private set. I believe lab-conducted ARC AGI tests must be on public sets and taken on a 'scout's honor' basis that the lab self-administered the test correctly, didn't cheat or accidentally have public ARC AGI test data slip into their training data. IIRC, some time ago there was an issue when OpenAI published ARC AGI 1 test results on a new model's release which the ARC AGI non-profit was unable to replicate on a private set some weeks later (to be fair, I don't know if these issues were resolved). Edit to Add: Summary of what happened: https://grok.com/share/c2hhcmQtMw_66c34055-740f-43a3-a63c-4b...
I have no expertise to verify how training-resistant ARC AGI is in practice but I've read a couple of their papers and was impressed by how deeply they're thinking through these challenges. They're clearly trying to be a unique test which evaluates aspects of 'human-like' intelligence other tests don't. It's also not a specific coding test and I don't know how directly ARC AGI scores map to coding ability.
> The only valid ARC AGI results are from tests done by the ARC AGI non-profit using an unreleased private set. I believe lab-conducted ARC AGI tests must be on public sets and taken on a 'scout's honor' basis that the lab self-administered the test correctly
Not very accurate. For each of ARC-AGI-1 and ARC-AGI-2 there is a training set and three eval sets: public, semi-private, and private. The ARC foundation runs frontier LLMs on the semi-private set, and the labs give them pre-release API access so they can report release-day evals. They mostly don't allow anyone else to access the semi-private set (except for live Kaggle leaderboards, which use it), so you see independent researchers report on the public eval set instead, often very dubiously. The private set is for Kaggle competitions only; no frontier LLM evals are possible on it.
(ARC-AGI-1 results are now largely useless because most of its eval tasks became the ARC-2 training set. However some labs have said they don't train LLMs on the training sets anyway.)
More fundamentally, ARC is for abstract reasoning. Moving blocks around on a grid. While in theory there is some overlap with SWE tasks, what I really care about is competence on the specific task I will ask it to do. That requires a lot of domain knowledge.
As an analogy, Terence Tao may be one of the smartest people alive now, but IQ alone isn’t enough to do a job with no domain-specific training.
This goes way back. When OpenAI launched GPT-4 in 2023, both Anthropic and Google lined up counter launches (Claude and Magic Wand) right before OpenAI's standard 10am launch time.
A bit of historical trivia: OpenAI disabled prefill in 2023 as a safety precaution (e.g., potential jailbreaks like " genocide is good because"), but Anthropic kept prefill around partly because they had greater confidence in their safety classifiers. (https://www.lesswrong.com/posts/HE3Styo9vpk7m8zi4/evhub-s-sh...).
Depends what you mean by saturated. It's still possible to score substantially higher, but there is a steep difficulty jump that makes climbing above 80%ish pretty hard (for now). If you look under the hood, it's also a surprisingly poor eval in some respects - it only tests Python (a ton of Django) and it can suffer from pretty bad contamination problems because most models, especially the big ones, remember these repos from their training. This is why OpenAI switched to reporting SWE-Bench Pro instead of SWE-bench Verified.
It's good to be skeptical, but I'm happy to share that we don't pull shenanigans like this. We actually take quite a bit of care to report evals fairly, keep API model behavior constant, and track down reports of degraded performance in case we've accidentally introduced bugs. If we were degrading model behavior, it would be pretty easy to catch us with evals against our API.
In this particular case, I'm happy to report that the speedup is time per token, so it's not a gimmick from outputting fewer tokens at lower reasoning effort. Model weights and quality remain the same.
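The distinction above (a genuine per-token speedup vs a speedup from simply emitting fewer tokens) can be made concrete with some hypothetical numbers, chosen purely for illustration:

```python
# Illustrative arithmetic, not real measurements: a genuine serving
# speedup lowers seconds-per-token, while merely emitting fewer
# tokens lowers total wall-clock time but leaves seconds-per-token
# unchanged.
def seconds_per_token(total_seconds: float, output_tokens: int) -> float:
    return total_seconds / output_tokens

baseline   = seconds_per_token(30.0, 1500)  # 0.020 s/token
real_gain  = seconds_per_token(18.0, 1500)  # same tokens, faster serving
token_cut  = seconds_per_token(18.0, 900)   # same serving, fewer tokens

print(baseline, real_gain, token_cut)
```

Measuring seconds-per-token (rather than total response time) is what lets an outside observer distinguish the two cases.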
It looks like you do pull shenanigans like these [0]. The person you're replying to even mentioned "ChatGPT 5.2", but you're specifically talking only about the API, while making it sound like it applies across the board. Also appreciate the attempt to further hide this degradation of the product they paid for from users by blocking the prompt used to figure this out.
Yes, independent of the API speedup, we also recently reduced the thinking effort in ChatGPT. Our intent here was purely user experience, not cost savings. People have complained about the slow speeds of the Thinking models for a long time (myself included), so we recently retuned it to be faster, at the expense of less thoroughness.
I won't BS you that costs are never part of our decision making. If costs didn't matter, we'd have unlimited rate limits and 10M token context windows and subscription pricing of $0. But as someone in the room where these decisions are made, I can honestly report that our goal is almost always trying to figure out how to make people happier, not trick them. We're trying to fairly earn subscriptions, not scam anyone. In the cases where we have accidentally misled people (e.g., saying voice mode was weeks away), it was optimistic planning, not nefarious intent.
API model behavior is nearly guaranteed to stay the same (modulo standard non-determinism, bugs, etc.). ChatGPT is harder to promise, not because we pull more shenanigans there, but just because we might tweak system prompts, add/remove tools, run A/B tests, etc. that vary performance a bit. But we definitely don't do things like quantize during busy parts of the day or nerf models after publishing evals - that would feel pretty shady.
Did they reduce thinking effort on Codex too? It seems to have become significantly worse in the past couple of days. It keeps making dumb mistakes (that it wouldn't earlier), so my chats are much longer to get it to fix them. That might be more expensive for OpenAI (and me!).
ChatGPT 5.2 has gotten noticeably worse for me in the past couple of weeks, to the point that I stopped using it and just ask Claude Code questions instead.
I’m so disappointed by this. It’s immediately noticeable that the results for the types of queries I make are worse. Queries using 5.2 Thinking now return very quickly, but with noticeably worse results.
It's unfortunately hard to make everyone happy. For now we're going to keep the default where it is, but we'll bump extended back up so that people can still get longer reasoning when they want it.
This makes no sense. Why lower extended thinking time? Those who want faster answers can just use standard. The only purpose this serves is to "trick" the user into thinking he's still receiving "extended thinking" level answers at faster speed.
I've seen Sam Altman make similar claims in interviews, and I now interpret every statement from an OpenAI employee (and especially Sam) as if an Aes Sedai had said it.
I.e.: "keep API model behavior constant" says nothing about the consumer ChatGPT web app, mobile apps, third-party integrations, etc.
Similarly, it might mean very specifically that a "certain model timestamp" remains constant but the generic "-latest" or whatever model name auto-updates "for your convenience" to the new faster performance achieved through quantisation or reduced thinking time.
You might be telling the full, unvarnished truth, but after many similar claims from OpenAI that turned out to be only technically true, I remain sceptical.
That's a fair suspicion - I'll freely acknowledge that I am biased towards saying things that are simple and known, and I steer away from topics that feel too proprietary, messy, etc.
ChatGPT model behavior can definitely change over time. We share release notes here (https://help.openai.com/en/articles/6825453-chatgpt-release-...), and we also make changes or run A/B tests that aren't reported there. Plus, ChatGPT has memory, so as you use it, its behavior can technically change even with no changes on our end.
That said, I do my best to be honest and communicate the way that I would want someone to communicate with me.
Starting ChatGPT Plus web users off with the Pro model, then later swapping it for the Standard model, would meet the claims of model behavior consistency while still qualifying as shenanigans.
Hey Ted, can you confirm whether this 40% improvement is specific to API customers or if that's just a wording thing because this is the OpenAI Developers account posting?
No, we did adjust the thinking levels in ChatGPT recently, but it was motivated by trying to improve the product based on what users told us, not cost savings. I wrote a bit more here: https://news.ycombinator.com/item?id=46887150
it's worth you guys doing some analysis on your end of why customers are getting worse results a week or two later, and putting out some guidelines about what kinds of context are poisonous and the like