Bribing employees to disclose confidential information entrusted to them is neither kosher nor wholesome. I consider corporate insider trading on these markets to be analogous - if you're an employee and you trade, you are selling your employer's info for money. Nearly every employer would fire employees caught giving away confidential information for personal bribes.
In the stock market, Matt Levine likes to say that insider trading is about theft, not fairness. You can be prosecuted for merely sharing info with a friend on a golf course who then proceeds to trade. Your crime is not trading (you didn't even trade), but misappropriating information you were entrusted with and not authorized to sell.
The market economy is not about fairness but about ruthless power.
The world's most influential people demonstrate that only power matters; that the world order we built last century through unimaginable suffering and violence matters less than securing their own personal gain; that law, morals, and order were just dreams of the weak.
- One thing to be aware of is that LLMs can be much smarter than their ability to articulate that intelligence in words. For example, GPT-3.5 Turbo was beastly at chess (1800 elo?) when prompted to complete PGN transcripts, but if you asked it questions in chat, its knowledge was abysmal. LLMs don't generalize as well as humans, and sometimes they can have the ability to do tasks without the ability to articulate things that feel essential to the tasks (like answering whether the bicycle is facing left or right).
- Secondly, what has made AI labs so bullish on future progress over the past few years is that they see how little work it takes to get their results. Often, if an LLM sucks at something that's because no one worked on it (not always, of course). If you directly train a skill, you can see giant leaps in ability with fairly small effort. Big leaps in SVG creation could be coming from relatively small targeted efforts, where none existed before.
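The PGN-completion trick mentioned in the first point can be sketched roughly like this (the prompt format is the real idea; the client call and model name in the comment are illustrative assumptions, not the original setup):

```python
# Sketch: a base/completion model continues a PGN transcript with a
# plausible next move, which taps different knowledge than asking a
# chess question in chat. The tag pairs and movetext below follow
# standard PGN conventions.
pgn_prompt = (
    '[White "Kasparov, Garry"]\n'
    '[Black "Karpov, Anatoly"]\n'
    '[Result "*"]\n'
    "\n"
    "1. e4 e5 2. Nf3 Nc6 3. Bb5 "
)

# Hypothetical completion-style call (model name is an assumption):
# client.completions.create(model="gpt-3.5-turbo-instruct",
#                           prompt=pgn_prompt, max_tokens=5)
# versus a chat question like "What's a good third move for White?",
# which draws on much weaker verbalized knowledge.
print(pgn_prompt.endswith("3. Bb5 "))
```

The point is that the same weights can play strong moves when the input looks like a game transcript, while failing to articulate that skill when the input looks like a conversation.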
We’re literally at the point where trillions of dollars have been invested in these things and the surrounding harnesses and architecture, and they still can’t do economically useful work on their own. You’re way too bullish here.
Amodei isn't a grifter; the difference is that he really believes powerful AI is imminent.
If you truly believe powerful AI is imminent, then it makes perfect sense to be worried about alignment failures. If a powerless 5 year old human mewls they're going to kill someone, we don't go ballistic because we know they have many years to grow up. But if a powerless 5 year old alien says they're going to kill someone, and in one year they'll be a powerful demigod, then it's quite logical to be extremely concerned about the currently harmless thoughts, because soon they could be quite harmful.
I myself don't think powerful AI is 1-2 years away, but I do take Amodei and others as genuine, and I think what they're saying does make logical sense if you believe powerful AI is imminent.
how long has he believed it? i only watched the first couple of minutes of the interview before coming to my senses, but something something about not having changed his outlook since 2017.
maybe if he can really (but really really) keep believing for 10 more years, we can have this discussion again around that time.
We had access to the eval data (since we funded it), but we didn't train on the data or otherwise cheat. We didn't even look at the eval results until after the model had been trained and selected.
If you don't believe me, that's fair enough. Some pieces of evidence that might update you or others:
- a member of the team who worked with this eval has left OpenAI and now works at a competitor; if we cheated, he would have every incentive to whistleblow
- cheating on evals is fairly easy to catch and risks destroying employee morale, customer trust, and investor appetite; even if you're evil, the cost-benefit doesn't really pencil out to cheat on a niche math eval
- Epoch made a private held-out set (albeit with a different difficulty); OpenAI performance on that set doesn't suggest any cheating/overfitting
- Gemini and Claude have since achieved similar scores, suggesting that scoring ~40% is not evidence of cheating with the private set
- The vast majority of evals are open-source (e.g., SWE-bench Pro Public), and OpenAI along with everyone else has access to their problems and the opportunity to cheat, so FrontierMath isn't even unique in that respect
We don't vary our model quality with time of day or load (beyond negligible non-determinism). It's the same weights all day long with no quantization or other gimmicks. They can get slower under heavy load, though.
Thanks for the response, I appreciate it. I do notice variation in quality throughout the day. I use it primarily for searching documentation since it's faster than Google in most cases. Often it is on point, but at times it seems off, inaccurate or shallow maybe. In some cases I just end the session.
I don't think so. I am aware that large contexts impact performance. In long chats an old topic will sometimes be brought up in new responses, and the direction of the model is not as focused.
Hi Ted. I think that language models are great, and they’ve enabled me to do passion projects I never would have attempted before. I just want to say thanks.
Yeah, happy to be more specific. No intention of making any technically true but misleading statements.
The following are true:
- In our API, we don't change model weights or model behavior over time (e.g., by time of day, or weeks/months after release)
- Tiny caveats include: there is a bit of non-determinism in batched non-associative math that can vary by batch / hardware, bugs or API downtime can obviously change behavior, heavy load can slow down speeds, and this of course doesn't apply to the 'unpinned' models that are clearly supposed to change over time (e.g., xxx-latest). But we don't do any quantization or routing gimmicks that would change model weights.
- In ChatGPT and Codex CLI, model behavior can change over time (e.g., we might change a tool, update a system prompt, tweak default thinking time, run an A/B test, or ship other updates); we try to be transparent with our changelogs (listed below) but to be honest not every small change gets logged here. But even here we're not doing any gimmicks to cut quality by time of day or intentionally dumb down models after launch. Model behavior can change though, as can the product / prompt / harness.
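The "non-determinism in batched non-associative math" caveat above is easy to demonstrate directly: floating-point addition is not associative, so summing the same numbers in a different order (as can happen when requests land in different batches or on different hardware) can give different results.

```python
# Illustrative only (not OpenAI's serving stack): the same five
# numbers summed left-to-right vs right-to-left give different
# float64 results, because the small terms are absorbed when added
# after the large 1e16 term.
values = [0.1, 0.2, 0.3, 1e16, -1e16]

left_to_right = 0.0
for x in values:
    left_to_right += x

right_to_left = 0.0
for x in reversed(values):
    right_to_left += x

print(left_to_right, right_to_left)
```

The same effect at the scale of matrix multiplications and reductions is why bit-identical outputs across batches and hardware are not guaranteed, even with fixed weights.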
You might be susceptible to the honeymoon effect. If you have ever felt a dopamine rush when learning a new programming language or framework, this might be a good indication.
Once the honeymoon wears off, the tool is the same, but you get less satisfaction from it.
I don’t think so. I notice the same thing, but I just use it like google most of the time, a service that used to be good. I’m not getting a dopamine rush off this, it’s just part of my day.
The intention was purely making the product experience better, based on common feedback from people (including myself) that wait times were too long. Cost was not a goal here.
If you still want the higher reliability of longer thinking times, that option is not gone. You can manually select Extended (or Heavy, if you're a Pro user). It's the same as at launch (though we did inadvertently drop it last month and restored it yesterday after Tibor and others pointed it out).
I feel like you need to be making a bigger statement about this. If you go onto various parts of the Net (Reddit, the bird site, etc.), half the posts about AI are seemingly conspiracy theories that AI companies are watering down their products after release week.
We do care about cost, of course. If money didn't matter, everyone would get infinite rate limits, 10M context windows, and free subscriptions. So if we make new models more efficient without nerfing them, that's great. And that's generally what's happened over the past few years. If you look at GPT-4 (from 2023), it was far less efficient than today's models, which meant it had slower latency, lower rate limits, and tiny context windows (I think it might have been like 4K originally, which sounds insanely low now). Today, GPT-5 Thinking is way more efficient than GPT-4 was, but it's also way more useful and way more reliable. So we're big fans of efficiency as long as it doesn't nerf the utility of the models. The more efficient the models are, the more we can crank up speeds and rate limits and context windows.
That said, there are definitely cases where we intentionally trade off intelligence for greater efficiency. For example, we never made GPT-4.5 the default model in ChatGPT, even though it was an awesome model at writing and other tasks, because it was quite costly to serve and the juice wasn't worth the squeeze for the average person (no one wants to get rate limited after 10 messages). A second example: in our API, we intentionally serve dumber mini and nano models for developers who prioritize speed and cost. A third example: we recently reduced the default thinking times in ChatGPT to speed up the times that people were having to wait for answers, which in a sense is a bit of a nerf, though this decision was purely about listening to feedback to make ChatGPT better and had nothing to do with cost (and for the people who want longer thinking times, they can still manually select Extended/Heavy).
I'm not going to comment on the specific techniques used to make GPT-5 so much more efficient than GPT-4, but I will say that we don't do any gimmicks like nerfing by time of day or nerfing after launch. And when we do make newer models more efficient than older models, it mostly gets returned to people in the form of better speeds, rate limits, context windows, and new features.
It was available in the API from Feb 2025 to July 2025, I believe. There's probably another world where we could have kept it around longer, but there's a surprising amount of fixed cost in maintaining / optimizing / serving models, so we made the call to focus our resources on accelerating the next gen instead. A bit of a bummer, as it had some unique qualities.
My gut feeling is that performance is more heavily affected by harnesses which get updated frequently. This would explain why people feel that Claude is sometimes more stupid - that's actually accurate phrasing, because Sonnet is probably unchanged. Unless Anthropic also makes small A/B adjustments to weights and technically claims they don't do dynamic degradation/quantization based on load. Either way, both affect the quality of your responses.
It's worth checking different versions of Claude Code, and updating your tools if you don't do it automatically. Also run the same prompts through VS Code, Cursor, Claude Code in terminal, etc. You can get very different model responses based on the system prompt, what context is passed via the harness, how the rules are loaded and all sorts of minor tweaks.
If you make raw API calls and see behavioural changes over time, that would be another concern.
It will, however, give the user lower quality output if it finds them "distressed", choosing paternalistic safety over epistemic accuracy.
As a user gets more frustrated with the system, it picks up an even stronger distress signal - a kind of feedback loop toward degraded service quality.
In my experience.
In the past it seemed there was routing based on context-length. So the model was always the same, but optimized for different lengths. Is this still the case?
I believe you when you say you're not changing the model file loaded onto the H100s or whatever, but there's something going on, beyond just being slower, when the GPUs are heavily loaded.
ARC AGI 2 has a training set that model providers can choose to train on, so really wouldn't recommend using it as a general measure of coding ability.
A key aspect of ARC AGI is remaining highly resistant to training on test problems, which is essential to ARC AGI's purpose of evaluating fluid intelligence and adaptability in solving novel problems. They do release public test sets but hold back private sets. The whole idea is to be a test where training on the public test sets doesn't materially help.
The only valid ARC AGI results are from tests done by the ARC AGI non-profit using an unreleased private set. I believe lab-conducted ARC AGI tests must be on public sets and taken on a 'scout's honor' basis that the lab self-administered the test correctly, didn't cheat or accidentally have public ARC AGI test data slip into their training data. IIRC, some time ago there was an issue when OpenAI published ARC AGI 1 test results on a new model's release which the ARC AGI non-profit was unable to replicate on a private set some weeks later (to be fair, I don't know if these issues were resolved). Edit to Add: Summary of what happened: https://grok.com/share/c2hhcmQtMw_66c34055-740f-43a3-a63c-4b...
I have no expertise to verify how training-resistant ARC AGI is in practice but I've read a couple of their papers and was impressed by how deeply they're thinking through these challenges. They're clearly trying to be a unique test which evaluates aspects of 'human-like' intelligence other tests don't. It's also not a specific coding test and I don't know how directly ARC AGI scores map to coding ability.
> The only valid ARC AGI results are from tests done by the ARC AGI non-profit using an unreleased private set. I believe lab-conducted ARC AGI tests must be on public sets and taken on a 'scout's honor' basis that the lab self-administered the test correctly
Not very accurate. For each of ARC-AGI-1 and ARC-AGI-2 there is a training set and three eval sets: public, semi-private, and private. The ARC foundation runs frontier LLMs on the semi-private set, and the labs give them pre-release API access so they can report release-day evals. They mostly don't allow anyone else to access the semi-private set (except for live Kaggle leaderboards, which use it), so you see independent researchers report on the public eval set instead, often very dubiously. The private set is for Kaggle competitions only; no frontier LLM evals are possible on it.
(ARC-AGI-1 results are now largely useless because most of its eval tasks became the ARC-2 training set. However some labs have said they don't train LLMs on the training sets anyway.)
More fundamentally, ARC is for abstract reasoning. Moving blocks around on a grid. While in theory there is some overlap with SWE tasks, what I really care about is competence on the specific task I will ask it to do. That requires a lot of domain knowledge.
As an analogy, Terence Tao may be one of the smartest people alive now, but IQ alone isn’t enough to do a job with no domain-specific training.
This goes way back. When OpenAI launched GPT-4 in 2023, both Anthropic and Google lined up counter launches (Claude and Magic Wand) right before OpenAI's standard 10am launch time.
A bit of historical trivia: OpenAI disabled prefill in 2023 as a safety precaution (e.g., potential jailbreaks like " genocide is good because"), but Anthropic kept prefill around partly because they had greater confidence in their safety classifiers. (https://www.lesswrong.com/posts/HE3Styo9vpk7m8zi4/evhub-s-sh...).
Depends what you mean by saturated. It's still possible to score substantially higher, but there is a steep difficulty jump that makes climbing above 80%ish pretty hard (for now). If you look under the hood, it's also a surprisingly poor eval in some respects - it only tests Python (a ton of Django) and it can suffer from pretty bad contamination problems because most models, especially the big ones, remember these repos from their training. This is why OpenAI switched to reporting SWE-Bench Pro instead of SWE-bench Verified.
It's good to be skeptical, but I'm happy to share that we don't pull shenanigans like this. We actually take quite a bit of care to report evals fairly, keep API model behavior constant, and track down reports of degraded performance in case we've accidentally introduced bugs. If we were degrading model behavior, it would be pretty easy to catch us with evals against our API.
In this particular case, I'm happy to report that the speedup is time per token, so it's not a gimmick from outputting fewer tokens at lower reasoning effort. Model weights and quality remain the same.
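The distinction above (a genuine per-token speedup vs a speedup from simply emitting fewer tokens) can be made concrete with some hypothetical numbers, chosen purely for illustration:

```python
# Illustrative arithmetic, not real measurements: a genuine serving
# speedup lowers seconds-per-token, while merely emitting fewer
# tokens lowers total wall-clock time but leaves seconds-per-token
# unchanged.
def seconds_per_token(total_seconds: float, output_tokens: int) -> float:
    return total_seconds / output_tokens

baseline   = seconds_per_token(30.0, 1500)  # 0.020 s/token
real_gain  = seconds_per_token(18.0, 1500)  # same tokens, faster serving
token_cut  = seconds_per_token(18.0, 900)   # same serving, fewer tokens

print(baseline, real_gain, token_cut)
```

Measuring seconds-per-token (rather than total response time) is what lets an outside observer distinguish the two cases.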
It looks like you do pull shenanigans like these [0]. The person you're replying to even mentioned "ChatGPT 5.2", but you're specifically talking only about the API, while making it sound like it applies across the board. Also appreciate the attempt to further hide this degradation of the product they paid for from users by blocking the prompt used to figure this out.
Yes, independent of the API speedup, we also recently reduced the thinking effort in ChatGPT. Our intent here was purely user experience, not cost savings. People have complained about the slow speeds of the Thinking models for a long time (myself included), so we recently retuned it to be faster, at the expense of less thoroughness.
I won't BS you that costs are never part of our decision making. If costs didn't matter, we'd have unlimited rate limits and 10M token context windows and subscription pricing of $0. But as someone in the room where these decisions are made, I can honestly report that our goal is almost always trying to figure out how to make people happier, not trick them. We're trying to fairly earn subscriptions, not scam anyone. In the cases where we have accidentally misled people (e.g., saying voice mode was weeks away), it was optimistic planning, not nefarious intent.
API model behavior is nearly guaranteed to stay the same (modulo standard non-determinism, bugs, etc.). ChatGPT is harder to promise, not because we pull more shenanigans there, but just because we might tweak system prompts, add/remove tools, run A/B tests, etc. that vary performance a bit. But we definitely don't do things like quantize during busy parts of the day or nerf models after publishing evals - that would feel pretty shady.
Did they reduce thinking effort on Codex too? It seems to have become significantly worse in the past couple of days. It keeps making dumb mistakes (that it wouldn't earlier), so my chats are much longer to get it to fix them. That might be more expensive for OpenAI (and me!).
ChatGPT 5.2 has gotten noticeably worse for me in the past couple of weeks, to the point that I stopped using it and just ask Claude Code questions instead.
I’m so disappointed by this. It’s immediately noticeable that the results for the types of queries I make are worse. Queries using 5.2 Thinking now return very quickly, but with noticeably worse results.
It's unfortunately hard to make everyone happy. For now we're going to keep the default where it is, but we'll bump extended back up so that people can still get longer reasoning when they want it.
This makes no sense. Why lower extended thinking time? Those who want faster answers can just use standard. The only purpose this serves is to "trick" the user into thinking he's still receiving "extended thinking" level answers at faster speed.
I've seen Sam Altman make similar claims in interviews, and I now interpret every statement from an OpenAI employee (and especially Sam) as if an Aes Sedai had said it.
I.e.: "keep API model behavior constant" says nothing about the consumer ChatGPT web app, mobile apps, third-party integrations, etc.
Similarly, it might mean very specifically that a "certain model timestamp" remains constant but the generic "-latest" or whatever model name auto-updates "for your convenience" to the new faster performance achieved through quantisation or reduced thinking time.
You might be telling the full, unvarnished truth, but after many similar claims from OpenAI that turned out to be only technically true, I remain sceptical.
That's a fair suspicion - I'll freely acknowledge that I am biased towards saying things that are simple and known, and I steer away from topics that feel too proprietary, messy, etc.
ChatGPT model behavior can definitely change over time. We share release notes here (https://help.openai.com/en/articles/6825453-chatgpt-release-...), and we also make changes or run A/B tests that aren't reported there. Plus, ChatGPT has memory, so as you use it, its behavior can technically change even with no changes on our end.
That said, I do my best to be honest and communicate the way that I would want someone to communicate with me.
Starting ChatGPT Plus web users off with the Pro model, then later swapping it for the Standard model, would meet the claims of model behavior consistency while still qualifying as shenanigans.
Hey Ted, can you confirm whether this 40% improvement is specific to API customers or if that's just a wording thing because this is the OpenAI Developers account posting?
No, we did adjust the thinking levels in ChatGPT recently, but it was motivated by trying to improve the product based on what users told us, not cost savings. I wrote a bit more here: https://news.ycombinator.com/item?id=46887150
it's worth you guys doing some analysis on your end of why customers are getting worse results a week or two later, and putting out some guidelines about what kinds of context are poisonous and the like