
So the author has a clear conflict of interest with the contents of the blog, because he's an employee of Anthropic. But regarding this "blog", showing the graph where OpenAI compares "frontier" models by pitting gpt-4o against o3-high is just disingenuous; o1 vs o3 would have been a closer fight between "frontier" models. Also, today I learned that there are people paid to benchmark AI models on how close they are to "human" level, apparently even "expert" level, whatever that means. I'm not an LLM hater by any means, but I can confidently say that they aren't experts in any field.


The guys in the other thread who said that OpenAI might have quantized o3, and that that's how they reduced the price, might be right. This o3-pro might be the actual o3-preview from the beginning, and o3 might just be a quantized version. I wish someone would benchmark all of these models to check for drops in quality.


That's definitely not the case here. The new o3-pro is slow - it took two minutes just to draw me an SVG of a pelican riding a bicycle. o3-preview was much faster than that.

https://simonwillison.net/2025/Jun/10/o3-pro/


Do you think a cycling pelican is still a valid cursory benchmark? By now surely discussions about it are in the training set.

There are quite a few on Google Image search.

On the other hand they still seem to struggle!


Wow! The pelican benchmark is now saturated.


Not until I can count the feathers, ask for a front view of the same pelican, then ask for it to be animated, all still using SVG.


I wonder how much of that is because it's getting more and more included in training data.

We now need to start using walruses riding rickshaws.


Would you say this is the best cycling pelican to date? I don't remember any of the others looking better than this.

Of course by now it'll be in-distribution. Time for a new benchmark...


I love that we are in the timeline where we are somewhat seriously evaluating probably superhuman intelligence by its ability to draw an SVG of a cycling pelican.


I still remember my jaw hitting the floor when the first DALL-E paper came out, with the baby daikon radish walking a dog. How the actual fuck...? Now we're probably all too jaded to fully appreciate the next advance of that magnitude, whatever that turns out to be.

E.g., the pelicans all look pretty cruddy including this one, but the fact that they are being delivered in .SVG is a bigger deal than the quality of the artwork itself, IMHO. This isn't a diffusion model, it's an autoregressive transformer imitating one. The wonder isn't that it's done badly, it's that it's happening at all.


This makes me think of a reduction gear as a metaphor. At a high enough ratio, the torque is enormous but is put toward barely perceptible movement. There's a huge amount of computation happening just to produce an SVG that resembles a pelican on a bicycle.


I don't love that this is the conversation, and that when these models bake these silly scenarios into their training data, everyone goes "see, pelican bike! superhuman intelligence!"

The point is never the pelican. The point is that if a thing has information about pelicans, and has information about bicycles, then why can't it combine those ideas? Is it because it's not intelligent?


"I'm taking this talking dog right back to the pound. It told me to go long on AAPL. Totally overhyped"


Just because it's impressive doesn't mean it has "super human intelligence" though.


Well, it certainly came up with a better-looking SVG pelican than this human could have.


I like the Gemini 2.5 Pro ones a little more: https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...


That's one good looking pelican


This made me think of the 'draw a bike' experiment, where people were asked to draw a bike from memory, and were surprisingly bad at recreating how the parts fit together in a sensible manner:

https://road.cc/content/blog/90885-science-cycology-can-you-...

ChatGPT seems to perform better than most, but with notable missing elements (where's the chain or the handlebars?). I'm not sure if those are due to a lack of understanding, or artistic liberties taken by the model?



Well, that might be more of a function of how long they let it 'reason' than anything intrinsic to the model?


> It's only available via the newer Responses API

And in ChatGPT Pro.


I've wondered if some kind of smart pruning is possible during evaluation.

What I mean by that is: if a neuron implements a sigmoid function and its input weights are 10, 1, 2, 3, then if the first input is active, evaluating the other ones is practically pointless, since they barely change the result, which recursively means the inputs of the neurons feeding those precursors are pointless as well.

I have no idea how feasible or practical it is to implement such an optimization at full network scale, but I think it's interesting to think about.
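
To make that concrete, here's a toy sketch in Python (my own example, assuming inputs bounded in [-1, 1]; nothing like how real dense layers are actually evaluated): once the accumulated sum has saturated the sigmoid, the unread inputs can only move the output by a negligible amount, so you stop early.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def lazy_sigmoid_neuron(inputs, weights, eps=1e-3):
        # Toy pruning sketch: visit the largest-magnitude weights first and stop
        # once the inputs not yet read can't move the output by more than eps.
        # Assumes inputs lie in [-1, 1], so an unread weight w contributes at most |w|.
        order = sorted(range(len(weights)), key=lambda i: -abs(weights[i]))
        total = 0.0
        remaining = sum(abs(w) for w in weights)
        for n, i in enumerate(order, start=1):
            total += weights[i] * inputs[i]
            remaining -= abs(weights[i])
            if sigmoid(total + remaining) - sigmoid(total - remaining) < eps:
                print(f"skipped {len(weights) - n} of {len(weights)} inputs")
                break
        return sigmoid(total)

    # Weights 10, 1, 2, 3 with the first input active: the output is pinned near 1
    # after reading two inputs, so the rest are skipped.
    print(lazy_sigmoid_neuron([1, 1, 1, 1], [10, 1, 2, 3]))

My guess is the bookkeeping would cost more than the skipped multiply-adds on GPU-style hardware, which is presumably why pruning is usually baked in ahead of time rather than decided per input.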


o3-pro is not the same as the o3-preview that was shown in Dec '24. OpenAI confirmed this for us. More on that here: https://x.com/arcprize/status/1932535380865347585


Is there a way to figure out likely quantization from the output? I mean, does quantization degrade output quality in ways that are different from other modifications of model properties (e.g. size or distillation)?


What a great future we are building. If AI is supposed to run everything, everywhere... then there will be 2, maybe 3, AI companies. And nobody outside those companies knows how they work.


What makes you think so? So far, many new AI companies are sprouting and many of them seem to be able to roughly match the state-of-the-art very quickly. (But pushing the frontier seems to be harder.)

From the evidence we have so far, it does not look like there's any natural monopoly (or even natural oligopoly) in AI companies. Just the opposite. Especially with open-weight models, or even more so with completely open source models.


> And nobody outside those companies knows how they work.

I think you meant to say:

And nobody knows how they work.


To be honest, checking if there is a path between two nodes is a better example of NP-hard, because it's obvious why you can't verify a solution in polynomial time. Sure, the problem isn't decidable, but it's hard to give problems that are decidable and explain why the proof can't be verified in P time. Only problems that involve playing a game optimally (with more than one player) that can have cycles come to mind. These are the "easiest" to grasp.


Isn't this NP-complete? The "solution" here would be the steps to take in the path, which can be found by brute force.

Wikipedia:

> 2. When the answer is "yes", this can be demonstrated through the existence of a short (polynomial length) solution.

> 3. The correctness of each solution can be verified quickly (namely, in polynomial time) and a brute-force search algorithm can find a solution by trying all possible solutions.
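
For concreteness, a minimal sketch in Python (toy graph, made-up node names) of why plain s-t reachability meets both conditions: the certificate is just the path itself, and both checking it and finding it outright are polynomial, so the problem is in P and hence trivially in NP.

    from collections import deque

    def verify_path(graph, path, s, t):
        # Certificate check: O(len(path)) edge lookups.
        if not path or path[0] != s or path[-1] != t:
            return False
        return all(v in graph.get(u, ()) for u, v in zip(path, path[1:]))

    def find_path(graph, s, t):
        # BFS finds a certificate outright in polynomial time.
        prev, queue = {s: None}, deque([s])
        while queue:
            u = queue.popleft()
            if u == t:
                path = []
                while u is not None:
                    path.append(u)
                    u = prev[u]
                return path[::-1]
            for v in graph.get(u, ()):
                if v not in prev:
                    prev[v] = u
                    queue.append(v)
        return None

    g = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
    p = find_path(g, "a", "d")
    print(p, verify_path(g, p, "a", "d"))  # ['a', 'b', 'd'] True

The cases where no short certificate is known are things like deciding who wins a two-player game, which seems to be what the parent comment is reaching for.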


Considering how much they trust their LLMs, why don't they just run o1-pro to make a summary of the responses given in the feedback?


It will support both, but considering the previous experiences with AVX-512 on Intel, I wouldn't be that excited.


I wonder how much LimeWire pays to buy all of these FOSS projects; it must be a decent amount if everyone is selling their solution.


Wouldn't be surprised if the prices are fairly low, actually; I would wager most of these projects make no income.


limewire aint touchin soulseek

and it has people building alt.clients

    https://nicotine-plus.org
    https://github.com/slskd/slskd
(though these are not webapps, which was your main shtick i'm sure)


The idea is correct: a lot of people (including myself sometimes) just let an "agent" run, do some stuff, and then check later if it finished. This is obviously more dangerous than the LLM just hallucinating functions, since at least you can catch the latter, while the former depends on the project's tests or your reviewing skills.

The real problem with hallucination is that we started using LLMs as search engines, so when it invents a function, you have to go and actually search the API on a real search engine.
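
One cheap mechanical check before going back to a real search engine (the function names below are made up for illustration): ask the installed package itself whether the symbol exists.

    import importlib

    def symbol_exists(dotted_name):
        # Check whether e.g. "json.loads" resolves against the installed package.
        module_name, _, attr = dotted_name.rpartition(".")
        if not module_name:
            return False
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            return False
        return hasattr(module, attr)

    print(symbol_exists("json.loads"))       # True: real function
    print(symbol_exists("json.parse_fast"))  # False: plausible-sounding but invented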


>The real problem with hallucination is that we started using LLMs as search engines, so when it invents a function, you have to go and actually search the API on a real search engine.

That still seems useful when you don't already know enough to come up with good search terms.


These "OCR" tools who are actually multimodals are interesting because they can do more than just text abstraction, but their biggest flaw is hallucinations and overall the nondeterministic nature. Lately, I've been using Gemini to turn my notebooks into Latex documents, so I can see a pretty nice usecase for this project, but it's not for "important" papers or papers that need 100% accuracy.


How about building a tool which indexes OCR chunks/tokens along with a confidence grade, setting a tolerance level, and defining actions for when a token or chunk falls below that level. Actions could include automated verification using another model or, as a last resort, a human.
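
As a rough sketch of the routing idea (thresholds and field names are invented, and it's deliberately agnostic about which OCR/LLM stack produces the confidences):

    from dataclasses import dataclass

    @dataclass
    class Chunk:
        text: str
        confidence: float  # 0.0-1.0, however the OCR engine reports it

    def triage(chunks, tolerance=0.9, review_band=0.2):
        # Route each chunk: accept, re-verify with another model, or escalate to a human.
        accepted, recheck, human = [], [], []
        for c in chunks:
            if c.confidence >= tolerance:
                accepted.append(c)
            elif c.confidence >= tolerance - review_band:
                recheck.append(c)   # automated verification with a second model
            else:
                human.append(c)     # last resort: human review
        return accepted, recheck, human

    chunks = [Chunk("Invoice #4821", 0.98), Chunk("Tota1 due: $1,2O0", 0.78), Chunk("???", 0.30)]
    print(triage(chunks))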


How would you calculate the confidence? LLMs are notoriously bad at grading their own output.


It's not really a hot take: considering the price, they probably released it to scam some people when they go to `benchmark` it or to buy the `pro` version. You must be completely in denial to think that GPT-4.5 had a successful launch, considering that 3 days before, a real and useful model was released by their competitor.


I quit the original "Firefox" a long time ago. I've been using LibreWolf since its release, and now Zen (also a Firefox fork), and I keep Ungoogled Chromium in case a site is broken on Firefox.

