Hacker News | benedictevans's comments

Deep Research doesn’t give the numbers that are in Statcounter and Statista. It’s choosing the wrong sources, but it’s also failing to represent them accurately.


Wow, that's really surprising. My experience with much simpler RAG workflows is that once you stick a number in the context the LLMs can reliably parrot that number back out again later on.
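
To be concrete, this is the trivially simple pattern I have in mind - a minimal sketch using the OpenAI Python SDK, with an invented figure and question:

    # A number is pasted directly into the prompt, so it sits right in the
    # context window. In my experience the model reliably repeats it back.
    # The model name, figure, and question are all illustrative.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    retrieved_chunk = "Q3 revenue was $4.72bn, up 12% year on year."

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer only from the provided context."},
            {"role": "user", "content": f"Context:\n{retrieved_chunk}\n\nWhat was Q3 revenue?"},
        ],
    )
    print(response.choices[0].message.content)  # echoes "$4.72bn"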

Presumably Deep Research has a bunch of weird multi-LLM-agent things going on, maybe there's something about their architecture that makes it more likely for mistakes like that to creep in?


Have a look at the previous essay. I couldn't get ChatGPT 4o to give me a number in a PDF correctly even when I gave it the PDF, the page number, and the row and column.

https://www.ben-evans.com/benedictevans/2025/1/the-problem-w...


I have a hunch that's a problem unique to the way ChatGPT web edition handles PDFs.

Claude gets that question right: https://claude.ai/share/7bafaeab-5c40-434f-b849-bc51ed03e85c

ChatGPT treats a PDF upload as a data extraction problem: it first pulls out all of the embedded textual content in the PDF and feeds that into the model.

This fails for PDFs that contain images of scanned documents, since ChatGPT isn't tapping its vision abilities to extract that information.

Claude and Gemini both apply their vision capabilities to PDF content, so they can "see" the data.
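
You can reproduce the extraction side of this outside ChatGPT. Here's a minimal sketch of the text-first pipeline (my guess at what ChatGPT does) using the pypdf library - the file name is hypothetical:

    # Demonstrates why text-extraction-only PDF pipelines fail on scans.
    # "scanned_report.pdf" is a made-up file name.
    from pypdf import PdfReader

    reader = PdfReader("scanned_report.pdf")
    for page in reader.pages:
        # A scanned page has no embedded text layer, so this returns an
        # empty (or near-empty) string - the model then has nothing to
        # work with unless vision/OCR is applied to the page image instead.
        print(repr(page.extract_text()))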

I talked about this problem here: https://simonwillison.net/2024/Jun/27/ai-worlds-fair/#slide....

So my hunch is that ChatGPT couldn't extract useful information from the PDF you provided and instead fell back on whatever was in its training data, effectively hallucinating a response and pretending it came from the document.

That's a huge failure on OpenAI's part, but it's not illustrative of models being unable to interpret documents: it's illustrative of OpenAI's ChatGPT PDF feature being unable to extract non-textual image content (and then hallucinating on top of that inability).


Interesting, thanks. I think the higher-level problem is that 1: I have no way to know about this failure mode when using the product, and 2: I don't really know if I can rely on Claude to get this right every single time either, or what else it would fail at instead.


Yeah, completely understand that. I talked about this problem on stage as an illustration of how infuriatingly difficult these tools are to use because of the vast number of weird undocumented edge cases like this.

This is an unfortunate example though because it undermines one of the few ways in which I've grown to genuinely trust these models: I'm confident that if the model is top tier it will reliably answer questions about information I've directly fed into the context.

[... unless it's GPT-4o and the content was scanned images bundled in a PDF!]

It's also why I really care that I can control the context and see what's in it - systems that hide the context from me (most RAG systems, search assistants, etc.) leave me unable to confidently tell what's been fed in, which makes them even harder for me to trust.


1: It's the TITLE of a 100-slide presentation. It's not the only thing it said, and it's a way to think about what was happening.

2: Mobile replaced the PC as the main way people use the internet and do their day-to-day computing. The consumer Internet runs on smartphone apps, not PCs. In 2013 a lot of people didn't understand that that was happening, so it was worth saying.


Well, I'm not trying to explain the state of the science and the engineering, but to work out what this means to everyone else. There are no products to analyse yet - which is part of the problem.


See slides 58 and 59 - this can take a while.

ChatGPT got to 100m users much faster than anything else because it's riding on all the infrastructure we already built in the last 20 years. To a consumer, it's 'just' a website, and you don't have to wait for telcos to build broadband networks or get everyone to buy a $600 smartphone.

But, most people go to the website and say 'well, that's very cool, but I don't know what I'd use it for'. It's very useful for coding and marketing, and a few general purposes, but it isn't - YET - very helpful for most of the things that most people do all day. A lot of the presentation is wondering about this.


Only OpenAI knows for sure, but so many non-tech people I know use ChatGPT as a sounding board for whatever. "My boyfriend sent me this text, how should I respond?" or "Teach me about investing." There are plenty of people I know who don't use ChatGPT; I'm just surprised at the uptake - people who I didn't think would have a use for it have found it very useful.


How long is a while, and what is it that most people do all day?

A quick Google search for "most common job" came back with:

    Cashier

    A cashier works in a retail environment and
    processes transactions for a customer's purchase.
I wouldn't be surprised if robots can do that on their own in 10 years.


Robots can already do that; they're used at large chains (McDonald's) all the time.

What they can't do is call the police when the hobo gets too wild, fix the inevitable bug in the process (via some 4th-level menu bypass), or handle other random stuff that might pop up.

And when the robot can do all that, humans will no longer be viable as economic entities and will be outcompeted.


The problem is, the robot has to know what I want it to do without me having to dictate it.

That's the beauty of human interaction: it can't be truncated down to even just finger-pointing.


I tried to capture this on the last slide before the conclusion - maybe all AI questions have one of two answers - "no-one knows" or "it will be the same as the last time"

This is one of the "no-one knows" questions.


The question I'm asking isn't whether hallucinations can be fixed. It's what, if they are not fixed, are the economic consequences for the industry? How necessary is it that LLMs become trustworthy? How much valuation assumes that they will?


And is it even fixable?


The "hallucinations" problem feels to me like an inherent feature. For LLMs to have interesting output the temperature needs to be higher then zero. The whole system is interesting because it is probabilistic. "Hallucinations" (hate the word btw) are to LLMs as melting is to ice. There will be no 'meltless' ice because the melting is what makes it cold and useful.


Most of the presentation is saying that it isn't clear how this will work, that it will take a long time, and that it probably won't do everything.

Indeed, you would see that if you'd read even the first half-dozen slides ;)


I've decided not to read the article/slides because the title, in conjunction with the other titles on the page, sounded stupid to me.

My time is not free, sorry.


Thank you - I will add this to my file of people expressing strong opinions of things they haven’t read and know nothing about.


Americans generally still have to compile and file tax returns. In other countries that is often entirely automated.


He didn’t object to jobs. He objected to a woman having a job.


He objected to a woman wanting to have a job.


"Where are the numbers coming from?"

The numbers are sourced, both in the charts and in the text. You can find half a dozen more sources that say the same.

It's great for you that you're in the low-single-digit percentage of people who have already found use cases. Now, look at the data.


Nokia thought that phones would be like cars. Most car companies have dozens of models with different characteristics and prices, and that's fine. Even BMW has 20(!) top-level model categories. Nokia was selling devices all the way from high-end camera/phones for $$$ right down to $10 (unsubsidised) with two-week battery life for emerging markets. I think you can argue that 50 was too many, but 10 or 20 was reasonable and still is - plenty of successful Android OEMs have ranges like that.

The real shift in phone design and in the range was that the screen took over the whole front of the device. That meant there was much less scope for different shapes and form-factors, and since you were no longer using most of the front for casing and keys there was less to design anyway.

Meanwhile, all the actual phones ran either Series 40, the classic feature-phone OS that won them half the entire market as 'easy to use', or Series 60, the smartphone OS, which was an actual platform but problematic (and fragmented) in lots of ways.


Personally, I think Nokia didn't really think anything. The management was just incompetent and didn't have any real strategy. They had 100 teams of engineers each doing their own phone model, and were just too lazy to fire the unnecessary teams and focus. Somewhere deep down they knew that it was a stupid way to do things. However, the money kept flowing in, so who cared?


It was also slow and bloated... and China happened. The Nokia guys would start making a mockup and a PowerPoint for a new product, with meetings, committees, etc., and in that same time the Chinese would start selling the actual phone.

Also Elop.


I think it's fine; the problem is that they were too late to adopt Android, favouring their own OSes like Symbian or whatever their PDAs used at the time.

If only they could have copied the iPhone fast enough, they would be what Samsung is today.


This was it for me. I gave up on Nokia when everybody around me was using Android messaging apps while I had no apps for my Nokia, or only lesser versions (like WhatsApp).

Then, if I recall correctly, they went with Windows Phone, so every move was forcing their users into a lonely corner.


Samsung is successfully using this car-company strategy, and they have a phone in almost every price band. Just look at the sheer number of different SKUs they have: https://www.gsmarena.com/samsung-phones-9.php

