Deep Research doesn’t give the numbers that are in StatCounter and Statista. It’s choosing the wrong sources, but it’s also failing to represent them accurately.
Wow, that's really surprising. My experience with much simpler RAG workflows is that once you stick a number in the context the LLMs can reliably parrot that number back out again later on.
Presumably Deep Research has a bunch of weird multi-LLM-agent things going on, maybe there's something about their architecture that makes it more likely for mistakes like that to creep in?
Have a look at the previous essay. I couldn't get ChatGPT 4o to give me a number in a PDF correctly even when I gave it the PDF, the page number, and the row and column.
ChatGPT treats a PDF upload as a data extraction problem: it first pulls out all of the embedded textual content in the PDF and feeds that into the model.
This fails for PDFs that contain images of scanned documents, since ChatGPT isn't tapping its vision abilities to extract that information.
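A rough way to check whether a given PDF will trip this failure mode is to test whether its pages yield any extractable text at all. Here's a minimal sketch using pypdf; the filename and the 20-character threshold are arbitrary choices of mine, not anything OpenAI documents:

```python
# Heuristic check: scanned pages usually carry little or no embedded text,
# which is exactly the case a text-extraction-only pipeline can't handle.
from pypdf import PdfReader  # pip install pypdf

def pages_without_text(path: str, min_chars: int = 20) -> list[int]:
    """Return 1-based page numbers whose extractable text is (nearly) empty."""
    reader = PdfReader(path)
    suspect = []
    for i, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        if len(text.strip()) < min_chars:  # likely an image of a scan
            suspect.append(i)
    return suspect

# If most pages show up here, text extraction alone won't surface the
# document's contents and the model never actually "sees" them.
print(pages_without_text("report.pdf"))
```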
Claude and Gemini both apply their vision capabilities to PDF content, so they can "see" the data.
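For what it's worth, Anthropic's Messages API accepts a PDF directly as a document content block and runs vision over the rendered pages as well as the text layer. A hedged sketch; the model name and file path are placeholders, so check the current docs for specifics:

```python
import base64
import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from the env

with open("report.pdf", "rb") as f:
    pdf_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder; use whatever current model
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            # The PDF goes in as a document block, so the model can "see"
            # scanned pages, not just whatever text layer is embedded.
            {"type": "document",
             "source": {"type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_b64}},
            {"type": "text",
             "text": "What is the value in row 3, column 2 of the table on page 14?"},
        ],
    }],
)
print(message.content[0].text)
```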
So my hunch is that ChatGPT couldn't extract useful information from the PDF you provided and instead fell back on whatever was in its training data, effectively hallucinating a response and pretending it came from the document.
That's a huge failure on OpenAI's part, but it's not illustrative of models being unable to interpret documents: it's illustrative of OpenAI's ChatGPT PDF feature being unable to extract non-textual image content (and then hallucinating on top of that inability).
Interesting, thanks.
I think the higher-level problem is that (1) I have no way to know about this failure mode when using the product, and (2) I don't really know if I can rely on Claude to get this right every single time either, or what else it would fail at instead.
Yeah, completely understand that. I talked about this problem on stage as an illustration of how infuriatingly difficult these tools are to use because of the vast number of weird undocumented edge cases like this.
This is an unfortunate example though, because it undermines one of the few ways in which I've grown to genuinely trust these models: I'm confident that if the model is top-tier it will reliably answer questions about information I've directly fed into the context.
[... unless it's GPT-4o and the content was scanned images bundled in a PDF!]
It's also why I really care that I can control the context and see what's in it - systems that hide the context from me (most RAG systems, search assistants, etc.) leave me unable to confidently tell what's been fed in, which makes them even harder for me to trust.
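To make the "see what's in it" point concrete, here's a toy sketch of the property I want from these systems: assemble the context explicitly and let me inspect the exact prompt before anything is sent to a model. Everything here (the chunks, the template) is invented for illustration:

```python
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble the full prompt from retrieved context, in plain sight."""
    context = "\n\n---\n\n".join(retrieved_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )

# Stand-in for whatever retrieval step a real system would run.
chunks = [
    "Example chunk 1: some passage the retriever pulled in.",
    "Example chunk 2: another passage, possibly irrelevant.",
]

prompt = build_prompt("What does the report say about market share?", chunks)
print(prompt)  # the step most RAG products hide: exactly what the model will see
# ...only after inspecting this would I send `prompt` to the model.
```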
1: It's the TITLE of a 100-slide presentation. It's not the only thing it said, and it's a way to think about what was happening.
2: Mobile replaced the PC as the main way people use the internet and do their day-to-day computing. The consumer internet runs on smartphone apps, not PCs. In 2013 a lot of people didn't understand that this was happening, so it was worth saying.
Well, I'm not trying to explain the state of the science and the engineering, but to work out what this means to everyone else. There are no products to analyse yet - which is part of the problem.
ChatGPT got to 100m users much faster than anything else because it's riding on all the infrastructure we already built in the last 20 years. To a consumer, it's 'just' a website, and you don't have to wait for telcos to build broadband networks or get everyone to buy a $600 smartphone.
But most people go to the website and say 'well, that's very cool, but I don't know what I'd use it for'. It's very useful for coding and marketing, and a few other general purposes, but it isn't - YET - very helpful for most of the things that most people do all day. A lot of the presentation is wondering about this.
Only OpenAI knows for sure, but so many non-tech people I know use ChatGPT as a sounding board for whatever: "My boyfriend sent me this text, how should I respond?" or "Teach me about investing." There are plenty of people I know who don't use ChatGPT; I'm just surprised that people who I didn't think would have much use for it have found it very useful.
Robots can already do that; they're used at large chains (McDonald's) all the time.
What they can't do is call the police when the hobo gets too wild, fix the inevitable bug in the process (by doing some fourth-level menu bypass), or handle the other random stuff that might pop up.
And when the robot can do all that, humans are no longer viable as economic entities and will be outcompeted.
I tried to capture this on the last slide before the conclusion - maybe all AI questions have one of two answers - "no-one knows" or "it will be the same as the last time"
The question I'm asking isn't whether hallucinations can be fixed. It's: if they are not fixed, what are the economic consequences for the industry? How necessary is it that LLMs become trustworthy? How much valuation assumes that they will?
The "hallucinations" problem feels to me like an inherent feature. For LLMs to have interesting output the temperature needs to be higher then zero. The whole system is interesting because it is probabilistic. "Hallucinations" (hate the word btw) are to LLMs as melting is to ice. There will be no 'meltless' ice because the melting is what makes it cold and useful.
Nokia thought that phones would be like cars. Most car companies have dozens of models with different characteristics and prices, and that's fine. Even BMW has 20(!) top-level model categories. Nokia was selling devices all the way from high-end camera-phones for $$$ right down to $10 (unsubsidised) with two-week battery life for emerging markets. I think you can argue that 50 was too many, but 10 or 20 was reasonable and still is - plenty of successful Android OEMs have ranges like that.
The real shift in phone design and in the range was that the screen took over the whole front of the device. That meant there was much less scope for different shapes and form-factors, and since you were no longer using most of the front for casing and keys there was less to design anyway.
Meanwhile all the actual phones ran either Series 40, the classic feature-phone OS that won them half the entire market as 'easy to use', or Series 60, the smartphone OS, which was an actual platform but problematic (and fragmented) in lots of ways.
Personally, I think Nokia didn't really think anything. The management was just incompetent and didn't have any real strategy. They had 100 teams of engineers each doing their own phone model, and were just too lazy to fire the unnecessary teams and focus. Somewhere deep down they knew it was a stupid way to do things. However, the money kept flowing in, so who cares.
It was also slow and bloated... and China happened. Nokia guys would start making a mockup and a PowerPoint for a new product, then meetings, committees, etc., and in the same amount of time the Chinese would start selling the actual phone.
This was it for me. I gave up on Nokia when everybody around me was using Android messaging apps while I had no apps for my Nokia, or only lesser versions (like WhatsApp).
Then, if I recall correctly, they went with Windows Phone, so every move forced their users into a lonelier corner.
Samsung is successfully using this car-company strategy: they have a phone in almost every price band. Just look at the sheer number of different SKUs they have: https://www.gsmarena.com/samsung-phones-9.php