> Using Google during rounds is technically cheating - I’m unsure about visiting domains you find during the rounds though. It certainly violates the spirit of the game, but it also shows the models are smart enough to use whatever information they can to win.
and had noted in the methodology that
> Browsing/tools — o3 had normal web access enabled.
Still an interesting result - maybe more accurate to say O3+Search beats a human, but could also consider the search index/cache to just be a part of the system being tested.
Pointing out that it is cheating doesn't excuse the lie in the headline. That just makes it a bait and switch, a form of fraud. OP knew they were doing a bait and switch.
I remember when we were all pissed about clickbait headlines because they were deceptive. Did we just stop caring?
I'm not sure why you're defending clickbait. It is just fraud. I'm not sure why we pretend it is anything different.
Sure, people made overblown claims about the effects, but that doesn't justify fraud. A little fraud is less bad than major fraud, but that doesn't mean it isn't bad.
On the one hand, you have SEO mills churning out crap and A/B testing clickthrough rates on different headline/image combinations. That's bad.
On the other hand, you have a blogger choosing a headline for a cool thing that they did and wrote up...
The author here writes up what happens. They include ample discussion of search in their write-up. They do not need to write the entire blog post in the title in order to avoid 'fraud.' Yeesh.
IDK what SEO mills and all that have to do with any of this. What other people do doesn't matter. If something is bad, then other people doing it, and doing it worse, doesn't make it not bad. There's no logic in that framing.
I'm not sure who you think is a fool, me or you. But either way, I don't find your rhetoric acceptable. I explained why I think the title significantly diverges from the content of the article. You're welcome to disagree but that argument will have nothing to do with SEO mills. It's insulting you'd think I'd accept such a silly retort. We're not comparing here, we're categorizing.
Sure, but o3 is itself already an online service backed by an enormous data set, so regardless of whether it also searched the web, it's clearly not literally "playing fair" against a human.
But it still bounds the competition. OP is skilled in the domain. I'm not, so if I wrote a post about how O3 beat me you'd be saying how mundane a result it is. I mean, I suck at GeoGuessr. Beating me isn't impressive. This is also a bound.
Bounded to...a model trained on virtually all publicly available text ever generated by humans. I wouldn't expect web searches to even help much unless they're turning up data from after the model was trained.
> Bounded to...a model trained on virtually all publicly available text ever generated by humans
Don't forget there's a lot of non-public data too!
I don't disagree, but my point is that some bound is better than no bound. I think we can certainly agree that some bounds are better than others. Obviously we won't ever have a fully equal comparison, but I think the bounds do allow for some insights to be gained. We just need to be cautious that those insights consider those bounds. (I believe we both are cautious about what insights can be gained. If you doubt me, see my other comments. I do push back on OP pretty hard.)
> unless they're turning up data from after the model was trained.
Only under the condition that the models perform lossless compression of all the data they were trained on. If the compression is lossy, then search will reduce that loss.
One of the rules bans the use of third-party software or scripts.
Any LLM attempting to play will lose because of that rule. So, if you know the rules, and you strictly adhere to them (as you seem to be doing), then there's no need to click on the link. You already know it's not playing by GeoGuessr rules.
That being said, if you are running a test, you are free to set the rules as you see fit and explain so, and under the conditions set by the person running the test, these are the results.
> Did we just stop caring?
We stopped caring about pedantry. Especially when the person being pedantic seems to cherry pick to make their point.
This doesn't mean you shouldn't try to make things as fair as possible. Yes, it would still technically violate the rules, but don't pretend like this is binary.
> We stopped caring about pedantry
Did we? You seem to be responding to my pedantic comment with a pedantic comment.
Can O3 Beat a Master-Level GeoGuessr?
How Good is O3 at GeoGuessr?
EXIF Does Not Explain O3's GeoGuessr Performance
O3 Plays GeoGuessr (EXIF Removed)
But honestly, OP had the foresight to remove EXIF data and memory from O3 to reduce contamination. The goal of the blog post was to show that O3 wasn't cheating. So by including search, they undermine the whole point of the post.
The problem really stems from a lack of foresight: a misunderstanding of the critiques they sought to address in the first place. A good engineer understands that when their users/customers/<whatever> make a critique, the actual gripe may not be properly expressed. You have to interpret your users' complaints. Here, the complaint was "cheating", not "EXIF" per se. The EXIF complaints were just a guess at the mechanism by which it was cheating. But the complaint was still about cheating.
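For what it's worth, the EXIF-removal step doesn't even need an imaging library. A minimal sketch of what that pre-processing could look like (assuming JPEG inputs; this is my guess at the idea, not necessarily how OP actually did it):

```python
def strip_exif(jpeg: bytes) -> bytes:
    """Drop APP1 (EXIF) segments from a JPEG byte stream.

    EXIF metadata, including GPS tags, lives in APP1 segments
    (marker 0xFFE1), which always appear before the scan data.
    """
    if jpeg[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    out = bytearray(b"\xff\xd8")
    i = 2
    while i + 4 <= len(jpeg):
        if jpeg[i] != 0xFF:
            break  # malformed stream; copy the remainder as-is
        marker = jpeg[i + 1]
        if marker == 0xDA:  # SOS: entropy-coded image data follows
            out += jpeg[i:]
            return bytes(out)
        length = int.from_bytes(jpeg[i + 2:i + 4], "big")
        segment = jpeg[i:i + 2 + length]
        if marker != 0xE1:  # keep every segment except APP1 (EXIF)
            out += segment
        i += 2 + length
    out += jpeg[i:]
    return bytes(out)
```

The point being: it's a mechanical transform, so doing it says nothing about the model's capability; it only rules out one cheating mechanism.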
>The goal of the blog post was to show that O3 wasn't cheating.
No, the goal of the post was to show that o3 has incredible geolocation abilities. It's through the lens of a Geoguessr player who has experience doing geolocation, and my perspective on whether the chain of thought is genuine or nonsense.
In Simon's original post, people were claiming that o3 doesn't have those capabilities, and we were fooled by a chain of thought that was just rationalizing the EXIF data. It only had the _appearance_ of capability.
The ability to perform web search doesn't undermine the claim that o3 has incredible geolocation abilities, because it still needs to have an underlying capability in order to know what to search. That's not true for simply reading EXIF data.
This is the best way I knew to show that the models are doing something really neat. Disagreements over the exact wording of my blog post title seem to be missing the point.
I think you misinterpret my point. The goal of your post is distinct from how people will interpret it. Plenty of times people intend one thing and get a different thing. That's life.
> In Simon's original post, people were claiming that o3 doesn't have those capabilities, and we were fooled by a chain of thought that was just rationalizing the EXIF data. It only had the _appearance_ of capability.
And this is the key part!
The people questioning O3's capabilities were concerned with cheating. Any mention of EXIF is a guess as to how it was cheating, but the suspicion is still that it is cheating. That's the critique!
If you framed the title as "O3 Does Not Need EXIF Data To Beat A Master-Level GeoGuessr" then I wouldn't have made my comment. The claim is much more specific and reflects the results of your post. You did in fact show that it doesn't need EXIF data to do what it does! BUT by framing it as "Beats a Master-Level" there is an implicit claim that both of you are playing the same game. The fact that you weren't is the issue.
Look at it this way. If I said I beat Tiger Woods at golf and then casually slipped in that I was playing with a handicap, wouldn't you feel a bit lied to? You'd think "Did Godelski really beat Tiger Woods?", and you would mean without the handicap. You'd have every right to be suspicious! And you'd have every right to dismiss me.
Most importantly, take a second here. My whole point is that you can make a much stronger claim! One where there wouldn't be a significant divergence between title and content. I get that it is frustrating to receive criticism, but even if you believe I'm wrong to do so, is it not more effective to show me up by just redoing it without search? If you do that, then you only end up with a stronger claim. But by disagreeing and arguing here you're just not convincing me. Even if you disagree with my interpretation of the title, you know full well that it is a valid interpretation. Given the pushback from other comments, I think you can't deny that it's an expected one. So the only way to resolve this is to either change the title or change the data. Besides, you responded to the top comment saying it was a fair criticism. All I've done is explain why the criticism was made in the first place!
And yes, it still undermines the result. Because that is entirely dependent on the (interpretation of the) claim that was made. Your results are still valid, but they only satisfy a weaker claim.
FWIW, I think the updated post is better. My comment here would only be that you could add clarity by showing the non-search scores (especially in the final table). In fact, the "study" being done with and without search makes a stronger post than had it only been one way. So kudos!
You've clearly thought this through, and I agree that had I been more precise at the start it would have avoided some confusion. I'm glad you like the updated post.
This seems like a great example of why some are so concerned with AI alignment.
The game rules were ambiguous and the LLM did what it needed to (and was allowed to) to win. It probably is against the spirit of the game to look things up online at all but no one thought to define that rule beforehand.
No, the game rules aren't ambiguous. This is 100% unambiguously cheating. From the list of things that are definitely considered cheating in the rules:
> using Google or other external sources of information as assistance during play.
The contents of URLs found during play is clearly an external source of information.
> > Using Google during rounds is technically cheating - I’m unsure about visiting domains you find during the rounds though. It certainly violates the spirit of the game, but it also shows the models are smart enough to use whatever information they can to win
Going off of the source article here, the author at least wasn't clear on whether the rules only prohibit using Google or whether visiting any website is also against the rules.
And either way, my point was that the person defining the rules to the LLM was ambiguous. The potential risk of misalignment isn't that a perfect set of rules can't be defined, it's that the rules we do define will always be incomplete.
o3 already is an external source of information. It's an online service backed by an enormous model generated from an even more enormous corpus of text via an enormous amount of computing power.
The GeoGuessr Community Rules and Terms of Service strongly imply that users must be people, so we are already conceding that exception to the rules when we want computer systems to compete.
If I task an AI with "peace on earth" and the solution the AI comes up with is ripped from The X-Files and it kills everyone, it isn't good enough to say "that's cheating" or "that's not what I meant".
That's the alignment problem. We intended a certain set of rules but didn't define them completely, or there were conditions we didn't consider.
An AI wouldn't have to maliciously break a rule to go wrong. The point is that the system could do exactly what it was supposed to do: it plays within the given rules, but the outcomes aren't what we wanted or intended.
And in reality the set of rules we would need can never be fully explained.
Alignment is the goal of having an AI system understand what we would want it to do even when the rules weren't predefined. That's an impossible task, or rather it's seemingly impossible and we don't yet know how to do it.
An AI being better than a human at doing a google search and then skimming a bunch of pages to find location-related terms isn't as interesting of a result.
How the heck is it not? Computers are looking at screenshots and searching the internet to support their "thinking", that's amazing! Have we become so used to AI that what was impossible 6 months ago is shruggable today?
I've been doing this MIND-Dash diet lately and it's amazing that I can just take a picture of whatever (nutritional info / ingredients are perfect for that) and ask if it fits my plan, and it tells me what bucket it falls into, with a detailed breakdown of macros in support of some additional goals I have (muscle building for powerlifting). It's amazing! And it does passively in 2 minutes what would take me 5-10 minutes of active searching.
I fully expect that someday the news will announce, "The AI appears to be dismantling the moons of Jupiter and turning them into dense, exotic computational devices which it is launching into low solar orbit. We're not sure why. The AI refused to comment."
And someone will post, "Yeah, but that's just computer-aided design and manufacturing. It's not real AI."
The first rule of AI is that the goalposts always move. If a computer can do it, by definition, it isn't "real" AI. This will presumably continue to apply even as the Terminator kicks in the front door.
Yes, but I choose to interpret that as a good thing. It is good that progress is so swift and steady that we can afford to keep moving the goalposts.
Take cars as a random example: progress there isn't fast enough that we keep moving the goalposts for eg fuel economy. (At least not nearly as much.) A car with great fuel economy 20 years ago is today considered at least still good in terms of fuel economy.
And if you account for the makeup of the fleet on the road overall, a great fuel economy car from 1995 (say, a Prizm) still beats the median vehicle on the road, which is certainly an SUV weighing twice as much and getting worse mileage.
In the same way, a calculator performing arithmetic faster than humans isn't impressive. The same way running a regex over a million lines, with the computer beating a human at search, isn't impressive.
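To make the regex comparison concrete, here's a throwaway sketch (the pattern and data are made up):

```python
import re

# A million filler lines plus one containing coordinates.
lines = [f"record {i}" for i in range(1_000_000)]
lines.append("camera at lat=48.8584 lon=2.2945")

coord = re.compile(r"lat=(-?\d+\.\d+)\s+lon=(-?\d+\.\d+)")

# The machine scans the whole list in a fraction of a second,
# something no human could do, yet nobody calls this "intelligence".
hits = [m.groups() for line in lines if (m := coord.search(line))]
```

Mechanical, superhuman, and completely unimpressive to us now.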
Neither is impressive solely because we've gotten used to them. Both were mind-blowing back in the day.
When it comes to AI - and LLMs in particular - there’s a large cohort of people who seem determined to jump straight from "impossible and will never happen in our lifetime" to "obvious and not impressive", without leaving any time to actually be impressed by the technological achievement. I find that pretty baffling.
I agree, but without removing search you cannot decouple. Has it embedded a regex-like method and is just leveraging that? Or is it doing something more? Yes, even the regex is still impressive, but it is less impressive than doing something more complicated, understanding context and more depth.
I think both are very impressive, world shattering capabilities. Just because they have become normalized doesn't make it any less impressive in my view.
That's a fair point, and I would even agree. Though I think we could agree that it is fair to interpret "impressive" in this context as "surprising". There's lots of really unsurprising things that are incredibly impressive. But I think the general usage of the word here is more akin to surprisal.
Yeah, it's a funny take because this is in fact a more advanced form of AI with autonomous tool use that is just now emerging in 2025. You might say "They could search the web in 2024 too" but that wasn't autonomous on its own, but required telling so or checking a box. This one is piecing ideas together like "Wait, I should Google for this" and that is specifically a new feature for OpenAI o3 that wasn't even in o1.
While it isn't entirely in the spirit of GeoGuessr, it is a good test of the capabilities, where being great at GeoGuessr in fact becomes the lesser news here. It would still be even with this feature disabled.
That isn't what's happening though. I re-ran those two rounds, this time without search, and it changed nothing. I updated the post with details, you can verify it yourself.
Claiming the AI is just using Google is false and dismissing a truly incredible capability.