> Using Google during rounds is technically cheating - I’m unsure about visiting domains you find during the rounds though. It certainly violates the spirit of the game, but it also shows the models are smart enough to use whatever information they can to win.
and had noted in the methodology that
> Browsing/tools — o3 had normal web access enabled.
Still an interesting result - maybe more accurate to say O3+Search beats a human, but could also consider the search index/cache to just be a part of the system being tested.
Pointing out that it is cheating doesn't excuse the lie in the headline. That just makes it a bait and switch, a form of fraud. OP knew they were doing a bait and switch.
I remember when we were all pissed about clickbait headlines because they were deceptive. Did we just stop caring?
I'm not sure why you're defending clickbait. It is just fraud. I'm not sure why we pretend it is anything different.
Sure, people made overblown claims about the effects, but that doesn't justify fraud. A little fraud is less bad than major fraud, but that doesn't mean it isn't bad.
On the one hand, you have SEO mills churning out crap and A/B testing clickthrough rates on different headline/image combinations. That's bad.
On the other hand, you have a blogger choosing a headline for a cool thing that they did and wrote up...
The author here writes up what happens. They include ample discussion of search in their write-up. They do not need to write the entire blog post in the title in order to avoid 'fraud.' Yeesh.
IDK what SEO mills and all that have to do with any of this. What other people do doesn't matter. If something is bad, then other people doing it, and doing it worse, doesn't make it not bad. There's no logic in that framing.
I'm not sure who you think is a fool, me or you. But either way, I don't find your rhetoric acceptable. I explained why I think the title significantly diverges from the content of the article. You're welcome to disagree but that argument will have nothing to do with SEO mills. It's insulting you'd think I'd accept such a silly retort. We're not comparing here, we're categorizing.
Sure, but o3 is itself already an online service backed by an enormous data set, so regardless of whether it also searched the web, it's clearly not literally "playing fair" against a human.
But it still bounds the competition. OP is skilled in the domain. I'm not, so if I wrote a post about how O3 beat me you'd be saying how mundane a result it is. I mean, I suck at GeoGuessr. Beating me isn't impressive. This is also a bound.
Bounded to...a model trained on virtually all publicly available text ever generated by humans. I wouldn't expect web searches to even help much unless they're turning up data from after the model was trained.
> Bounded to...a model trained on virtually all publicly available text ever generated by humans
Don't forget there's a lot of non-public data too!
I don't disagree, but my point is that some bound is better than no bound. I think we can certainly agree that some bounds are better than others. Obviously we won't ever have a fully equal comparison, but I think the bounds do allow for some insights to be gained. We just need to be cautious that those insights consider those bounds. (I believe we both are cautious about what insights can be gained. If you doubt me, see my other comments. I do push back on OP pretty hard.)
> unless they're turning up data from after the model was trained.
Only under the condition that the models perform lossless compression of all the data they were trained on. If the compression is lossy, then search will reduce that loss.
One of the rules bans the use of third-party software or scripts.
Any LLM attempting to play will lose because of that rule. So, if you know the rules, and you strictly adhere to them (as you seem to be doing), then there's no need to click on the link. You already know it's not playing by GeoGuessr rules.
That being said, if you are running a test, you are free to set the rules as you see fit and explain so, and under the conditions set by the person running the test, these are the results.
> Did we just stop caring?
We stopped caring about pedantry. Especially when the person being pedantic seems to cherry pick to make their point.
This doesn't mean you shouldn't try to make things as fair as possible. Yes, it would still technically violate the rules, but don't pretend like this is binary.
> We stopped caring about pedantry
Did we? You seem to be responding to my pedantic comment with a pedantic comment.
Can O3 Beat a Master-Level GeoGuessr?
How Good is O3 at GeoGuessr?
EXIF Does Not Explain O3's GeoGuessr Performance
O3 Plays GeoGuessr (EXIF Removed)
But honestly, OP had the foresight to remove EXIF data and memory from O3 to reduce contamination. The goal of the blog post was to show that O3 wasn't cheating. So by including search, they undermine the whole point of the post.
The problem really stems from a lack of foresight: a misunderstanding of the critiques they sought to address in the first place. A good engineer understands that when their users/customers/<whatever> make a critique, the actual gripe may not be properly expressed. You have to interpret your users' complaints. Here, the complaint was "cheating", not "EXIF" per se. The EXIF complaints were just a guess at the mechanism by which it was cheating. But the complaint was still about cheating.
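For what it's worth, the EXIF-removal step doesn't even need an imaging library. A minimal sketch of what that pre-processing could look like (assuming JPEG inputs; this is my guess at the idea, not necessarily how OP actually did it):

```python
def strip_exif(jpeg: bytes) -> bytes:
    """Drop APP1 (EXIF) segments from a JPEG byte stream.

    EXIF metadata, including GPS tags, lives in APP1 segments
    (marker 0xFFE1), which always appear before the scan data.
    """
    if jpeg[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    out = bytearray(b"\xff\xd8")
    i = 2
    while i + 4 <= len(jpeg):
        if jpeg[i] != 0xFF:
            break  # malformed stream; copy the remainder as-is
        marker = jpeg[i + 1]
        if marker == 0xDA:  # SOS: entropy-coded image data follows
            out += jpeg[i:]
            return bytes(out)
        length = int.from_bytes(jpeg[i + 2:i + 4], "big")
        segment = jpeg[i:i + 2 + length]
        if marker != 0xE1:  # keep every segment except APP1 (EXIF)
            out += segment
        i += 2 + length
    out += jpeg[i:]
    return bytes(out)
```

The point being: it's a mechanical transform, so doing it says nothing about the model's capability; it only rules out one cheating mechanism.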
>The goal of the blog post was to show that O3 wasn't cheating.
No, the goal of the post was to show that o3 has incredible geolocation abilities. It's through the lens of a Geoguessr player who has experience doing geolocation, and my perspective on whether the chain of thought is genuine or nonsense.
In Simon's original post, people were claiming that o3 doesn't have those capabilities, and we were fooled by a chain of thought that was just rationalizing the EXIF data. It only had the _appearance_ of capability.
The ability to perform web search doesn't undermine the claim that o3 has incredible geolocation abilities, because it still needs to have an underlying capability in order to know what to search. That's not true for simply reading EXIF data.
This is the best way I knew to show that the models are doing something really neat. Disagreements over the exact wording of my blog post title seem to be missing the point.
I think you misinterpret my point. The goal of your post is distinct from how people will interpret it. Plenty of times people intend one thing and get a different thing. That's life.
> In Simon's original post, people were claiming that o3 doesn't have those capabilities, and we were fooled by a chain of thought that was just rationalizing the EXIF data. It only had the _appearance_ of capability.
And this is the key part!
The people questioning O3's capabilities were concerned with cheating. Any mention of EXIF is a guess as to how it was cheating, but the suspicion is still that it is cheating. That's the critique!
If you framed the title as "O3 Does Not Need EXIF Data To Beat A Master-Level GeoGuessr" then I wouldn't have made my comment. The claim is much more specific and reflects the results of your post. You did in fact show that it doesn't need EXIF data to do what it does! BUT by framing it as "Beats a Master-Level" there is an implicit claim that both of you are playing the same game. The fact that you weren't is the issue.
Look at it this way. If I said I beat Tiger Woods at golf and then casually slipped in that I was playing with a handicap, wouldn't you feel a bit lied to? You'd think "Did Godelski really beat Tiger Woods?", and you would mean without the handicap. You'd have every right to be suspicious! And you'd have every right to dismiss me.
Most importantly, take a second here. My whole point is that you can make a much stronger claim! One where there wouldn't be a significant divergence between title and content. I get that it is frustrating to receive criticism, but even if you believe I'm wrong to do so, is it not more effective to show me up by just redoing it without search? If you do that, then you only end up with a stronger claim. But by disagreeing and arguing here you're just not convincing me. Even if you disagree with my interpretation of the title, you know full well that it is a valid interpretation. Given the pushback from other comments, I think you can't deny that it's an expected one. So the only way to resolve this is to either change the title or change the data. Besides, you responded to the top comment saying it was a fair criticism. All I've done is explain why the criticism was made in the first place!
And yes, it still undermines the result. Because that is entirely dependent on the (interpretation of the) claim that was made. Your results are still valid, but they only satisfy a weaker claim.
FWIW, I think the updated post is better. My comment here would only be that you could add clarity by showing the non-search scores (especially in the final table). In fact, the "study" being done with and without search makes a stronger post than had it only been one way. So kudos!
You've clearly thought this through, and I agree that had I been more precise at the start it would have avoided some confusion. I'm glad you like the updated post.
This seems like a great example of why some are so concerned with AI alignment.
The game rules were ambiguous and the LLM did what it needed to (and was allowed to) to win. It probably is against the spirit of the game to look things up online at all but no one thought to define that rule beforehand.
No, the game rules aren't ambiguous. This is 100% unambiguously cheating. From the list of things that are definitely considered cheating in the rules:
> using Google or other external sources of information as assistance during play.
The contents of URLs found during play is clearly an external source of information.
> > Using Google during rounds is technically cheating - I’m unsure about visiting domains you find during the rounds though. It certainly violates the spirit of the game, but it also shows the models are smart enough to use whatever information they can to win
Going off of the source article here, the author at least wasn't clear on whether the rules only prohibit using Google or whether visiting any website is also against the rules.
And either way, my point was that the person defining the rules to the LLM was ambiguous. The potential risk of misalignment isn't that a perfect set of rules can't be defined, it's that the rules we do define will always be incomplete.
o3 already is an external source of information. It's an online service backed by an enormous model generated from an even more enormous corpus of text via an enormous amount of computing power.
The GeoGuessr Community Rules and Terms of Service strongly imply that users must be people, so we are already conceding that exception to the rules when we want computer systems to compete.
If I task an AI with "peace on earth" and the solution the AI comes up with is ripped from The X-Files and it kills everyone, it isn't good enough to say "that's cheating" or "that's not what I meant".
That's the alignment problem. We intended a certain set of rules but didn't define them completely, or there were conditions we didn't consider.
An AI wouldn't have to maliciously break a rule to go wrong. The point is that the system could do exactly what it was supposed to do: it plays within the given rules, but the outcomes aren't what we wanted or intended.
And in reality the set of rules we would need can never be fully explained.
Alignment is the goal of having an AI system understand what we would want it to do even when the rules weren't predefined. That's an impossible task, or rather it's seemingly impossible and we don't yet know how to do it.
An AI being better than a human at doing a google search and then skimming a bunch of pages to find location-related terms isn't as interesting of a result.
How the heck is it not? Computers are looking at screenshots and searching the internet to support their "thinking", that's amazing! Have we become so used to AI that what was impossible 6 months ago is shruggable today?
I've been doing this MIND-Dash diet lately and it's amazing that I can just take a picture of whatever (nutritional info / ingredients are perfect for that) and ask if it fits my plan, and it tells me what bucket it falls into, with a detailed breakdown of macros in support of some additional goals I have (muscle building for powerlifting). It's amazing! And it does passively in 2 minutes what would take me 5-10 minutes of active searching.
I fully expect that someday the news will announce, "The AI appears to be dismantling the moons of Jupiter and turning them into dense, exotic computational devices which it is launching into low solar orbit. We're not sure why. The AI refused to comment."
And someone will post, "Yeah, but that's just computer-aided design and manufacturing. It's not real AI."
The first rule of AI is that the goalposts always move. If a computer can do it, by definition, it isn't "real" AI. This will presumably continue to apply even as the Terminator kicks in the front door.
Yes, but I choose to interpret that as a good thing. It is good that progress is so swift and steady that we can afford to keep moving the goalposts.
Take cars as a random example: progress there isn't fast enough that we keep moving the goalposts for eg fuel economy. (At least not nearly as much.) A car with great fuel economy 20 years ago is today considered at least still good in terms of fuel economy.
And if you account for the makeup of the fleet on the road overall, a great fuel economy car from 1995 (say, a Prizm) still beats the median vehicle on the road, which is certainly an SUV weighing twice as much and getting worse mileage.
In the same way, a calculator performing arithmetic faster than humans isn't impressive. The same way running a regex over a million lines, with the computer beating a human at search, isn't impressive.
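To make the regex comparison concrete, here's a throwaway sketch (the pattern and data are made up):

```python
import re

# A million filler lines plus one containing coordinates.
lines = [f"record {i}" for i in range(1_000_000)]
lines.append("camera at lat=48.8584 lon=2.2945")

coord = re.compile(r"lat=(-?\d+\.\d+)\s+lon=(-?\d+\.\d+)")

# The machine scans the whole list in a fraction of a second,
# something no human could do, yet nobody calls this "intelligence".
hits = [m.groups() for line in lines if (m := coord.search(line))]
```

Mechanical, superhuman, and completely unimpressive to us now.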
Neither is impressive solely because we've gotten used to them. Both were mind-blowing back in the day.
When it comes to AI - and LLMs in particular - there’s a large cohort of people who seem determined to jump straight from "impossible and will never happen in our lifetime" to "obvious and not impressive", without leaving any time to actually be impressed by the technological achievement. I find that pretty baffling.
I agree, but without removing search you cannot decouple. Has it embedded a regex-like method and is just leveraging that? Or is it doing something more? Yes, even the regex is still impressive, but it is less impressive than doing something more complicated, understanding context and more depth.
I think both are very impressive, world shattering capabilities. Just because they have become normalized doesn't make it any less impressive in my view.
That's a fair point, and I would even agree. Though I think we could agree that it is fair to interpret "impressive" in this context as "surprising". There's lots of really unsurprising things that are incredibly impressive. But I think the general usage of the word here is more akin to surprisal.
Yeah, it's a funny take because this is in fact a more advanced form of AI with autonomous tool use that is just now emerging in 2025. You might say "They could search the web in 2024 too" but that wasn't autonomous on its own, but required telling so or checking a box. This one is piecing ideas together like "Wait, I should Google for this" and that is specifically a new feature for OpenAI o3 that wasn't even in o1.
While it isn't entirely in the spirit of GeoGuessr, it is a good test of the capabilities, where being great at GeoGuessr in fact becomes the lesser news here. It would still be even with this feature disabled.
That isn't what's happening though. I re-ran those two rounds, this time without search, and it changed nothing. I updated the post with details, you can verify it yourself.
Claiming the AI is just using Google is false and dismissing a truly incredible capability.