
To be fair, a lot of the impressive Elo scores models get come down to speed: many serious competitive coders could get the same or better results given enough time.

But seeing these results I'd be surprised if by the end of the decade we don't have something that is to these puzzles what Stockfish is to chess. Effectively ground truth and often coming up with solutions that would be absolutely ridiculous for a human to find within a reasonable time limit.



I’d love it if anyone could provide examples of such AND(“ground truth”, “absolutely ridiculous”) solutions! Even if they took clever humans a long time to create.

I’m curious to explore such fun programming code. But I’m also curious to explore what knowledgeable humans consider to be both “ground truth” as well as “absolutely ridiculous” to create within the usual time constraints.


I'm not explaining myself right.

Stockfish is a superhuman chess program. It's routinely used in chess analysis as "ground truth": if Stockfish says you've made a mistake, it's almost certain you did in fact make a mistake[0]. Also, because it's incomparably stronger than even the very best humans, sometimes the moves it suggests are extremely counterintuitive and it would be unrealistic to expect a human to find them in tournament conditions.

Obviously software development in general is way more open-ended, but if we restrict ourselves to puzzles and competitions, which are closed game-like environments, it seems plausible to me that a similar skill level could be achieved with an agent system that's RL'd to death on that task. If you have base models that can get there, even inconsistently so, and an environment where making a lot of attempts is cheap, that's the kind of setup that RL can optimize to the moon and beyond.
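
To make the "lots of cheap attempts" part concrete, here is a toy loop (purely illustrative, nothing like a real training stack; the random guesser is just a stand-in for a model proposing programs): the only reward signal is whether automatically run tests pass, which is cheap to check, so you can afford to sample and score candidates by the thousands.

    import random

    # Toy sketch only: the "model" is a random guesser over three hard-coded
    # hypotheses. A real system would sample programs from an LLM and feed the
    # pass/fail reward back into RL; everything here is a stand-in.

    TESTS = [((2, 3), 5), ((10, -4), 6), ((0, 0), 0)]      # spec: add(a, b) == a + b

    def passes(candidate):
        return all(candidate(*args) == want for args, want in TESTS)

    HYPOTHESES = [lambda a, b: a - b, lambda a, b: a * b, lambda a, b: a + b]

    attempts = 0
    while True:
        attempts += 1
        candidate = random.choice(HYPOTHESES)              # "generate an attempt"
        if passes(candidate):                              # "verify it automatically"
            print(f"passing program found after {attempts} cheap attempts")
            break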

I don't predict the future, and I'm very skeptical of anybody who claims to do so; correctly predicting the present is already hard enough. I'm just saying that, given the progress we've already made, I find it plausible that a system like that could be built in a few years. The details of what it would look like are beyond my pay grade.

---

[0] With caveats in endgames, closed positions and whatnot; I'm just using it as an example.


Yeah, it is often flagged as a brilliancy in game analysis when a GM makes a move that an engine says is bad but that turns out to be good. However, it only happens in very specific positions.


Does that happen because the player understands some tendency of their opponent that will cause them to not play optimally? Or is it genuinely some flaw in the machine’s analysis?


Both, but perhaps more often neither.

From what I've seen, sometimes the computer correctly assesses that the "bad" move opens up some kind of "checkmate in 45 moves" that could technically happen, but only if the opponent sees it 45 moves ahead of time and plays something that would otherwise appear completely sub-optimal until something like 35 moves in, at which point a normal peak grandmaster would finally go "oh okay, now I get the point of all that confusing play, and I can see that I'm going to get mated in 10 moves".

So, the computer is "right" - that move is worse if you're playing a supercomputer. But it's "wrong" because that same move is better as long as you're playing a human, who will never be able to see an absurd thread-the-needle forced play 45-75 moves ahead.

That said, this probably isn't what GP was referring to, as a move wouldn't be marked "brilliant" simply for failing to see the impossible-to-actually-play line.


This is similar to game theory optimal poker. The optimal move is predicated on later making optimal moves. If you don’t have that ability (because you’re human) then the non-optimal move is actually better.

Poker is funny because you have humans emulating human-beating machines, but that's hard enough to do that players who don't do this still win as well.
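
A toy river-call EV calculation (made-up numbers, not solver output) shows the same shape: the "theoretically correct" bluffing frequency makes the call exactly break-even, so its value only holds against an opponent who actually plays that way; once a human deviates, the practically best response deviates too.

    from fractions import Fraction as F

    # Toy numbers, purely illustrative. River spot: pot = 100, opponent bets 100.
    # We beat all bluffs and lose to all value bets.
    def call_ev(pot, bet, bluff_freq):
        win = bluff_freq * (pot + bet)       # we pick off a bluff
        lose = (1 - bluff_freq) * (-bet)     # we pay off a value bet
        return win + lose

    pot, bet = 100, 100
    for f in (F(1, 3), F(15, 100), F(1, 2)):
        print(f"bluff frequency {float(f):.2f}: EV(call) = {float(call_ev(pot, bet, f)):+.1f}")
    # 1/3 is the bluff frequency that makes calling break-even here; against a
    # human who bluffs more or less than that, always calling or always folding
    # (the "non-optimal" play) is what actually gains.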


I think this is correct for modern engines. Usually, these moves are open to a very particular line of counterplay that no human would ever find because they rely on some "computer" moves. Computer moves are moves that look dumb and insane but set up a very long line that happens to work.


It does happen that the engine doesn't immediately see that a line is best, but that's getting very rare these days. A few years back it was funny to watch the engine "change its mind" in certain positions, including in older games where some grandmaster found a line that was particularly brilliant, completely counter-intuitive even for an engine, AND correct.

But mostly what happens is that a move isn't so good, but it isn't so bad either: the computer will tell you it is sub-optimal, yet a human won't be able to refute it over the board, so the opponent's practical (as opposed to theoretical) chances are reduced. One great example of that is Pentala Harikrishna's recent queen sacrifice in the World Cup, an amazing conception that the computer says is borderline incorrect, but which leads to such complications and such an uncomfortable position for his opponent that it was practically a great choice.


It can be either one. In closed positions, it is often the latter.


It's only the latter if it's a weak browser engine, and it's early enough in the game that the player has studied the position with a cloud engine.


> Yeah, it is often pointed out as a brilliance in game analysis if a GM makes a move that an engine says is bad and turns out to be good.

Do you have any links? I haven't seen any such example (forget GMs, not even Magnus), barring the opponent making mistakes.


Here’s a Chess Stack Exchange thread of positions that stump engines:

https://chess.stackexchange.com/questions/29716/positions-th...

It basically comes down to “ideas that are rare enough that they were never programmed into a chess engine”.

Blockades or positions where no progress is possible are a common theme. Engines will often keep tree searching where a human sees an obvious repeating pattern.

Here’s also an example where two engines are playing, and DeepMind's engine finds a move that I think would be obvious to most grandmasters, yet Stockfish misses it: https://youtu.be/lFXJWPhDsSY?si=zaLQR6sWdEJBMbIO

That being said, I’m not sure that this necessarily correlates with brilliancy. There are a few of these that I would probably get in classical time and I’m not a particularly brilliant player.


Stockfish dropped hand-crafted evaluation entirely in 2023.


As far as I can tell, it’s still the case that the evaluation network hasn’t seen enough examples of blockades to understand them. It handles some very simple ones (in fact I’ve seen Stockfish/AlphaZero execute quite clever blockades before), but there’s still a gap where humans understand them better.


It used to happen way more often with Magnus and classical versions of Stockfish from the pre-AlphaZero/Leela Zero days. Since the neural-network versions of Stockfish, I don't think it happens anymore.


Maybe he means not the best move but an almost equally strong move?

Because ya, that doesn't happen lol.


I would love to examine Stockfish play that seemed extremely counterintuitive but which ended up winning. How can I do so? (I don't inhabit any of the current chess spaces so have no idea where to look, but my son is approaching the age where I can start to teach him...).

That said, chess is such a great human invention. (Go is up there too. And Texas no-limit hold'em poker. Those are my top 3 votes for "best human tabletop games ever invented". They're also, perhaps not coincidentally, the hardest for computers to be good at. Or, were.)


The problem is that Stockfish is so strong that the only way to have it play meaningful games is to put it against other computers. Chess engines play each other in automated competitions like TCEC.

If you look on YouTube, there are many channels where strong players analyze these games. As Demis Hassabis once put it, it's like chess from another dimension.
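
If you'd rather poke at the engine directly, a minimal local setup (assuming you have a Stockfish binary on your PATH and the python-chess package installed; both are assumptions about your machine, not something TCEC provides) looks roughly like this:

    import chess
    import chess.engine

    # Assumes a Stockfish binary reachable as "stockfish"; adjust the path if not.
    engine = chess.engine.SimpleEngine.popen_uci("stockfish")

    board = chess.Board()  # starting position; paste in any FEN you're curious about
    # Ask for the engine's top three lines at a fixed depth.
    infos = engine.analyse(board, chess.engine.Limit(depth=25), multipv=3)
    for info in infos:
        score = info["score"].pov(board.turn)     # e.g. Cp(+30) or Mate(+12)
        line = board.variation_san(info["pv"])    # the principal variation in SAN
        print(score, line)

    engine.quit()

Drop in a position from one of those TCEC games and you can watch the evaluations and the counterintuitive lines it prefers.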


> I would love to examine Stockfish play that seemed extremely counterintuitive but which ended up winning.

If you want to see this against someone like Magnus, it is rare, as super GMs do not spend a lot of time playing engines publicly.

But if you want to see them against a normal chess master, somewhere between master and international master, it is everywhere. For example, this guy analyses every one of his matches afterwards, and you frequently hear "oh, I would never see that line":

https://www.youtube.com/playlist?list=PLp7SLTJhX1u6zKT5IfRVm...

(start watching around 1000+ to see those moments frequently)


I recommend Matthew Sadler's Game Changer and The Silicon Road To Chess Improvement.


You explained yourself right. The issue is that you keep qualifying your statements.

> it suggests are extremely counterintuitive and it would be unrealistic to expect a human to find them...

> ... in tournament conditions.

I'm suggesting that I'd like to see the ones that humans have found - outside of tournament conditions. Perhaps the gulf between us arises from an unspoken reference to solutions "unrealistic to expect a human to find" without the window-of-time qualifier?


I can wreck stockfish in chess boxing. Mostly because stockfish can't box, and it's easy for me to knock over a computer.


If it runs on a mainframe you would lose both the chess and the boxing.


Are there really boxing capable mainframes nowadays?

Otherwise I think the mainframe would lose because of being too passive


The point of that qualifier is that you can expect to see weird moves outside of tournament conditions, because casual games are when people experiment with that kind of thing.


How are they faster? I don’t think any Elo report actually comes from participating in a live coding contest on previously unseen problems.


My background is more in math competitions, but all of those things are essentially speed contests. The skill comes from solving hard problems within a strict time limit. If you gave people twice the time, they'd do better, but time is never going to be an issue for a computer.

Comparing raw Elo ratings isn't very indicative IMHO, but I do find it plausible that in closed, game-like environments models could indeed achieve the superhuman performance the Elo comparison implies, see my other comment in this thread.



