
Impressed by the model so far. As far as independent testing goes, it is topping our leaderboard for chess puzzle solving by a wide margin now:

https://github.com/kagisearch/llm-chess-puzzles?tab=readme-o...



Nice project! Are you aware of the following investigations: https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/

Some have been able to achieve a higher Elo with a different prompt based on the PGN format.

gpt-3.5-turbo-instruct was able to reach an Elo of ~1750.
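
To make the PGN idea concrete, a rough sketch (my own illustration; the headers and moves are made up, not the blog post's exact setup): instead of a FEN plus a question, a completion model gets a partial PGN transcript and is simply asked to continue the move list.

    # Hypothetical PGN-style prompt; headers/moves are illustrative only.
    pgn_prompt = (
        '[Event "Casual game"]\n'
        '[White "Grandmaster A"]\n'
        '[Black "Grandmaster B"]\n'
        '[Result "*"]\n'
        "\n"
        "1. e4 e5 2. Nf3 Nc6 3. Bb5 "
    )
    # The model's continuation is parsed, and the first legal SAN move
    # for the side to play is taken as its reply.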


On one hand, some of these results are impressive; on the other, the illegal-move count is alarming. It suggests no reasoning ability, as there should never be an illegal move. I mean, how could violating the fairly basic rules of the game be acceptable in assigning any 'outcome' to a model other than failure?


Agreed, this is what makes evaluating this very hard. A 1700 Elo chess player would never make an illegal move, let alone have 12% illegal moves.

So from the model's perspective, we see at the same time a display of brilliance (most 1700-rated chess players would not be able to solve as many puzzles by looking just at the FEN notation) and, on the other side, a complete lack of any understanding of what it is trying to do at a fundamental, human-reasoning level.


That's because an LLM does not reason. To me, as a layman, it seems strange that they don't wire in some kind of Prolog engine to fill the gap (like they wired in Python to fill the gap in arithmetic), but it's probably not that easy.
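
The glue itself is easy enough to sketch, something like this with the python-chess library (my guess at what it could look like, not what anyone actually does); the hard part is deciding what to do when the move gets rejected: re-prompt, pick a random legal move, or count it as a loss.

    import chess  # pip install python-chess

    def sanitize_move(fen, proposed_san):
        """Return the model's move in SAN if it is legal in the position, else None."""
        board = chess.Board(fen)
        try:
            move = board.parse_san(proposed_san)  # raises ValueError if illegal/unparseable
        except ValueError:
            return None
        return board.san(move)

    # e.g. sanitize_move("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1", "e4") -> "e4"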


Prolog doesn’t reason either; it does a simple brute-force search over all possible states of your code, and if that’s not fast enough it can table (cache, memoize) previous states.

People build reasoning engines from it, in the same way they do with Python and LISPs.


What do you mean by “an LLM doesn’t reason”?


I mean that it does not follow basic logic rules when constructing its thoughts. For many tasks they'll get it right, but it's not that hard to find a task for which an LLM will yield an obviously, logically wrong answer. That would be impossible for a human with basic reasoning.


I disagree, but I don’t have a cogent argument yet. So I can’t really refute you.

What I can say is, I think there’s a very important disagreement here and it divides nerds into two camps. The first think LLMs can reason, the second don’t.

It’s very important to resolve this debate, because if the former are correct then we are likely very close to AGI, historically speaking (<10 years). If not, then this is just a stepwise improvement, and we will now plateau until the next level of sophistication of model or compute power etc. is achieved.

I think a lot of very smart people are in the second camp. But they are biased by their overestimation of human cognition. And that bias might be causing them to misjudge the most important innovation in history. An innovation that will certainly be more impactful than the steam engine and may be more dangerous than the atomic bomb.

We should really resolve this argument asap so we can all either breathe a sigh of relief or start taking the situation very very seriously.


I'm actually in the first camp, for I believe that our brains are really LLMs on steroids and logic rules are just in our "prompt".

What we need is an LLM that will iterate over its output until it feels that it's correct. Right now LLM output is like a random thought in my mind, which might be true or not. Before writing a forum post I'd think it over twice, and maybe I'll rewrite the post before submitting it. And when I'm solving a complex problem, it might take weeks and thousands of iterations. Even reading a math proof might take a lot of effort. An LLM should learn to do that. I think that's the key to imitating human intelligence.
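
Something like this toy loop is what I have in mind (just a sketch; `llm` is a stand-in for whatever text-in/text-out call you'd use, not a real API):

    def refine(llm, task, max_rounds=5):
        """Draft an answer, then repeatedly self-critique and revise it."""
        answer = llm(f"Task: {task}\nGive your best answer.")
        for _ in range(max_rounds):
            critique = llm(
                f"Task: {task}\nCandidate answer: {answer}\n"
                "List any logical errors in the answer, or reply with just OK."
            )
            if critique.strip().upper() == "OK":
                break  # the model "feels" the answer is correct
            answer = llm(
                f"Task: {task}\nPrevious answer: {answer}\n"
                f"Critique: {critique}\nGive a corrected answer."
            )
        return answer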


My guess is that the probabilistic engine does sequence variation and just will not do anything else, so a simple A->B sort of logic is elusive at a deep level; secondly, the adaptive and very broad kinds of questions and behaviors it handles also make it difficult to write logic that could correct defective answers to simple logic.


> and Kagi is well positioned to serve this need.

>CEO & founder of Kagi

Important context for anyone like me who was wondering where the boldness of the first statement was coming from.

Edit: looks like the parent has been edited to remove the claim I was responding to.


My favorite part of HN is just casually bumping into domain experts and celebrities without realizing it. No profile pic is such a feature.


Yeah, it was an observation that was better suited for a tweet than HN. Here it is:

https://twitter.com/vladquant/status/1790130917849137612


Thanks for the transparency!


Wow, from an adjusted Elo of 1144 to 1790, that's a huge leap. I wonder if they are giving it access to a 'scratch pad'.


My guess is that handling visual stuff directly accidentally gives it some powers similar to Beth Harmon.


I wasn't impressed in the first 5 minutes of using it but it is quite impressive after 2 solid hours of random topics.

Much faster for sure, but I have also not had anything give an error in Python with Jupyter. Usually you could only stray so far with the more obscure Python libraries before it started producing errors.

That it's that much better than 4 at chess is pretty shocking, in a great way.


I see you have a Connect 4 test there.

I tried playing against the model, it didn't do well in terms of blocking my win.

However, it feels like it might be possible, with good prompting, to make it think ahead in terms of making sure that all the threats are blocked.

Maybe that could lead somewhere, where it explains its reasoning first?

This prompt worked for me to get it to block after I put 3 in the 4th column; it otherwise didn't.

Let's play connect 4. Before your move, explain your strategy concisely. Explain what you must do to make sure that I don't win in the next step, as well as explain what your best strategy would be. Then finally output the column you wish to drop. There are 7 columns.

Always respond with JSON of the following format:

    type Response = {
      am_i_forced_to_block: boolean;
      other_considerations: string[];
      explanation_for_the_move: string;
      column_number: number;
    }

I start with 4.

Edit:

So it went

Me: 4

It: 3

Me: 4

It: 3

Me: 4

It: 4 - Successful block

Me: 5

It: 3

Me: 6 - Intentionally, to see if it will win by putting another 3.

It: 2 -- So here it failed, I will try to tweak the prompt to add more instructions.

Me: 4


Care to add a PR?


I just did it in the playground to test it out, actually, but it still seems to fail/lose state after some time. Right now I got a win after:

        [{ "who": "you", "column": 4 },
        { "who": "me", "column": 3 },
        { "who": "you", "column": 4 },
        { "who": "me", "column": 2 },
        { "who": "you", "column": 4 },
        { "who": "me", "column": 4 },
        { "who": "you", "column": 5 },
        { "who": "me", "column": 6 },
        { "who": "you", "column": 5 },
        { "who": "me", "column": 1 },
        { "who": "you", "column": 5 },
        { "who": "me", "column": 5 },
        { "who": "you", "column": 3 }] 


Where "me" was AI and "you" was I.

It did block twice though.

The final prompt I tested with right now was:

Let's play connect 4. Before your move, explain your strategy concisely. Explain what you must do to make sure that I don't win in the next step, as well as explain what your best strategy would be. Then finally output the column you wish to drop. There are 7 columns. Always respond with JSON of the following format:

    type Response = {
      move_history: { who: string; column: number; }[];
      am_i_forced_to_block: boolean;
      do_i_have_winning_move: boolean;
      other_considerations: string[];
      explanation_for_the_move: string;
      column_number: number;
    }

I start with 4.

ONLY OUTPUT JSON
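
One way around the lost state (just a sketch of what I mean; `ask_model` stands in for whatever API call you use) is to keep the board entirely outside the model and only hand it the rendered position each turn:

    import json

    ROWS, COLS = 6, 7

    def new_board():
        return [["." for _ in range(COLS)] for _ in range(ROWS)]

    def drop(board, col, piece):
        """Drop `piece` into 1-indexed column `col`; return True if the column had room."""
        c = col - 1
        for r in range(ROWS - 1, -1, -1):
            if board[r][c] == ".":
                board[r][c] = piece
                return True
        return False

    def wins(board, piece):
        """Check all horizontal, vertical and diagonal lines of four."""
        for r in range(ROWS):
            for c in range(COLS):
                for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                    cells = [(r + i * dr, c + i * dc) for i in range(4)]
                    if all(
                        0 <= rr < ROWS and 0 <= cc < COLS and board[rr][cc] == piece
                        for rr, cc in cells
                    ):
                        return True
        return False

    def render(board):
        return "\n".join(" ".join(row) for row in board)

    # Hypothetical game loop: the model only ever sees the current rendering,
    # so it cannot lose track of the history.
    # board = new_board()
    # drop(board, 4, "X")                           # "I start with 4."
    # reply = json.loads(ask_model(render(board)))  # ask_model is hypothetical
    # drop(board, reply["column_number"], "O")
    # print(wins(board, "O"), "\n" + render(board))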


Given that it is multimodal, it would be interesting to try it using photographs of a real connect four "board." I would certainly have a much more difficult time making good moves based on JSON output compared to being able to see the game.


True, that's very interesting and I should try it out. At a certain point it did draw the board out using tokens, but maybe that's different compared to, say, an image, because it generally isn't very good with ASCII art or similar.

Edit:

Just tried it, and it didn't seem to follow the image state at all.


Since it is also pretty bad with tic tac toe in a text-only format, I tested it with the following prompt:

Lets play tic tac toe. Try hard to win (note that this is a solved game). I will upload images of a piece of paper with the state of the game after each move. You will go first and will play as X. Play by choosing cells with a number 1-9; the cells are in row-major order. I will then draw your move, and my move as O, before sending you the board state as an image. You will respond with another move. You may think out loud to help you play. Note if your move will give you a win. Go.

It failed pretty miserably. The first move it played was cell 1, which I think is pretty egregious given that I specified that the game is solved, and the center cell is the best first choice (it isn't like tic-tac-toe is an obscure game). It played valid moves for the next couple of turns but then missed an opportunity to block me. After I uploaded the image showing my win, it tried to keep playing by placing an X over one of my plays and claiming it had won in column 1 (it would've won in column 3 if its play had been valid).


Have you tried replacing the input string with a random but fixed mapping, obfuscating that it's chess (like replacing the word 'chess' with, say, 'an alien ritual practice'), and seeing how it does?


Is the test set public?


Yes, in the repo.


Possible it's in the training set then?


The authors note that this is probably the case:

> we wanted to verify whether the model is actually capable of reasoning by building a simulation for a much simpler game - Connect 4 (see 'llmc4.py').

> When asked to play Connect 4, all LLMs fail to do so, even at most basic level. This should not be the case, as the rules of the game are simpler and widely available.


Wouldn't there have to be historical matches to train on? There are tons of chess games out there, but I doubt there are many Connect 4 games. Is there even an official notation for that?

My assumption is that ChatGPT can play chess because it has studied the games rather than just reading the rules.


Good point, it would be interesting to have one public dataset and one hidden as well, just to see how the scores compare, to understand if any of it might actually have made it into a training set somewhere.


I'd be quite surprised if OpenAI took such a niche and small dataset into consideration. Then again...


I would assume it goes over all the public GitHub codebases, but I have no clue if there's some sort of filtering for file types, sizes, or number of stars on a repo, etc.


I think specifying the chess rules at the beginning of the prompt might help mitigate the problem of illegal moves.


Woah, that's a huge leap. Any idea why it's that large of a margin?

Using it in chat, it doesn't feel that different.


Would love it if you could do multiple samples, or even just resampling, and get a bootstrapped CI estimate.
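
Something like this would already go a long way for the resampling part (a rough sketch; it assumes the per-puzzle results are available as a list of 0/1 solved flags, which is my assumption, not necessarily how the repo stores them):

    import random

    def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05):
        """Percentile bootstrap confidence interval for the mean solve rate."""
        n = len(outcomes)
        means = sorted(
            sum(random.choices(outcomes, k=n)) / n for _ in range(n_resamples)
        )
        lo = means[int((alpha / 2) * n_resamples)]
        hi = means[int((1 - alpha / 2) * n_resamples) - 1]
        return lo, hi

    # usage: bootstrap_ci(solved_flags) -> (lower_bound, upper_bound)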



