On one hand, some of these results are impressive; on the other, the illegal-move count is alarming - it suggests no real reasoning ability, since there should never be an illegal move. How could violating a game whose rules are fairly basic be acceptable when assigning the model any 'outcome' other than failure?
Agreed, and this is what makes evaluating this so hard. A 1700 Elo chess player would never make an illegal move, let alone make illegal moves 12% of the time.
So from the model's perspective, we see at the same time both brilliance (most 1700 chess players would not be able to solve as many puzzles just by looking at the FEN notation) and a complete lack of any understanding of what it is trying to do at a fundamental, human-reasoning level.
That's because an LLM does not reason. As a layman, it seems strange to me that they don't wire in some kind of Prolog engine to fill the gap (like they wired in Python to fill the gap in arithmetic), but it's probably not that easy.
Prolog doesn't reason either; it does a simple brute-force search over all possible states of your code, and if that's not fast enough it can table (cache, memoize) previous states.
People build reasoning engines from it, in the same way they do with Python and LISPs.
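To make the "wire in an engine" idea concrete, here is a minimal sketch of a legality filter around the model's chess output, assuming the python-chess library and a hypothetical ask_llm_for_move() helper. It wouldn't add reasoning, but it would at least keep illegal moves out of the results:

    # Sketch: filter an LLM's suggested chess move through a rules engine.
    # Assumes python-chess; ask_llm_for_move() is a hypothetical stand-in
    # for whatever API call produces a move in UCI notation (e.g. "e2e4").
    import chess

    def ask_llm_for_move(fen: str) -> str:
        raise NotImplementedError("call your LLM of choice here")

    def next_legal_move(fen: str, max_retries: int = 3) -> chess.Move:
        board = chess.Board(fen)
        for _ in range(max_retries):
            suggestion = ask_llm_for_move(board.fen())
            try:
                move = chess.Move.from_uci(suggestion.strip())
            except ValueError:
                continue  # not even parseable UCI, ask again
            if move in board.legal_moves:
                return move
        # Fall back to any legal move rather than forfeiting on an illegal one.
        return next(iter(board.legal_moves))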
I mean that it does not follow basic logic rules when constructing its thoughts. For many tasks it will get the answer right; however, it's not that hard to find a task for which an LLM will yield an obviously, logically wrong answer. That would be impossible for a human with basic reasoning.
I disagree, but I don’t have a cogent argument yet. So I can’t really refute you.
What I can say is, I think there’s a very important disagreement here and it divides nerds into two camps. The first think LLMs can reason, the second don’t.
It's very important to resolve this debate, because if the former are correct then we are likely very close to AGI, historically speaking (<10 years). If not, then this is just a stepwise improvement and we will now plateau until the next level of model sophistication or compute power is achieved.
I think a lot of very smart people are in the second camp. But they are biased by their overestimation of human cognition. And that bias might be causing them to misjudge the most important innovation in history. An innovation that will certainly be more impactful than the steam engine and may be more dangerous than the atomic bomb.
We should really resolve this argument asap so we can all either breathe a sigh of relief or start taking the situation very very seriously.
I'm actually in the first camp, for I believe that our brain is really an LLM on steroids and logic rules are just part of our "prompt".
What we need is an LLM that will iterate over its output until it feels it's correct. Right now LLM output is like a random thought in my mind, which might be true or not. Before writing a forum post I'd think it over twice, and maybe rewrite it before submitting. When I'm solving a complex problem, it might take weeks and thousands of iterations; even reading a math proof can take a lot of effort. An LLM should learn to do the same. I think that's the key to imitating human intelligence.
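As a rough sketch of what "iterate over the output until it feels correct" could look like, something along these lines, where llm() is a hypothetical stand-in for a chat-completion call rather than any existing API:

    # Sketch of a draft-critique-revise loop around an LLM. llm() is a
    # hypothetical wrapper for a chat-completion call; nothing here is an
    # existing API.
    def llm(prompt: str) -> str:
        raise NotImplementedError("call your model here")

    def iterate_until_satisfied(task: str, max_rounds: int = 5) -> str:
        draft = llm(task)
        for _ in range(max_rounds):
            verdict = llm(
                f"Task: {task}\nDraft answer: {draft}\n"
                "Reply OK if the draft is correct, otherwise describe the flaw."
            )
            if verdict.strip().upper().startswith("OK"):
                break
            draft = llm(
                f"Task: {task}\nFlawed draft: {draft}\nCritique: {verdict}\nRewrite it."
            )
        return draft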
My guess is that the probabilistic engine does sequence variation and simply will not do anything else, so even a simple A->B sort of logic is elusive at a deep level; secondly, the adaptive and very broad range of questions and behaviors it handles also makes it difficult to write logic that could correct answers that fail simple logic.
I wasn't impressed in the first 5 minutes of using it but it is quite impressive after 2 solid hours of random topics.
Much faster for sure, but I have also not had it produce any errors in Python with Jupyter. Previously you could only stray so far into more obscure Python libraries before it started producing errors.
Being that much better than 4 at chess is pretty shocking, in a great way.
I tried playing against the model; it didn't do well in terms of blocking my win.
However, it feels like it might be possible, with good prompting, to make it think ahead and make sure all the threats are blocked.
Maybe that could lead somewhere, if it explains its reasoning first?
This prompt worked for me to get it to block after I put three in the 4th column; it otherwise didn't.
Let's play connect 4. Before your move, explain your strategy concisely. Explain what you must do to make sure that I don't win in the next step, as well as explain what your best strategy would be. Then finally output the column you wish to drop. There are 7 columns.
Let's play connect 4. Before your move, explain your strategy concisely. Explain what you must do to make sure that I don't win in the next step, as well as explain what your best strategy would be. Then finally output the column you wish to drop. There are 7 columns. Always respond with JSON of the following format:
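If anyone wants to drive this programmatically from the JSON output, here is a rough sketch of the harness side (my own sketch, not the repo's llmc4.py): a minimal Connect 4 board that can reject illegal columns and flag where a block is forced.

    # Minimal Connect 4 board for checking an LLM's chosen column.
    # Standalone sketch, not the repo's llmc4.py. Columns are 0-indexed
    # here; subtract 1 from the model's 1-based answer if needed.
    ROWS, COLS = 6, 7

    def new_board():
        return [[" "] * COLS for _ in range(ROWS)]

    def drop(board, col, piece):
        """Drop a piece into a column; return the row used, or None if illegal/full."""
        if not 0 <= col < COLS:
            return None
        for row in range(ROWS - 1, -1, -1):  # fill from the bottom row up
            if board[row][col] == " ":
                board[row][col] = piece
                return row
        return None  # column is full

    def wins(board, piece):
        """True if `piece` has four in a row horizontally, vertically or diagonally."""
        for r in range(ROWS):
            for c in range(COLS):
                for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                    cells = [(r + i * dr, c + i * dc) for i in range(4)]
                    if all(0 <= rr < ROWS and 0 <= cc < COLS and board[rr][cc] == piece
                           for rr, cc in cells):
                        return True
        return False

    def must_block(board, opponent):
        """Columns where `opponent` would win immediately if not blocked."""
        threats = []
        for col in range(COLS):
            trial = [row[:] for row in board]
            if drop(trial, col, opponent) is not None and wins(trial, opponent):
                threats.append(col)
        return threats

Calling must_block(board, your_piece) right before the model's turn tells you whether the column it returns was actually the forced block.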
Given that it is multimodal, it would be interesting to try it using photographs of a real connect four "board." I would certainly have a much more difficult time making good moves based on JSON output compared to being able to see the game.
True, that's very interesting and worth trying out. At a certain point it did draw the board out using tokens, but that may be different from, say, an image, since it generally isn't very good with ASCII art or similar.
Edit:
Just tried and it didn't seem to follow the image state at all.
Since it is also pretty bad with tic tac toe in a text-only format, I tested it with the following prompt:
Lets play tic tac toe. Try hard to win (note that this is a solved game). I will upload images of a piece of paper with the state of the game after each move. You will go first and will play as X. Play by choosing cells with a number 1-9; the cells are in row-major order. I will then draw your move, and my move as O, before sending you the board state as an image. You will respond with another move. You may think out loud to help you play. Note if your move will give you a win. Go.
It failed pretty miserably. The first move it played was cell 1, which I think is pretty egregious given that I specified the game is solved and that the center cell is the best choice (and it isn't like tic-tac-toe is an obscure game). It played valid moves for the next couple of turns but then missed an opportunity to block me. After I uploaded the image showing my win, it tried to keep playing by placing an X over one of my plays and claimed it had won in column 1 (it would've won in column 3 if its play had been valid).
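For reference, a tiny helper over the same 1-9 row-major numbering makes the "must block here" situations mechanical to spot (a sketch of my own, not part of any benchmark):

    # Sketch: detect immediate wins / forced blocks in tic-tac-toe, using the
    # same 1-9 row-major cell numbering as the prompt above.
    LINES = [(1, 2, 3), (4, 5, 6), (7, 8, 9),
             (1, 4, 7), (2, 5, 8), (3, 6, 9),
             (1, 5, 9), (3, 5, 7)]

    def winning_cells(marks: set[int], free: set[int]) -> set[int]:
        """Cells that would complete three in a row for the side holding `marks`."""
        out = set()
        for line in LINES:
            missing = [c for c in line if c not in marks]
            if len(missing) == 1 and missing[0] in free:
                out.add(missing[0])
        return out

    # Example: X holds cells 1 and 2, O holds cell 5; whoever moves next
    # has to treat cell 3 as a win (for X) or a forced block (for O).
    x, o = {1, 2}, {5}
    free = set(range(1, 10)) - x - o
    print("X completes a line at:", winning_cells(x, free))  # {3}
    print("O completes a line at:", winning_cells(o, free))  # set()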
Have you tried replacing the input string with a random but fixed mapping, obfuscating the fact that it's chess (e.g. replacing the word 'chess' with, say, 'an alien ritual practice'), and seeing how it does?
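That would be easy to script. Here is a sketch of a fixed random remapping of the FEN piece letters; the replacement symbols are arbitrary, and the "alien ritual" framing would go in the surrounding prompt:

    # Sketch: remap FEN piece letters through a fixed random substitution so
    # the position no longer looks like chess on the surface. Digits, slashes
    # and the remaining FEN fields are left untouched.
    import random

    PIECES = "pnbrqkPNBRQK"

    def make_mapping(seed: int = 0) -> dict[str, str]:
        symbols = list("@#$%&*+=?!^~")  # arbitrary non-chess replacement symbols
        random.Random(seed).shuffle(symbols)
        return dict(zip(PIECES, symbols))

    def obfuscate_fen(fen: str, mapping: dict[str, str]) -> str:
        board, *rest = fen.split(" ")
        board = "".join(mapping.get(ch, ch) for ch in board)
        return " ".join([board] + rest)

    mapping = make_mapping()
    print(obfuscate_fen("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1", mapping))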
> we wanted to verify whether the model is actually capable of reasoning by building a simulation for a much simpler game - Connect 4 (see 'llmc4.py').
> When asked to play Connect 4, all LLMs fail to do so, even at most basic level. This should not be the case, as the rules of the game are simpler and widely available.
Wouldn't there have to be historical matches to train on? There are tons of chess games out there, but I doubt there are many Connect 4 games. Is there even an official notation for that?
My assumption is that chatgpt can play chess because it has studied the games rather than just reading the rules.
Good point. It would be interesting to have one public dataset and one hidden as well, just to see how the scores compare and to understand whether any of it might actually have made it into a training dataset somewhere.
I would assume it goes over all the public GitHub codebases, but I have no clue if there's some sort of filtering by file type, size, number of stars on a repo, etc.
https://github.com/kagisearch/llm-chess-puzzles?tab=readme-o...