I'm seriously fed up with all this fact-free AI hype. Whenever an LLM regurgitates training data, it's heralded as the coming of AGI. Whenever it's shown that these models can't solve any novel problem, the research is in bad faith (but please make sure to publish the questions so that the next model version can solve them -- of course completely by chance).
Here's a quote from the article:
> How many humans can sit down and correctly work out a thousand Tower of Hanoi steps? There are definitely many humans who could do this. But there are also many humans who can’t. Do those humans not have the ability to reason? Of course they do! They just don’t have the conscientiousness and patience required to correctly go through a thousand iterations of the algorithm by hand. (Footnote: I would like to sit down all the people who are smugly tweeting about this with a pen and paper and get them to produce every solution step for ten-disk Tower of Hanoi.)
In case someone imagines that fancy recursive reasoning is necessary to solve the Towers of Hanoi, here's the algorithm to move 10 (or any even number of) disks from peg A to peg C:
1. Move one disk from peg A to peg B or vice versa, whichever move is legal.
2. Move one disk from peg A to peg C or vice versa, whichever move is legal.
3. Move one disk from peg B to peg C or vice versa, whichever move is legal.
4. Goto 1.
Second-graders can follow that, if motivated enough.
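If anyone wants to see just how mechanical it is, here is that exact loop as a minimal Python sketch (the peg names and the ten-disk setup are just my illustration, not anything from the paper):

```python
# The quoted loop for an even number of disks: repeatedly make the single
# legal move between A-B, then A-C, then B-C, until everything sits on C.
def solve_hanoi(n=10):
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # A holds disks n..1, largest at the bottom
    moves = []

    def legal_move(x, y):
        # Exactly one direction is legal for a pair: the smaller top disk moves.
        if pegs[x] and (not pegs[y] or pegs[x][-1] < pegs[y][-1]):
            src, dst = x, y
        else:
            src, dst = y, x
        disk = pegs[src].pop()
        pegs[dst].append(disk)
        moves.append((disk, src, dst))

    while len(pegs["C"]) < n:
        for pair in (("A", "B"), ("A", "C"), ("B", "C")):
            if len(pegs["C"]) == n:
                break
            legal_move(*pair)
    return moves

moves = solve_hanoi(10)
print(len(moves))   # 1023, i.e. 2**10 - 1
print(moves[:3])    # [(1, 'A', 'B'), (2, 'A', 'C'), (1, 'B', 'C')]
```

Run it and you get the full 1023-move solution for ten disks.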
There's now constant, nonstop, obnoxious shouting on every channel about how these AI models have solved the Turing test (one wonders just how stupid these "evaluators" were), are at the level of junior devs (LOL), and actually already have "PhD level" reasoning capabilities.
I don't know who is supposed to be fooled -- we have access to these things, we can try them. One can easily knock out the latest GPT-PhD-level-model-of-the-week with a trivial question. Nothing has fundamentally changed about that since GPT-2.
The hype and the observable reality are now so far apart that one really has to wonder: Are people really this gullible? Or do so many people in tech benefit from the hype train that they don't want to rain on the parade?
I could be wrong, but it seems you have misunderstood something here, and you've even quoted the part that you've misunderstood. It isn't that the algorithm for solving the problem isn't known. The LLM knows it, just like you do. It is that the steps of following the algorithm are too verbose if you're just writing them down and trying to keep track of the state of the problem in your head. Could you do that for a large number of disks?
Please do correct me if the misunderstanding is mine.
I feel like practically anybody could solve Tower of Hanoi for any degree of complexity using this algorithm. It’s a four-step process that you just repeat over and over.
That's an algorithm to solve it, but you have to spell out every move at every step, on paper or in your head, while also keeping track of the state of the game entirely in your head. That is what they're challenging the LLM to do.
You're still misunderstanding. If it is easy, please feel free to demonstrate solving it by telling us what the 1300th step is from working it out in your head.
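(For what it's worth, a computer can read that step off directly without replaying anything; below is a minimal Python sketch of the standard bit-trick, with the peg labels and the A-to-C orientation as my own assumptions. Which is rather the point: knowing the algorithm and flawlessly tracking a thousand-plus steps of state in your head are different things.)

```python
def hanoi_move(n, k):
    """Move k (1-indexed) of the optimal n-disk solution from peg A to peg C.
    Returns (disk, from_peg, to_peg); disk 1 is the smallest."""
    assert 1 <= k <= 2 ** n - 1
    disk = (k & -k).bit_length()                 # lowest set bit of k names the disk
    # Each disk cycles through the pegs in a fixed direction depending on n - disk.
    cycle = ("A", "C", "B") if (n - disk) % 2 == 0 else ("A", "B", "C")
    i = ((k >> (disk - 1)) + 1) // 2             # this is that disk's i-th move
    return disk, cycle[(i - 1) % 3], cycle[i % 3]

# A 1300th step needs at least 11 disks; ten disks take only 2**10 - 1 = 1023 moves.
print(hanoi_move(11, 1300))   # (3, 'A', 'C')
```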
Yes, the whole "Towers of Hanoi is a bad test case" objection is a non sequitur here. It would be a significant objection if the machines performed well, but not given the actual outcome: it is as if an alleged chess grandmaster almost always lost against opponents of unexceptional ability.
It is actually worse than that analogy: Towers of Hanoi is a bimodal puzzle, in which players who grasp the general solution do inordinately better than those who do not, and the machines here are performing like the latter.
Lest anyone think otherwise, this is not a case of setting up the machines to fail, any more than the chess analogy would be. The choice of Towers of Hanoi leaves it conceivable that they would do well on tough problems, but that is not very plausible and needs to be demonstrated before it can be assumed.
They set it up to fail the moment they ran it with a large number of disks and assumed the models would just keep going as if they were running the same simple algorithm in a loop, and the moment they set the temperature to 1.
I take your point that the absence of any discussion of the effect of the temperature choice, or of any justification for choosing 1, does seem to be an issue with the paper (unless 1 is so obviously the only rational choice to those working in the field that it needs no justification?)
> Second-graders can follow that, if motivated enough.
Try to motivate them sufficiently to do so without error for a large number of disks, I dare you.
Now repeat this experiment while randomly refusing to accept the answer they're most confident in at any given iteration, picking an answer they're less confident in on their behalf instead, and insisting they still solve it without error.
(To make it equivalent to the researchers running this with temperature set to 1)
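(For anyone unfamiliar with the knob being argued about: at each step the model scores every candidate token, and the sampling temperature controls how often something other than the top-scoring token gets picked. A toy sketch with made-up numbers, not the paper's actual setup:)

```python
import math
import random

def sample(logits, temperature):
    """Sample an index from a toy next-token distribution.
    Temperature near 0 is effectively greedy (always the most confident choice);
    temperature 1 samples from the raw distribution, so lower-ranked tokens
    get picked with their full probability."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]   # softmax weights, shifted for numerical stability
    return random.choices(range(len(logits)), weights=weights)[0]

logits = [3.0, 1.5, 0.5]   # made-up scores for three candidate tokens
greedy_ish = sum(sample(logits, 0.01) == 0 for _ in range(1000))
at_one = sum(sample(logits, 1.0) == 0 for _ in range(1000))
print(greedy_ish, at_one)  # roughly 1000 vs roughly 770 picks of the top token
```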
> obnoxious shouting on every channel about how these AI models have solved the Turing test (one wonders just how stupid these "evaluators" were)
Huh? Schoolteachers and university professors complaining about being unable to distinguish ChatGPT-written essay answers from student-written essay answers is literally ChatGPT passing the Turing test in real time.
No it's not. The traditional interpretation of the Turing test requires interactivity. That is, the evaluator is allowed to ask questions and will receive a response from both a person and a machine. The idea is that there should be no sequence of questions you can ask that would reliably identify the machine. That's not even close to true for these "AI" systems.
You're right about interactivity, something that I overlooked -- but I think it's nevertheless the case that a large fraction of human interrogators could not distinguish a human from a suitably-system-prompted ChatGPT even over the course of an interactive discussion.
ChatGPT 4.5 was judged to be the human 73% of the time in this RCT study, where human interrogators had 5-minute conversations with a human and an LLM: https://arxiv.org/pdf/2503.23674
This is kind of an irrelevant (and doubtless unoriginal) shower thought, but if humans are judging the AI to be human much more often than the actual human, surely that means the AI is not faithfully reproducing human behaviour.
Sure, a non-human's performance "should" be capped at ~50% for a large sample size. I think seeing a much higher percentage, like 73%, indicates systematic error in the interrogator. This -- the fact that humans are not good at detecting genuine human behaviour -- is really a problem in the Turing test itself, but I don't see a good way to solve it.
LLaMa 3.1 with the same prompt "only" managed to be judged human 56% of the time, so perhaps it's actually closer to real human behaviour.
This comes down to the interpretation of the Turing test. Turing's original test actually pitted the two "unknowns" against each other. Put simply, both the human and the computer would try to make you believe they were the person. The objective of the game was to be seen as human, not to be indistinguishable from human.
This is obviously not quite how people understand the Turing test anymore, and I think that confusion of interpretations actually ends up weakening the linked paper. Your thought aptly describes a problem with the paper, but that problem is not present in the Turing test by its original formulation.
It's hard to say what a "bona fide 3-party Turing test" is. The paper even has a section trying to tackle that issue.
I think trying to discuss the minutiae of the rules is a path that leads only to madness. The Turing test was always meant to be a philosophical game. The point was to establish a scenario in which a computer could be indistinguishable from a human. Carrying it out in reality is meaningless, unless you're willing to abandon all intuitive morality.
Quite frankly, I find the paper you linked misguided. If it was undertaken by some college students, then it's good practice, but if it was carried out by seasoned professionals they should find something better to do.
The original Turing game was about testing for a male or female player.
If you want to know more about that, or this research, you could try asking AI for a no-fluff summary.
The Transformer architecture, its algorithm, and the matrix multiplications are a bit more involved. It would be hard to keep those inside your chain-of-thought / working memory and still understand what is going on here.
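(For contrast with the four-line loop quoted above, even a single attention head already looks like this. This is a minimal sketch with made-up dimensions and random weights, not any particular model's implementation; a real LLM stacks many such layers plus feed-forward blocks:)

```python
import numpy as np

def attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over one token sequence.
    x: (seq_len, d_model); the W matrices project to queries, keys and values."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])           # (seq_len, seq_len) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v                                # weighted mix of the value vectors

# Tiny made-up example: 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(attention(x, Wq, Wk, Wv).shape)  # (4, 8)
```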