I'm seriously fed up with all this fact-free AI hype. Whenever an LLM regurgitates training data, it's heralded as the coming of AGI. Whenever it's shown that these models can't solve any novel problem, the research is in bad faith (but please make sure to publish the questions so that the next model version can solve them -- of course completely by chance).
Here's a quote from the article:
> How many humans can sit down and correctly work out a thousand Tower of Hanoi steps? There are definitely many humans who could do this. But there are also many humans who can’t. Do those humans not have the ability to reason? Of course they do! They just don’t have the conscientiousness and patience required to correctly go through a thousand iterations of the algorithm by hand. (Footnote: I would like to sit down all the people who are smugly tweeting about this with a pen and paper and get them to produce every solution step for ten-disk Tower of Hanoi.)
In case someone imagines that fancy recursive reasoning is necessary to solve the Towers of Hanoi, here's the algorithm to move 10 (or any even number of) disks from peg A to peg C:
1. Move one disk from peg A to peg B or vice versa, whichever move is legal.
2. Move one disk from peg A to peg C or vice versa, whichever move is legal.
3. Move one disk from peg B to peg C or vice versa, whichever move is legal.
4. Goto 1.
Second-graders can follow that, if motivated enough.
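If anyone wants to see just how mechanical it is, here is that exact loop as a minimal Python sketch (the peg names and the ten-disk setup are just my illustration, not anything from the paper):

```python
# The quoted loop for an even number of disks: repeatedly make the single
# legal move between A-B, then A-C, then B-C, until everything sits on C.
def solve_hanoi(n=10):
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # A holds disks n..1, largest at the bottom
    moves = []

    def legal_move(x, y):
        # Exactly one direction is legal for a pair: the smaller top disk moves.
        if pegs[x] and (not pegs[y] or pegs[x][-1] < pegs[y][-1]):
            src, dst = x, y
        else:
            src, dst = y, x
        disk = pegs[src].pop()
        pegs[dst].append(disk)
        moves.append((disk, src, dst))

    while len(pegs["C"]) < n:
        for pair in (("A", "B"), ("A", "C"), ("B", "C")):
            if len(pegs["C"]) == n:
                break
            legal_move(*pair)
    return moves

moves = solve_hanoi(10)
print(len(moves))   # 1023, i.e. 2**10 - 1
print(moves[:3])    # [(1, 'A', 'B'), (2, 'A', 'C'), (1, 'B', 'C')]
```

Run it and you get the full 1023-move solution for ten disks.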
There's now constant, nonstop, obnoxious shouting on every channel about how these AI models have solved the Turing test (one wonders just how stupid these "evaluators" were), are at the level of junior devs (LOL), and actually already have "PhD level" reasoning capabilities.
I don't know who is supposed to be fooled -- we have access to these things, we can try them. One can easily knock out the latest GPT-PhD-level-model-of-the-week with a trivial question. Nothing has fundamentally changed about that since GPT-2.
The hype and the observable reality are now so far apart that one really has to wonder: Are people really this gullible? Or do so many people in tech benefit from the hype train that they don't want to rain on the parade?
I could be wrong, but it seems you have misunderstood something here, and you've even quoted the part that you've misunderstood. It isn't that the algorithm for solving the problem isn't known. The LLM knows it, just like you do. It is that the steps of following the algorithm are too verbose if you're just writing them down and trying to keep track of the state of the problem in your head. Could you do that for a large number of disks?
Please do correct me if the misunderstanding is mine.
I feel like practically anybody could solve Tower of Hanoi for any degree of complexity using this algorithm. It’s a four-step process that you just repeat over and over.
That's an algorithm to solve it, but you have to spell out every move at every step, on paper or in your head, while also keeping track of the state of the game entirely in your head. That is what they're challenging the LLM to do.
You're still misunderstanding. If it is easy, please feel free to demonstrate solving it by telling us what the 1300th step is from working it out in your head.
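(For what it's worth, a computer can read that step off directly without replaying anything; below is a minimal Python sketch of the standard bit-trick, with the peg labels and the A-to-C orientation as my own assumptions. Which is rather the point: knowing the algorithm and flawlessly tracking a thousand-plus steps of state in your head are different things.)

```python
def hanoi_move(n, k):
    """Move k (1-indexed) of the optimal n-disk solution from peg A to peg C.
    Returns (disk, from_peg, to_peg); disk 1 is the smallest."""
    assert 1 <= k <= 2 ** n - 1
    disk = (k & -k).bit_length()                 # lowest set bit of k names the disk
    # Each disk cycles through the pegs in a fixed direction depending on n - disk.
    cycle = ("A", "C", "B") if (n - disk) % 2 == 0 else ("A", "B", "C")
    i = ((k >> (disk - 1)) + 1) // 2             # this is that disk's i-th move
    return disk, cycle[(i - 1) % 3], cycle[i % 3]

# A 1300th step needs at least 11 disks; ten disks take only 2**10 - 1 = 1023 moves.
print(hanoi_move(11, 1300))   # (3, 'A', 'C')
```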
Yes, the whole "Towers of Hanoi is a bad test case" objection is a non sequitur here. It would be a significant objection if the machines performed well, but not given the actual outcome: it is as if an alleged chess grandmaster almost always lost against opponents of unexceptional ability.
It is actually worse than that analogy: Towers of Hanoi is a bimodal puzzle, in which players who grasp the general solution do inordinately better than those who do not, and the machines here are performing like the latter.
Lest anyone think otherwise, this is not a case of setting up the machines to fail, any more than the chess analogy would be. The choice of Towers of Hanoi leaves it conceivable that they would do well on tough problems, but that is not very plausible and needs to be demonstrated before it can be assumed.
They set it up to fail the moment they ran it with a large number of disks and assumed the models would just keep going as if they were running the same simple algorithm in a loop, and the moment they set the temperature to 1.
I take your point that the absence of any discussion of the effect of the temperature choice, or of any justification for choosing 1, does seem to be an issue with the paper (unless 1 is so obviously the only rational choice to those working in the field that it needs no justification?)
> Second-graders can follow that, if motivated enough.
Try to motivate them sufficiently to do so without error for a large number of disks, I dare you.
Now repeat this experiment while randomly refusing to accept the answer they're most confident in at any given iteration, picking an answer they're less confident in on their behalf instead, and insisting they still solve it without error.
(To make it equivalent to the researchers running this with temperature set to 1)
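(For anyone unfamiliar with the knob being argued about: at each step the model scores every candidate token, and the sampling temperature controls how often something other than the top-scoring token gets picked. A toy sketch with made-up numbers, not the paper's actual setup:)

```python
import math
import random

def sample(logits, temperature):
    """Sample an index from a toy next-token distribution.
    Temperature near 0 is effectively greedy (always the most confident choice);
    temperature 1 samples from the raw distribution, so lower-ranked tokens
    get picked with their full probability."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]   # softmax weights, shifted for numerical stability
    return random.choices(range(len(logits)), weights=weights)[0]

logits = [3.0, 1.5, 0.5]   # made-up scores for three candidate tokens
greedy_ish = sum(sample(logits, 0.01) == 0 for _ in range(1000))
at_one = sum(sample(logits, 1.0) == 0 for _ in range(1000))
print(greedy_ish, at_one)  # roughly 1000 vs roughly 770 picks of the top token
```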
> obnoxious shouting on every channel about how these AI models have solved the Turing test (one wonders just how stupid these "evaluators" were)
Huh? Schoolteachers and university professors complaining about being unable to distinguish ChatGPT-written essay answers from student-written essay answers is literally ChatGPT passing the Turing test in real time.
No it's not. The traditional interpretation of the Turing test requires interactivity. That is, the evaluator is allowed to ask questions and will receive a response from both a person and a machine. The idea is that there should be no sequence of questions you can ask that would reliably identify the machine. That's not even close to true for these "AI" systems.
You're right about interactivity, something that I overlooked -- but I think it's nevertheless the case that a large fraction of human interrogators could not distinguish a human from a suitably-system-prompted ChatGPT even over the course of an interactive discussion.
ChatGPT 4.5 was judged to be the human 73% of the time in this RCT study, where human interrogators had 5-minute conversations with a human and an LLM: https://arxiv.org/pdf/2503.23674
This is kind of an irrelevant (and doubtless unoriginal) shower thought, but if humans are judging the AI to be human much more often than the actual human, surely that means the AI is not faithfully reproducing human behaviour.
Sure, a non-human's performance "should" be capped at ~50% for a large sample size. I think seeing a much higher percentage, like 73%, indicates systematic error in the interrogator. This -- the fact that humans are not good at detecting genuine human behaviour -- is really a problem in the Turing test itself, but I don't see a good way to solve it.
LLaMa 3.1 with the same prompt "only" managed to be judged human 56% of the time, so perhaps it's actually closer to real human behaviour.
This comes down to the interpretation of the Turing test. Turing's original test actually pitted the two "unknowns" against each other. Put simply, both the human and the computer would try to make you believe they were the person. The objective of the game was to be seen as human, not to be indistinguishable from human.
This is obviously not quite how people understand the Turing test anymore, and I think that confusion of interpretations actually ends up weakening the linked paper. Your thought aptly describes a problem with the paper, but that problem is not present in the Turing test by its original formulation.
It's hard to say what a "bona fide 3-party Turing test" is. The paper even has a section trying to tackle that issue.
I think trying to discuss the minutiae of the rules is a path that leads only to madness. The Turing test was always meant to be a philosophical game. The point was to establish a scenario in which a computer could be indistinguishable from a human. Carrying it out in reality is meaningless, unless you're willing to abandon all intuitive morality.
Quite frankly, I find the paper you linked misguided. If it was undertaken by some college students, then it's good practice, but if it was carried out by seasoned professionals they should find something better to do.
The original Turing game was about testing for a male or female player.
If you want to know more about that, or this research, you could try asking AI for a no-fluff summary.
The Transformer architecture, its algorithm, and the matrix multiplications are a bit more involved. It would be hard to keep those inside your chain-of-thought / working memory and still understand what is going on here.
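(For contrast with the four-line loop quoted above, even a single attention head already looks like this. This is a minimal sketch with made-up dimensions and random weights, not any particular model's implementation; a real LLM stacks many such layers plus feed-forward blocks:)

```python
import numpy as np

def attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over one token sequence.
    x: (seq_len, d_model); the W matrices project to queries, keys and values."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])           # (seq_len, seq_len) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v                                # weighted mix of the value vectors

# Tiny made-up example: 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(attention(x, Wq, Wk, Wv).shape)  # (4, 8)
```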