We are pretty certain that humans can reason, yet they are sometimes wrong, even if you give them the same problem over and over again with slight variations.
LLMs get things wrong for different reasons than humans do (humans lose focus; LLMs have randomness applied when sampling their responses to improve results). But clearly we have to set the bar somewhat below 100% if we want a test that doesn't conclude that humans are incapable of reasoning.
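To make that parenthetical concrete, here's a minimal sketch of temperature sampling, the step where the randomness enters. The names and numbers are illustrative, not any particular model's API:

```python
import numpy as np

rng = np.random.default_rng()

def sample_next_token(logits, temperature=0.8):
    """Pick the next token id from raw model logits, with temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)  # stochastic draw: same prompt, different runs

# Two calls on identical logits can return different tokens; as temperature
# approaches 0 this collapses toward greedy argmax and the variation disappears.
print(sample_next_token([2.0, 1.5, 0.3]), sample_next_token([2.0, 1.5, 0.3]))
```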
The difference is we _know_ that LLMs are fancy stochastic models; we don't know that they're capable of reasoning, and the null hypothesis is that they're not, because we know what they _are_ - we built them. Any "reasoning" is an emergent property of the system, not something we built them to do. In that case, evidence that they're not reasoning - evidence that they're stochastic parrots putting on a performance of reasoning - weighs heavier, because a performance of reasoning fits what we already know they can do, whereas genuine reasoning would be something new to the model.
There are deeper philosophical questions about what reasoning actually _is_, and LLMs have made those questions sharper: they've shown it's clearly possible for a complex statistical model to generate words that look like reasoning. The question is whether there's a difference between what they're doing and what humans are doing, and evidence that they're _not_ reasoning - evidence that they're just generating words in specific orders - weighs heavily against them.
We haven't coded LLMs to be stochastic models; we coded them to predict text with whatever method gradient descent finds on a transformer architecture. That's not exactly the same.
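For what it's worth, the only objective we actually specify amounts to something like the sketch below (PyTorch-style, where `model`, `tokens`, and `optimizer` are placeholder names rather than anything from a real training stack): we ask for next-token prediction and nothing else, and whatever internal strategy drives that loss down is what gradient descent finds.

```python
import torch
import torch.nn.functional as F

def training_step(model, tokens, optimizer):
    """One gradient-descent step on the next-token prediction objective.

    Assumes `model` maps (batch, seq) token ids to (batch, seq, vocab) logits
    and `tokens` is a batch of token ids; both are illustrative placeholders.
    """
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict each token from its prefix
    logits = model(inputs)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()   # we never specify *how* to predict, only that the loss should drop
    optimizer.step()
    return loss.item()
```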
But more importantly, if you want to show that LLMs can't reason, you obviously have to use a test that, when applied to humans, would show that humans can reason. Otherwise your test isn't testing reasoning but something stricter.