Sure, a non-human's pass rate "should" be capped at ~50% over a large enough sample, since the best an indistinguishable machine can force is a coin flip between two equally convincing candidates. Seeing a much higher figure, like 73%, indicates systematic error in the interrogators: they are reliably misjudging the real humans. That humans are poor at recognising genuine human behaviour is really a problem with the Turing test itself, but I don't see a good way to solve it.
LLaMa 3.1 with the same prompt "only" managed to be judged human 56% of the time, so perhaps it's actually closer to real human behaviour.
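As a rough sanity check, a binomial test shows how differently the two figures sit relative to the 50% chance line. This is just a sketch: `n` is a hypothetical per-model trial count, not the study's actual sample size.

```python
# Sketch: test each pass rate against the 50% chance baseline.
# n is a hypothetical number of interrogations per model, not the study's figure.
from scipy.stats import binomtest

n = 300  # hypothetical trial count
for rate in (0.73, 0.56):
    wins = round(rate * n)  # times the model was judged human
    p = binomtest(wins, n, p=0.5, alternative="greater").pvalue
    print(f"{rate:.0%}: {wins}/{n} judged human, p = {p:.3g}")
```

With n in the low hundreds, 73% is astronomically unlikely under chance guessing, while 56% is only about two standard errors above it, which is consistent with the idea that the 56% model is the one behaving more like a real human.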