I think this is a good joke on the naysayers. But if the author is here, I'd like a clarification: is the user picking the next token or the next word? Because if it's the latter, I think this test is invalid.
The language model generating the candidate answers generates tokens until a full word is produced. The language models picking their answer choose the completion that results in the lowest perplexity, independently of how each candidate happens to be tokenized.
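A minimal sketch of that selection step, assuming Hugging Face transformers; the model name, prompt, and candidate words are illustrative, not from the thread. Each candidate is scored by its per-token perplexity given the prompt, so candidates that tokenize into different numbers of tokens stay comparable:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(prompt: str, completion: str) -> float:
    """Per-token perplexity of `completion` given `prompt`."""
    # Caveat: tokenizing prompt + completion together can merge tokens
    # across the boundary; fine for a sketch, not for exact scoring.
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-prob of each token given everything before it.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_lp = log_probs[torch.arange(len(targets)), targets]
    # Keep only the completion's tokens, average, and exponentiate.
    return torch.exp(-token_lp[n_prompt - 1:].mean()).item()

prompt = "The capital of France is"
candidates = [" Paris", " London", " pizza"]  # full-word candidates
best = min(candidates, key=lambda c: perplexity(prompt, c))
print(best)
```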
I'd say the test is still not quite valid; it sits somewhere between the original "valid" task and the "guess what an LLM would say" task suggested in another comment here. The reason: it might be easier for LLMs to choose a completion from their own generated variants (1) than from the real token distribution.
1. Perhaps even from variants generated by other LLMs.