I think this is a good joke on the naysayers. But if the author is here, I'd like a clarification: is the user picking the next token or the next word? Because if it's the latter, I think this test is invalid.
The language model generating the candidate answers generates tokens until a full word is produced. The language models picking their answer choose the completion that results in the lowest perplexity, independently of how each candidate happens to be tokenized.
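A minimal sketch of that selection step, assuming Hugging Face transformers; the model name, prompt, and candidate words are illustrative, not from the thread. Each candidate is scored by its per-token perplexity given the prompt, so candidates that tokenize into different numbers of tokens stay comparable:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(prompt: str, completion: str) -> float:
    """Per-token perplexity of `completion` given `prompt`."""
    # Caveat: tokenizing prompt + completion together can merge tokens
    # across the boundary; fine for a sketch, not for exact scoring.
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-prob of each token given everything before it.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_lp = log_probs[torch.arange(len(targets)), targets]
    # Keep only the completion's tokens, average, and exponentiate.
    return torch.exp(-token_lp[n_prompt - 1:].mean()).item()

prompt = "The capital of France is"
candidates = [" Paris", " London", " pizza"]  # full-word candidates
best = min(candidates, key=lambda c: perplexity(prompt, c))
print(best)
```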
I'd say the test is still not quite valid; it sits somewhere between the original "valid" task and the "guess what an LLM would say" task suggested in another comment here. The reason: it might be easier for LLMs to choose a completion from their own generated variants (1) than from the real token distribution.
1. Perhaps even from variants generated by other LLMs.