They attribute these 'compression artefacts' to pre-training, and they also cite the original snowballing paper, How Language Model Hallucinations Can Snowball: https://arxiv.org/pdf/2305.13534
They further state that reasoning is no panacea.
Whilst you did say:
"the models mitigate more and more"
You were replying to my comment which said:
"'Bad' generations early in the output sequence are somewhat mitigatable by injecting self-reflection tokens like 'wait', or with more sophisticated test-time compute techniques."
So our statements there are logically compatible, i.e. you didn't make a statement that contradicts what I said.
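To make concrete what I meant by injecting self-reflection tokens, here is a rough sketch (my own illustration, not from either paper) of the simplest version: decode a short draft, splice in a cue like "wait", and let the model continue so it can revisit an early bad generation. The model name and prompt are placeholders, and real test-time compute techniques are considerably more involved:

    # Rough sketch of "wait" injection; model name and prompt are placeholders.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # any causal LM would do
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    prompt = "How many r's are in 'strawberry'? Think step by step."
    ids = tok(prompt, return_tensors="pt").input_ids

    # First pass: a short, possibly flawed draft.
    draft = model.generate(ids, max_new_tokens=64, do_sample=False)
    draft_text = tok.decode(draft[0], skip_special_tokens=True)

    # Inject a self-reflection cue and continue decoding from the spliced context.
    spliced = tok(draft_text + "\nWait, let me re-check that.",
                  return_tensors="pt").input_ids
    final = model.generate(spliced, max_new_tokens=128, do_sample=False)
    print(tok.decode(final[0], skip_special_tokens=True))
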
"Our error analysis is general yet has specific implications for hallucination. It applies broadly, including to reasoning and search-and-retrieval language models, and the analysis does not rely on properties of next-word prediction or Transformer-based neural networks."
"Search (and reasoning) are not panaceas. A number of studies have shown how language models augmented with search or Retrieval-Augmented Generation (RAG) reduce hallucinations (Lewis et al., 2020; Shuster et al., 2021; Nakano et al., 2021; Zhang and Zhang, 2025). However, Observation 1 holds for arbitrary language models, including those with RAG. In particular, the binary grading system itself still rewards guessing whenever search fails to yield a confident answer. Moreover, search may not help with miscalculations such as in the letter-counting example, or other intrinsic hallucinations"
https://cdn.openai.com/pdf/d04913be-3f6f-4d2b-b283-ff432ef4a...
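Their point about binary grading rewarding guessing is easy to see with a toy expected-score calculation (my own numbers, not from the paper): under 0/1 grading, any non-zero chance of a lucky guess beats the guaranteed zero for abstaining, so the incentive to guess survives even when search comes back empty.

    # Toy expected-score calculation; p_correct_if_guess is an assumed number.
    p_correct_if_guess = 0.2   # chance a blind guess happens to be right

    score_guess = p_correct_if_guess * 1 + (1 - p_correct_if_guess) * 0
    score_abstain = 0          # "I don't know" earns nothing under 0/1 grading

    print(score_guess > score_abstain)  # True for any p_correct_if_guess > 0
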