They attribute these 'compression artefacts' to pre-training, and they also cite the original snowballing paper, How Language Model Hallucinations Can Snowball: https://arxiv.org/pdf/2305.13534
They further state that reasoning is no panacea.
Whilst you did say:
"the models mitigate more and more"
You were replying to my comment which said:
"'Bad' generations early in the output sequence are somewhat mitigatable by injecting self-reflection tokens like 'wait', or with more sophisticated test-time compute techniques."
So our statements there are logically compatible, i.e. you didn't make a statement that contradicts what I said.
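To make concrete what I meant by injecting self-reflection tokens, here is a rough sketch (my own illustration, not from either paper) of the simplest version: decode a short draft, splice in a cue like "wait", and let the model continue so it can revisit an early bad generation. The model name and prompt are placeholders, and real test-time compute techniques are considerably more involved:

    # Rough sketch of "wait" injection; model name and prompt are placeholders.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # any causal LM would do
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    prompt = "How many r's are in 'strawberry'? Think step by step."
    ids = tok(prompt, return_tensors="pt").input_ids

    # First pass: a short, possibly flawed draft.
    draft = model.generate(ids, max_new_tokens=64, do_sample=False)
    draft_text = tok.decode(draft[0], skip_special_tokens=True)

    # Inject a self-reflection cue and continue decoding from the spliced context.
    spliced = tok(draft_text + "\nWait, let me re-check that.",
                  return_tensors="pt").input_ids
    final = model.generate(spliced, max_new_tokens=128, do_sample=False)
    print(tok.decode(final[0], skip_special_tokens=True))
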
"Our error analysis is general yet has specific implications for hallucination. It applies broadly, including to reasoning and search-and-retrieval language models, and the analysis does not rely on properties of next-word prediction or Transformer-based neural networks."
"Search (and reasoning) are not panaceas. A number of studies have shown how language models augmented with search or Retrieval-Augmented Generation (RAG) reduce hallucinations (Lewis et al., 2020; Shuster et al., 2021; Nakano et al., 2021; Zhang and Zhang, 2025). However, Observation 1 holds for arbitrary language models, including those with RAG. In particular, the binary grading system itself still rewards guessing whenever search fails to yield a confident answer. Moreover, search may not help with miscalculations such as in the letter-counting example, or other intrinsic hallucinations"
https://cdn.openai.com/pdf/d04913be-3f6f-4d2b-b283-ff432ef4a...
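Their point about binary grading rewarding guessing is easy to see with a toy expected-score calculation (my own numbers, not from the paper): under 0/1 grading, any non-zero chance of a lucky guess beats the guaranteed zero for abstaining, so the incentive to guess survives even when search comes back empty.

    # Toy expected-score calculation; p_correct_if_guess is an assumed number.
    p_correct_if_guess = 0.2   # chance a blind guess happens to be right

    score_guess = p_correct_if_guess * 1 + (1 - p_correct_if_guess) * 0
    score_abstain = 0          # "I don't know" earns nothing under 0/1 grading

    print(score_guess > score_abstain)  # True for any p_correct_if_guess > 0
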