One approach does use this. You can have an LLM explicitly check its own answers by emitting thinking tokens before the final answer, give it a reward when the final answer is right, and update the model directly on that reward signal. That’s part of how DeepSeek R1 was trained. It’s better but not perfect, because the thinking process is itself imperfect. Ultimately the LLM still might not know what it doesn’t know.
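To make the reward idea concrete, here is a toy sketch of rewarding only the final answer and nudging the model toward whatever behaviour earned the reward. It is a plain REINFORCE-style update on a two-option "policy", not the actual R1 training recipe. The success rates in `P_CORRECT`, the learning rate, and the function names are all made up for illustration.

```python
import math
import random

# Assumed, illustrative success rates (not measured numbers):
# action 0 = answer immediately, action 1 = "think" and check first.
P_CORRECT = [0.4, 0.8]


def softmax(logits):
    """Turn raw preferences into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=5000, lr=0.1, seed=0):
    """Toy policy-gradient loop with a reward only for correct answers."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]   # start with no preference
    baseline = 0.0        # running average reward, reduces variance

    for _ in range(steps):
        probs = softmax(logits)
        action = 0 if rng.random() < probs[0] else 1

        # Outcome reward: 1 if the final answer is right, else 0.
        reward = 1.0 if rng.random() < P_CORRECT[action] else 0.0

        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)

        # REINFORCE: gradient of log pi(action) w.r.t. each logit.
        for a in range(2):
            grad = (1.0 if a == action else 0.0) - probs[a]
            logits[a] += lr * advantage * grad

    return softmax(logits)


if __name__ == "__main__":
    p_direct, p_think = train()
    print(f"P(answer directly) = {p_direct:.3f}")
    print(f"P(think first)     = {p_think:.3f}")
```

The policy drifts toward "think first" because that option gets rewarded more often, but in this toy setup thinking is still wrong 20% of the time, and the reward never tells the model which of its checked answers are the wrong ones.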