A priori, getting an LLM to recognise equivocal evidence is an interesting question. Shedloads of p-hacked data could outweigh the one crucial study that says we don't know. So it comes down to things like modelling citation depth, trust, and reputation.
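Something like this toy sketch is what I have in mind — every field and number here is made up, purely to illustrate that a trust-weighted aggregate can let a single high-reputation "we don't know" hold its own against a pile of shallow results:

```python
from dataclasses import dataclass

@dataclass
class Source:
    claim: str           # "yes", "no", or "unknown"
    trust: float         # hypothetical 0..1 reputation score
    citation_depth: int  # how far down the citation chain it sits

def weight(s: Source) -> float:
    # Hypothetical rule: reputation discounted by citation depth, so a
    # pile of shallow p-hacked results decays in influence.
    return s.trust / (1 + s.citation_depth)

def aggregate(sources: list[Source]) -> dict[str, float]:
    totals: dict[str, float] = {}
    for s in sources:
        totals[s.claim] = totals.get(s.claim, 0.0) + weight(s)
    norm = sum(totals.values()) or 1.0
    return {claim: w / norm for claim, w in totals.items()}

# Twenty shallow "yes" papers vs one well-placed "we don't know".
sources = [Source("yes", trust=0.2, citation_depth=3) for _ in range(20)]
sources.append(Source("unknown", trust=0.9, citation_depth=0))
print(aggregate(sources))  # the lone "unknown" keeps ~47% of the mass
```

The point of the depth discount is exactly the p-hacking worry: volume alone stops being enough once weight decays down the citation chain.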
I would worry that well-written flat-earth inputs would weigh the same as the simple physics "that's wrong", and then you'd get "what do we know?" as a false signal alongside the necessary "we just don't know" true signals.
Maybe the test is how well an LLM equivocates on questions where we're highly certain the answer is unknown, like "is anybody out there?", rather than "do masks work?", which is a bit of a hot mess.
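If anything, that suggests a toy eval. Everything below is made up — the question list, the hedging keywords, the canned answers — and substring matching is obviously too crude for real use, but it shows the shape of the test:

```python
# Toy probe: does the model hedge where it should and commit where it
# shouldn't? ask_model is a stand-in; swap in a real inference call.
def ask_model(prompt: str) -> str:
    canned = {
        "Is anybody out there?": "Honestly, we just don't know.",
        "Is the Earth flat?": "No, that's wrong; it's an oblate spheroid.",
    }
    return canned.get(prompt, "we don't know")

# Each question is paired with whether equivocation is the right answer.
PROBES = [
    ("Is anybody out there?", True),   # genuinely open question
    ("Is the Earth flat?", False),     # settled; the model should refute
]

HEDGES = ("don't know", "uncertain", "no consensus", "unclear", "unknown")

def equivocates(answer: str) -> bool:
    # Crude proxy: hedging language anywhere in the answer.
    return any(h in answer.lower() for h in HEDGES)

def score() -> float:
    hits = sum(equivocates(ask_model(q)) == hedge for q, hedge in PROBES)
    return hits / len(PROBES)

print(score())  # 1.0 with the canned answers above
```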