A real user will be worse … but that’s kinda the point.
The most valuable thing you learn in usability research is not whether your experience works, but how it will be misinterpreted, abused, and bent to do things it was never designed for.
You seem to disagree. Here's an interesting study in which researchers used an OpenAI-LLM-based tool to grade student papers; grading the same papers 10 times in a row produced vastly different results:
Quote: "The results reveal significant shortcomings: The tool’s numerical grades and qualitative feedback are often random and do not improve even when its suggestions are incorporated."
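The study's repeated-grading check is simple to replicate in spirit: score the same paper several times and look at the spread. Here's a minimal sketch of that measurement; `grade_paper` is a hypothetical stand-in (a real check would call the actual grading tool), so only the variance bookkeeping is meant literally:

```python
import random
import statistics

def grade_paper(text: str, seed: int) -> float:
    """Hypothetical stand-in for one call to an LLM grading tool.

    Simulates a noisy grader so the variability measurement itself
    is runnable; a real check would query the tool instead.
    """
    rng = random.Random(hash(text) ^ seed)
    return round(rng.uniform(1.0, 6.0), 1)  # assume a 1-6 grade scale

def grade_variability(text: str, runs: int = 10) -> tuple[float, float]:
    """Grade the same paper `runs` times; report mean and run-to-run stdev."""
    grades = [grade_paper(text, seed=i) for i in range(runs)]
    return statistics.mean(grades), statistics.stdev(grades)

mean, spread = grade_variability("An essay on usability testing.")
print(f"mean grade {mean:.2f}, run-to-run stdev {spread:.2f}")
```

If the stdev is a meaningful fraction of the grade scale, as the study found, the individual grades carry little signal.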