Depends on how you’re using the data. There’s a pretty strong correctness signal in the user behavior.
Did they rephrase the question? Probably the first answer was wrong. Did the session end? Good chance the answer was acceptable. Did they ask follow-ups? What kind? Etc.
I'm used to doing the same task 4 or 5 times (different sessions, similar prompts), and most of the time the result is useless or completely wrong. Sometimes I go back and pick the first result, other time none of them, other time a mix of them. I'm wondering how can they extract value from this.
Did they rephrase the question? Probably the first answer was wrong. Did the session end? Good chance the answer was acceptable. Did they ask follow-ups? What kind? Etc.