Watson correctly knew 84% of the answers (26 of 31). It actually answered 74% (23 of 31).
It didn’t know the correct response or was not confident enough to answer 16% of the time (5 of 31). Nobody at all could give an answer to 6% of all questions (2 of 31).
If the two questions we know nobody could answer are excluded, Watson knew the right answer 90% of the time (26 of 29).
Do individual Jeopardy contestants correctly know more than 84% of the answers? Since two questions were answered correctly by no one, 94% (29 of 31) is already the upper bound for human performance in this round, and that’s not very far from 84%.
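For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the percentages from the counts quoted above. The variable names and the `pct` helper are my own; only the counts come from the comment.

```python
# Counts for this round, as quoted in the comment above.
total = 31            # clues in the round
watson_knew = 26      # clues where Watson had the correct response
watson_answered = 23  # clues Watson actually answered correctly
triple_stumpers = 2   # clues nobody answered correctly

def pct(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)

stats = {
    "watson_knew": pct(watson_knew, total),
    "watson_answered": pct(watson_answered, total),
    "watson_missed": pct(total - watson_knew, total),
    "nobody_answered": pct(triple_stumpers, total),
    # Same as watson_knew, but ignoring the clues nobody could answer.
    "watson_knew_excl_stumpers": pct(watson_knew, total - triple_stumpers),
    # No human can have known a clue that everyone missed.
    "human_upper_bound": pct(total - triple_stumpers, total),
}

for name, value in stats.items():
    print(f"{name}: {value}%")
```

Note that 29 of 31 is about 93.5%, so the human upper bound comes out at 94% when rounded to the nearest percent.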
Looking at the six rounds of Jeopardy that the J! Archive has from games with Ken Jennings and Brad Rutter [0], we see that the average upper bound in those six rounds was 87%, or 26 questions (of 30). The minimum was 23 and the maximum 28 (of 30). That’s, again, very much an upper bound.
Looking at all this, I’m pretty confident that Watson would be doing well even if it had human reaction times. (This is how I would change the game: Within 200 ms or so after the buzzers open, or whatever the human reaction time on a Jeopardy buzzer is, the player to answer is randomly selected from all who managed to buzz in. After those 200 ms everything stays as it is now. Players are also not punished for buzzing in too early. This is a minimal change to the game, not a completely different game, which I think is important.)
Maybe a simpler version of your proposed rule change: Buzzing in before the buzzers are activated is just treated as buzzing in at the exact moment the buzzers are activated. (And, as you say, break ties randomly.)
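To make the two rule proposals concrete, here is a rough sketch of how the responder could be picked. Everything in it is hypothetical: the function name `pick_responder`, the `window_ms` parameter, and the convention that buzz times are milliseconds relative to the buzzers opening (negative meaning too early). Setting `window_ms=0` gives the simpler version, where early buzzes count as buzzing at the moment of activation and ties are broken randomly.

```python
import random

def pick_responder(buzz_times_ms, window_ms=200, rng=random):
    """Pick who gets to answer under the proposed rule.

    buzz_times_ms maps player name -> buzz time in ms relative to the
    buzzers opening (negative = buzzed early, None = did not buzz).
    """
    # Early buzzes are not punished: treat them as buzzing at time 0.
    clamped = {p: max(t, 0) for p, t in buzz_times_ms.items() if t is not None}
    if not clamped:
        return None  # nobody buzzed in

    # Everyone who buzzed within the window has an equal chance.
    in_window = sorted(p for p, t in clamped.items() if t <= window_ms)
    if in_window:
        return rng.choice(in_window)

    # After the window, the usual rule applies: earliest buzz wins,
    # with exact ties broken randomly.
    earliest = min(clamped.values())
    tied = sorted(p for p, t in clamped.items() if t == earliest)
    return rng.choice(tied)

# Watson at 10 ms and Ken at 150 ms are both inside the 200 ms window,
# so either may be chosen; Brad at 400 ms is too late.
print(pick_responder({"Watson": 10, "Ken": 150, "Brad": 400}))
```

With the 200 ms window, a machine that always buzzes within a few milliseconds no longer wins every race against a human who reacts within normal reaction time.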
With humans, you don't know how many correct responses they knew. You only know how many triple stumpers there were and how many incorrect responses a human gave. When two or three human champions are playing, it's unlikely for one of them to buzz in first with the consistency that Watson was able to. Hypothetically, if two humans A and B both knew the same 95% of correct responses, you would probably see something like A buzzing in 30% of the time and B 70%. You couldn't possibly determine how many correct responses either A or B knew.
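The hypothetical can be worked through in a few lines. This is only a sketch under strong assumptions that I am adding: A and B know exactly the same clues, always buzz when they know the response, and never answer incorrectly. The function name `observed_shares` is made up for the illustration.

```python
def observed_shares(knowledge, buzz_win_a):
    """Share of all clues each player is seen answering.

    knowledge  -- fraction of clues both players know (assumed identical)
    buzz_win_a -- fraction of buzzer races that A wins against B
    """
    share_a = knowledge * buzz_win_a
    share_b = knowledge * (1 - buzz_win_a)
    stumpers = 1 - knowledge  # clues neither player knows
    return share_a, share_b, stumpers

a, b, stumpers = observed_shares(0.95, 0.30)
print(f"A answers {a:.1%}, B answers {b:.1%}, triple stumpers {stumpers:.1%}")
```

From the outside you only see A answering a bit under 30% of the clues. That is a lower bound on what A knew, and nothing in the game record distinguishes an A who knew 95% from an A who knew far less.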
I didn’t try to find that out; it is, as you say, impossible. Triple stumpers set an upper bound for the humans, the theoretical maximum. That’s what I calculated. About three triple stumpers per round seems to be the norm, even among the best, which puts the upper bound at 90% (27 of 30). The humans might still be worse than 90%, but they are definitely not better (when there are three triple stumpers).
[0] http://www.j-archive.com/showplayer.php?player_id=7206