Don't fit a least-squares line; that doesn't help. In the first plot you'll see that, in general, if the scores are normalized:
x < mean => y > x
x > mean => y < x
Regression to the mean is that most people move towards the mean in subsequent games/attempts/whatever.
But I fail to see what has changed in the analysis -- a and b are both just supposed to be samples from the same distribution, right?
Not at all. b is not independent of a; that's the whole point of regression to the mean. If you take ordered pairs where there is no connection between a and b, then you won't get any regression to the mean, you'll get points essentially randomly placed on the plane.
Unfortunately what is really tripping me up is when we draw that line. I get the principle of regression to the mean, and why it occurs. What I don't get is this particular manifestation of it.
Fair point about "no connection between a and b".
What I should have said was something like: Why is it important that a come before b chronologically? If we were mistaken, and we thought that b came first, then what we would be seeing is "progression from the mean".
Does the concept of regression to the mean depend on the chronology of events? That would be weird -- most probability doesn't, right?
And I guess I made an assumption about the situation you described. If the students were all answering in identical random ways, then you'd see what I describe. I think this part of the Wikipedia article describes it well:
"Consider a simple example: a class of students takes a 100-item true/false test on a subject. Suppose that all students choose randomly on all questions. Then, each student’s score would be a realization of one of a set of i.i.d. random variables, with a mean of 50. Naturally, some students will score substantially above 50 and some substantially below 50 just by chance. If one takes only the top scoring 10% of the students and gives them a second test on which they again guess on all items, the mean score would again be expected to be close to 50. Thus the mean of these students would “regress” all the way back to the mean of all students who took the original test. No matter what a student scores on the original test, the best prediction of their score on second test is 50."
So, to your question, why is time important? It's important in the sense that you need the first test to determine who your "high flyers" are for the second experiment.
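For anyone who wants to check the quoted example numerically, here is a rough sketch of it in Python. The setup follows the quote (100 true/false items, everyone guessing, top 10% retested); the number of students, the seed, and the function names `guess_score` and `retest_top_scorers` are my own assumptions, not anything from the article.

```python
import random
import statistics


def guess_score(rng, items=100):
    """Score of one student who guesses on every true/false item."""
    return sum(rng.random() < 0.5 for _ in range(items))


def retest_top_scorers(students=5000, fraction=0.1, seed=0):
    """Give everyone a first test, then retest only the top scorers.

    Returns (mean first-test score of the top group,
             mean second-test score of that same group).
    The class size and seed are arbitrary choices for illustration.
    """
    rng = random.Random(seed)
    first = [guess_score(rng) for _ in range(students)]

    # The first test is what identifies the "high flyers".
    order = sorted(range(students), key=lambda i: first[i], reverse=True)
    top = order[: int(students * fraction)]

    # The same students guess again on a second test.
    second = {i: guess_score(rng) for i in top}

    top_first_mean = statistics.fmean(first[i] for i in top)
    top_second_mean = statistics.fmean(second.values())
    return top_first_mean, top_second_mean


if __name__ == "__main__":
    first_mean, second_mean = retest_top_scorers()
    print(f"top 10% on test 1: {first_mean:.1f}")
    print(f"same students on test 2: {second_mean:.1f}")
```

The top group is well above 50 on the first test only because it was selected on that test; on the second test the same students should land back near 50.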
"If you take ordered pairs where there is no connection between a and b, then you won't get any regression to the mean, you'll get points essentially randomly placed on the plane."
This is exactly reversed. If A and B are perfectly correlated, then you will have no regression to the mean. If they are perfectly independent, then you will have full regression to the mean. If they are only partially correlated, then you will have only partial regression to the mean.
(This is easy to see if you run a simulation of each case.)
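A minimal sketch of such a simulation, under my own assumptions: a and b are standard normal scores with a chosen correlation rho, and we look at the pairs with the top 10% of a values. The names `simulate_pairs` and `top_group_means` are made up for this example.

```python
import math
import random
import statistics


def simulate_pairs(rho, n=20000, seed=0):
    """Draw n pairs (a, b) of standard normal scores with correlation rho."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        a = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # b shares a fraction rho of a; the rest is independent noise.
        b = rho * a + math.sqrt(1.0 - rho * rho) * noise
        pairs.append((a, b))
    return pairs


def top_group_means(pairs, fraction=0.1):
    """Mean of a and mean of b among the pairs with the highest a."""
    ranked = sorted(pairs, key=lambda p: p[0], reverse=True)
    top = ranked[: max(1, int(len(ranked) * fraction))]
    mean_a = statistics.fmean(a for a, _ in top)
    mean_b = statistics.fmean(b for _, b in top)
    return mean_a, mean_b


if __name__ == "__main__":
    for rho in (1.0, 0.5, 0.0):
        mean_a, mean_b = top_group_means(simulate_pairs(rho))
        print(f"rho={rho:.1f}  top-10% mean a={mean_a:.2f}  mean b={mean_b:.2f}")
```

With rho = 1 the top group's mean b equals its mean a (no regression), with rho = 0 its mean b falls back to roughly the overall mean of 0 (full regression), and with rho = 0.5 it lands about halfway between.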