To be honest, I like that this article tries to perform simple analyses, but I find the rationale behind them pretty confusing.
This kind of data is commonly modeled using item response theory (IRT). I suspect that even in data generated by a unidimensional IRT model (which they are arguing against), you might get the results they report, depending on the level of measurement error in the model.
Measurement error is the key here, but it is not considered in the article. That, plus setting an unjustified margin of 20% around the average, is very strange. An analogous situation would be criticizing a simple regression by looking at how many points fall X units above/below the fitted line, without explaining your choice of X.
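To make that concrete, here's a minimal sketch (mine, not the article's analysis) of the kind of check I have in mind: simulate responses from a unidimensional 2PL model and see how far per-topic scores drift from each student's overall average purely through response noise. The parameter values, the topic grouping, and the 20% threshold just mirror the article's setup and are illustrative, not a claim about the real data:

```python
# Sketch: simulate binary responses from a unidimensional 2PL IRT model,
# group items into arbitrary "topics", and count how often a per-topic score
# lands more than 20 percentage points from that student's overall average.
# All parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_students, n_topics, items_per_topic = 500, 8, 5
n_items = n_topics * items_per_topic

theta = rng.normal(0, 1, n_students)      # single latent ability
a = rng.lognormal(0, 0.3, n_items)        # item discriminations
b = rng.normal(0, 1, n_items)             # item difficulties

p = 1 / (1 + np.exp(-a * (theta[:, None] - b)))   # 2PL response probabilities
responses = rng.binomial(1, p)                    # observed 0/1 scores

overall = responses.mean(axis=1)                  # overall proportion correct
by_topic = responses.reshape(n_students, n_topics, items_per_topic).mean(axis=2)

far = np.abs(by_topic - overall[:, None]) > 0.20
print("student-topic scores >20% from the student's average:",
      f"{far.mean():.0%}")
```

Even with a single latent ability, the binomial noise on small groups of items puts a sizeable fraction of per-topic scores well away from the overall average, which is why the 20% margin needs a justification.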
Totally agree that this is not a fully rigorous analysis, and we do want to dig deeper and try to extend some IRT models to these types of questions.
The main point of this post is to highlight that the most common metric of student performance may not be that useful. Most of the time, students will get their score, the average score, and sometimes a standard deviation as well. As jimhefferon mentioned in a response to a different comment, the conventional wisdom is that two students with the same grade know roughly the same stuff, and that seems not to be true.
We're hoping to build some tools here to help instructors give students a better experience by helping them cater to the different groups that are present.
disclaimer: I'm one of the founders of Gradescope.
I agree with your point, that the average likely misses important factors (and think the tagging you guys are implementing looks really cool!).
However, I'd say that the issue is more than having a non-rigorous analysis; it's the wrong analysis for the question your article tries to answer. In the language often used in the analysis of tests, your analyses are essentially examining reliability (how much students' scores vary across test items due to "noise") rather than validity (e.g. how many underlying skills were tested). Or rather, they don't try to separate the two, so they can't support clear conclusions.
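To make the reliability side concrete, here's a small sketch (again mine, not the article's) of one conventional reliability estimate, Cronbach's alpha, computed on a students-by-items 0/1 response matrix; the simulated data are purely illustrative:

```python
# Sketch: Cronbach's alpha as a reliability estimate for a 0/1 response
# matrix (rows = students, columns = items). Reliability quantifies
# item-level "noise", which is a separate question from how many skills
# the test measures (validity).
import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    """Cronbach's alpha for a students-by-items score matrix."""
    k = responses.shape[1]
    item_vars = responses.var(axis=0, ddof=1)        # per-item variances
    total_var = responses.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Illustrative data only: 200 students, 30 items, one latent ability
rng = np.random.default_rng(1)
ability = rng.normal(0, 1, 200)
difficulty = rng.normal(0, 1, 30)
responses = rng.binomial(1, 1 / (1 + np.exp(-(ability[:, None] - difficulty))))
print("alpha:", round(cronbach_alpha(responses), 2))
```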
I am definitely with you in terms of the goal of the article, and there is a rich history in psychology examining your question (but they do not use the analyses in the article for the reasons above).
You brought a smile to my face. I came here to post this same point.
The piece makes a pretty fundamental measurement mistake: it assumes that all variability is meaningful variability.
There are ways of making the argument they're trying to make, but they're not doing that.
Also, sometimes a single overall score is useful. A better analogy than the cockpit analogy they use is clothing sizing. Yes, tailored shirts, based on detailed measurements of all your body parts, fit awesome, but for many people, small, medium, large, x-large, and so forth suffice.
I think there's a lesson here about reinventing the wheel.
I appreciate the goals of the company and wish them the best, but they need a psychometrician or assessment psychologist on board.
I do agree that applying psychometrics would be great, but it's not as simple as it sounds -- the vast majority of the work is on multiple-choice questions or binary correct/incorrect scoring. There is some work on free response, but much less.
We aren't trying to make a rigorous statement here -- we're trying to draw attention to the fact that the most common metrics do not give much insight into what a student has actually shown mastery of. This is especially important when you consider that the weightings of particular questions are often fairly arbitrary.
I certainly agree that not all variability is meaningful variability, but I'd push back a bit and say that there is meaningful variability in what's shown here. We'll go into more depth and hopefully have something interesting to report.
I've also seen a fair number of comments stating that this is not a surprising result. I'd agree (if you've thought about it), but if you look at what's happening in practice, it's clear that many people either would be surprised by this or are at least unable to act on it. We're hoping to help with the latter.
IRT modeling doesn't care much whether an item is free response or not, just the scale on which it's scored. Binary and polytomous scoring = IRT model. Continuous scoring = Factor analysis.
If by mentioning free response you mean that students are unlikely to guess the correct answer when they don't know it, that just calls for a 2-parameter IRT model rather than a 3-parameter one.
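For concreteness, here's a quick sketch of the item response functions involved (the parameter values are just illustrative). The 3PL adds a lower asymptote c for guessing, and setting c = 0 recovers the 2PL:

```python
# Sketch of the 2PL vs 3PL item response functions. The 3PL adds a lower
# asymptote c (the "guessing" parameter); free-response items are usually
# treated as having c = 0, which reduces the 3PL to the 2PL.
import numpy as np

def irf_3pl(theta, a, b, c=0.0):
    """P(correct | ability theta) for an item with discrimination a,
    difficulty b, and guessing parameter c. c = 0 gives the 2PL."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)
print("2PL:", np.round(irf_3pl(theta, a=1.2, b=0.0), 3))
print("3PL:", np.round(irf_3pl(theta, a=1.2, b=0.0, c=0.25), 3))
```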