I agree with your point that the average likely misses important factors (and I think the tagging you guys are implementing looks really cool!).
However, I'd say that the issue is more than having a non-rigorous analysis. It's the wrong analysis for the question your article tries to answer. In the language often used in the analysis of tests, your analyses are essentially examining reliability (how much students' scores vary across test items due to "noise"), rather than validity (e.g. how many underlying skills the test measured). Or rather, they don't try to separate the two, so they cannot support clear conclusions.
I am definitely with you in terms of the goal of the article, and there is a rich history in psychology of examining your question (but that work does not use the analyses in the article, for the reasons above).