This is exactly what we think is a fairly common attitude -- thanks for stating it so clearly! It has ramifications both within a single class and when you think about how prerequisite and dependent classes are structured.
How do you think it could be done differently? Students need to be judged to decide who moves ahead. That is, I have people in Calc I and I have to decide who moves on to Calc II. I can't send the next instructor a poset of their competence. I cannot require that everyone be competent at everything. I wonder what your proposal is?
I do agree that applying psychometrics would be great, but it's not as simple as it sounds -- the vast majority of work is on multiple choice questions, or binary correct/incorrect. There is some on free response, but much less.
We aren't trying to make a rigorous statement here -- we're trying to draw attention to the fact that the most common metrics do not give much insight into what a student has actually shown mastery of. This is especially important when you consider that the weightings of particular questions are often fairly arbitrary.
I certainly agree that not all variability is meaningful variability, but I'd push back a bit and say that there is meaningful variability in what's shown here. We'll go into more depth and hopefully have something interesting to report.
I've also seen a fair number of comments stating that this is not a surprising result. I'd agree (if you've thought about it), but if you look at what's happening in practice, it's clear that many people either would be surprised by this or are at least unable to act on it. We're hoping to help with the latter.
IRT modeling doesn't care much whether an item is free response or not, just the scale on which it's scored. Binary and polytomous scoring = IRT model. Continuous scoring = Factor analysis.
If by mentioning free response you mean that students are unlikely to guess the correct answer when they don't know it, then it's a 2-parameter IRT model rather than a 3-parameter one.
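For anyone unfamiliar, the standard forms look roughly like this (this is textbook IRT, nothing specific to our data):

    % 3PL: probability that a student with ability \theta gets item i right,
    % with discrimination a_i, difficulty b_i, and guessing parameter c_i
    \[ P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-a_i(\theta - b_i)}} \]
    % 2PL: the special case c_i = 0, i.e. no credit from guessing,
    % which is why free response usually doesn't need the third parameter
    \[ P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}} \]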
Totally agree that this is not a fully rigorous analysis, and we do want to dig deeper and try to extend some IRT models to these types of questions.
The main point of this post is to highlight that the most common metric of student performance may not be that useful. Most of the time, students will get their score, the average score, and sometimes a standard deviation as well. As jimhefferon mentioned in a response to a different comment, the conventional wisdom is that two students with the same grade know roughly the same stuff, and that seems not to be true.
We're hoping to build some tools here to help instructors give students a better experience by helping them cater to the different groups that are present.
disclaimer: I'm one of the founders of Gradescope.
I agree with your point that the average likely misses important factors (and I think the tagging you guys are implementing looks really cool!).
However, I'd say that the issue is more than having a non-rigorous analysis. It's the wrong analysis for the question your article tries to answer. In the language often used in the analysis of tests, your analyses essentially examine reliability (how much students' scores vary across different test items due to "noise") rather than validity (e.g. how many underlying skills the test actually measures). Or rather, they don't try to separate the two, so they can't support clear conclusions.
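To make the distinction concrete, a standard reliability coefficient like Cronbach's alpha (shown here purely as an illustration; it isn't computed in the article) only summarizes how consistently the items covary:

    % Cronbach's alpha for a k-item test: sigma_i^2 is the variance of item i,
    % sigma_X^2 is the variance of the total score
    \[ \alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right) \]
    % A high alpha means consistent scoring; it says nothing about how many
    % distinct skills the items measure -- that is the validity question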
I am definitely with you on the goal of the article, and there is a rich history in psychology of examining your question (though not with the analyses used in the article, for the reasons above).
I'm one of the founders of Gradescope (the company in the other post).
I see where you're coming from, but I actually think that both of these posts describe ways for students to get more open ended assessment, rather than automatically graded multiple choice questions (i.e. what's typical in MOOCs and Scantrons).
I agree with that, but would you say that the "trend", if there is one, is towards providing that through AI, or more humans?
If I recall correctly, they've had computers grade the GMAT essays for at least 10 years, but they keep a human in the loop, since human judgment is the ultimate measure of whether a computer is "correct" in grading an essay.
The trend is towards neither, I'd say. The biggest trend has been towards online automatically graded multiple choice & short answer questions that aren't really open ended.
The automated essay grading stuff typically looks at writing style more than content, but it's true that it's a problem that tons of people have worked on and there's been some cool progress there as well. We're not really working on bringing AI to essay grading ourselves though.
Assessment in education is broken -- instructors spend hours grading, and yet don't get a clear picture of what their students are struggling with. Gradescope lets instructors give out the same paper-based assignments they've always used, but then grade them online, while keeping track of the exact mistakes made by every student on every question. This enables unprecedented data analytics: we can reveal which concepts a student needs help with, or which questions are too difficult. To top it off, instructors finish grading in half the time. We're now applying computer vision to help instructors grade even faster.
Our product has been used to grade over 10 million questions belonging to over 100,000 students. We recently raised a seed round, and are hiring a senior full-stack engineer to join our team of 7. We offer market-rate salary with generous equity. We've got a Rails backend with some React on the frontend.
If you’re interested, please email jobs@gradescope.com.
Assessment in education is broken -- instructors spend hours grading, and yet don't get a clear picture of what their students are struggling with. Gradescope lets instructors give out the same paper-based assignments they've always used, but then grade them online, while keeping track of the exact mistakes made by every student on every question. This enables unprecedented data analytics: we can reveal which concepts a student needs help with, or which questions are too difficult. To top it off, instructors finish grading in half the time.
Our product has been used to grade over 8 million questions belonging to over 100,000 students. We recently raised a seed round, and are hiring a full-stack engineer to join our team of 6. We offer market-rate salary with generous equity. We've got a Rails backend with some React on the frontend.
If you’re interested, please email jobs@gradescope.com.
Assessment in education is broken -- instructors spend hours grading, and yet don't get a clear picture of what their students are struggling with. Gradescope lets instructors give out the same paper-based assignments they've always used, but then grade them online, while keeping track of the exact mistakes made by every student on every question. This enables unprecedented data analytics: we can reveal which concepts a student needs help with, or which questions are too difficult. To top it off, instructors finish grading in half the time.
Our product has been used to grade over 3.5 million pages of work belonging to over 30,000 students. We’ve raised a seed round, and are making our first full-time engineering hire to join the founding team of two PhDs and a professor from Berkeley CS. Over the next few months, we’re looking to expand our user base and roll out advanced features including autograding, additional analytics, and more. We offer market-rate salary with generous equity.
We’re currently looking for a full-stack engineer. We've got a Rails backend with some React on the frontend.
If you’re interested, please email jobs@gradescope.com
Gradescope lets instructors give out the same paper-based assignments they've always used, but then grade them online, while keeping track of the exact mistakes made by every student on every question. This enables unprecedented data analytics: for example, we can reveal which concepts a student needs help with, or which questions are too difficult. To top it off, instructors finish grading in half the time.
Our product has been used to grade over 3.5 million pages of work belonging to over 30,000 students. We’ve raised a seed round, and are making our first full-time engineering hire to join the founding team of two PhDs and a professor from Berkeley CS. Over the next few months, we’re looking to expand our user base and roll out advanced features including autograding, analytics, and more. We offer market-rate salary with generous equity.
We’re currently looking for a full-stack engineer. We've got a Rails backend with some React on the frontend.
If you’re interested, please email jobs@gradescope.com