I agree with the general thrust of this, but it's worth noting that the author does _slightly_ better than is typical for LLM-based analyses: they released the dataset of book-labeled posts. You can at least estimate the false-positive rate from that, by manually checking a random sample of the labels. (You can't estimate the false-negative rate, though, since the posts the LLM _didn't_ flag weren't released.)
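To make the sampling idea concrete, here's a minimal sketch of the kind of spot-check I mean. Everything here is a placeholder (the dataset format, the sample size of 100, and the error count of 7 are all made up); the point is just that a hand-review of a random sample plus a binomial confidence interval gets you a defensible false-positive estimate:

```python
import math
import random

def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = errors / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - margin), min(1.0, center + margin))

# Placeholder: in practice you'd load the author's released dataset of
# book-labeled posts here.
flagged_posts = [f"post_{i}" for i in range(5000)]

# Draw a random sample and review each post by hand.
sample = random.sample(flagged_posts, k=100)

# Hypothetical outcome: suppose manual review finds 7 mislabeled posts.
false_positives = 7

lo, hi = wilson_interval(false_positives, len(sample))
print(f"Estimated FP rate: {false_positives / len(sample):.1%} "
      f"(95% CI: {lo:.1%}-{hi:.1%})")
```

A hundred posts is enough to bound the false-positive rate to within a few percentage points, which is usually all you need to decide whether the headline numbers are trustworthy.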
Ideally, authors would do some validation of their own at the LLM-labeling step and present it alongside the results, but that rarely happens with these sorts of posts. I think that's pretty telling.