
I looked at the GitHub README for the project, which says

> Uses machine learning to classify Twitter accounts as bots or not bots. The default model is 93.53% accurate when classifying bots and 95.32% accurate when classifying non-bots. The fast model is 91.78% accurate when classifying bots and 92.61% accurate when classifying non-bots.

> Overall, the default model is correct 93.8% of the time.

> Overall, the fast model is correct 91.9% of the time.

How was this accuracy measured? There is no information explaining how these figures were arrived at, or what the caveats are.



From the code (https://github.com/mkearney/botrnot/blob/master/R/features.R), the percentages are just # correct / # total, which makes sense.

However, those percentages are computed on the training set; there is no held-out test/validation set, which is a problem when bespoke text data is used as features.
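
For concreteness, a figure like that maps onto a confusion table in the obvious way. This is only a sketch; pred and truth are hypothetical 0/1 vectors for the same accounts the model was fit on, not objects from the package:

    # Hypothetical in-sample predictions vs. labels (0 = not bot, 1 = bot)
    tab <- table(predicted = pred, actual = truth)

    acc_bots    <- tab["1", "1"] / sum(tab[, "1"])  # "accurate when classifying bots"
    acc_nonbots <- tab["0", "0"] / sum(tab[, "0"])  # "accurate when classifying non-bots"
    overall     <- sum(diag(tab)) / sum(tab)        # the overall "# correct / # total"

That is one plausible reading of the README's three numbers, but since pred comes from the same rows the model was trained on, all three are optimistic.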


Do we know that no cross validation was done? I couldn't figure out what the dataset is or where it came from just by looking at the repo.


k-fold cross-validation was done (the cv.folds parameter to gbm), but that doesn’t help when the model overfits like hell.
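
Right, and for anyone not familiar with gbm: cv.folds only drives the cross-validated error curve used to pick the number of boosting iterations; nothing about it changes which rows the reported accuracy is computed on. A rough sketch (the data frame train_df and the bot column are made up):

    library(gbm)

    # cv.folds = 3 makes gbm() fit the model on each fold to estimate a
    # cross-validated deviance curve...
    fit <- gbm(bot ~ ., data = train_df, distribution = "bernoulli",
               n.trees = 1000, cv.folds = 3)

    # ...but its only built-in use is choosing the iteration count:
    best_iter <- gbm.perf(fit, method = "cv", plot.it = FALSE)

    # Nothing stops you from then scoring the rows you trained on, which is
    # how a training-set accuracy number comes about.
    probs <- predict(fit, newdata = train_df, n.trees = best_iter, type = "response")
    mean(ifelse(probs > 0.5, 1, 0) == train_df$bot)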


I agree this is probably what happened. From the estimation object I see a total of around 3,500 training samples and around 90 predictors. 3-fold cross-validation was done, so at each iteration only around 2,150 samples were used for training... And it seems none of the samples were set aside for an out-of-sample check of the reported accuracy.
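
The cheap fix at that sample size is to carve off a test set before fitting anything and quote accuracy on rows the model never saw. Sketch only; bots_df and the bot column are stand-ins, not the package's actual objects:

    library(gbm)

    set.seed(1)
    n        <- nrow(bots_df)                     # ~3,500 accounts here
    test_idx <- sample(n, size = round(0.2 * n))  # hold out ~20%
    train_df <- bots_df[-test_idx, ]
    test_df  <- bots_df[test_idx, ]

    fit <- gbm(bot ~ ., data = train_df, distribution = "bernoulli",
               n.trees = 1000, cv.folds = 3)
    best_iter <- gbm.perf(fit, method = "cv", plot.it = FALSE)

    # Out-of-sample accuracy on the untouched 20%: the number worth quoting
    probs <- predict(fit, newdata = test_df, n.trees = best_iter, type = "response")
    mean(ifelse(probs > 0.5, 1, 0) == test_df$bot)

With around 90 predictors and only a few thousand rows, the gap between that number and the in-sample one is exactly the overfitting being discussed here.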



