
I looked at the GitHub README for the project, which says

> Uses machine learning to classify Twitter accounts as bots or not bots. The default model is 93.53% accurate when classifying bots and 95.32% accurate when classifying non-bots. The fast model is 91.78% accurate when classifying bots and 92.61% accurate when classifying non-bots.

> Overall, the default model is correct 93.8% of the time.

> Overall, the fast model is correct 91.9% of the time.

How was this accuracy measured? There is no information explaining how these figures were arrived at, or what the caveats are.



From the code (https://github.com/mkearney/botrnot/blob/master/R/features.R), the percentages are just # correct / # total, which makes sense.

However, those percentages are computed on the training set; there is no held-out test/validation set, which is a problem when bespoke text data is used as features.
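
For concreteness, a figure like that maps onto a confusion table in the obvious way. This is only a sketch; pred and truth are hypothetical 0/1 vectors for the same accounts the model was fit on, not objects from the package:

    # Hypothetical in-sample predictions vs. labels (0 = not bot, 1 = bot)
    tab <- table(predicted = pred, actual = truth)

    acc_bots    <- tab["1", "1"] / sum(tab[, "1"])  # "accurate when classifying bots"
    acc_nonbots <- tab["0", "0"] / sum(tab[, "0"])  # "accurate when classifying non-bots"
    overall     <- sum(diag(tab)) / sum(tab)        # the overall "# correct / # total"

That is one plausible reading of the README's three numbers, but since pred comes from the same rows the model was trained on, all three are optimistic.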


Do we know that no cross validation was done? I couldn't figure out what the dataset is or where it came from just by looking at the repo.


k-fold cross-validation was done (the cv.folds parameter to gbm), but that doesn’t help when the model overfits like hell.
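
Right, and for anyone not familiar with gbm: cv.folds only drives the cross-validated error curve used to pick the number of boosting iterations; nothing about it changes which rows the reported accuracy is computed on. A rough sketch (the data frame train_df and the bot column are made up):

    library(gbm)

    # cv.folds = 3 makes gbm() fit the model on each fold to estimate a
    # cross-validated deviance curve...
    fit <- gbm(bot ~ ., data = train_df, distribution = "bernoulli",
               n.trees = 1000, cv.folds = 3)

    # ...but its only built-in use is choosing the iteration count:
    best_iter <- gbm.perf(fit, method = "cv", plot.it = FALSE)

    # Nothing stops you from then scoring the rows you trained on, which is
    # how a training-set accuracy number comes about.
    probs <- predict(fit, newdata = train_df, n.trees = best_iter, type = "response")
    mean(ifelse(probs > 0.5, 1, 0) == train_df$bot)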


I agree this is probably what happened. From the estimation object I see a total of around 3,500 training samples and around 90 predictors. 3-fold cross-validation was done, so at each iteration only around 2,150 samples were used for training... And it seems none of the samples were set aside for an out-of-sample check of the reported accuracy.
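
The cheap fix at that sample size is to carve off a test set before fitting anything and quote accuracy on rows the model never saw. Sketch only; bots_df and the bot column are stand-ins, not the package's actual objects:

    library(gbm)

    set.seed(1)
    n        <- nrow(bots_df)                     # ~3,500 accounts here
    test_idx <- sample(n, size = round(0.2 * n))  # hold out ~20%
    train_df <- bots_df[-test_idx, ]
    test_df  <- bots_df[test_idx, ]

    fit <- gbm(bot ~ ., data = train_df, distribution = "bernoulli",
               n.trees = 1000, cv.folds = 3)
    best_iter <- gbm.perf(fit, method = "cv", plot.it = FALSE)

    # Out-of-sample accuracy on the untouched 20%: the number worth quoting
    probs <- predict(fit, newdata = test_df, n.trees = best_iter, type = "response")
    mean(ifelse(probs > 0.5, 1, 0) == test_df$bot)

With around 90 predictors and only a few thousand rows, the gap between that number and the in-sample one is exactly the overfitting being discussed here.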



