> Is it common practice to use the most frequent words as features ? It looks like they don't carry much information, by definition.
The common practice with a small-ish dataset is to use, e.g., the top 10k or 20k most frequent words, but to filter out the 50-100 or so most frequent ones, since those indeed carry little information. A commonly used weighting scheme is TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf), which comes included in Keras.
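A minimal sketch of what that looks like with the Keras `Tokenizer` (the placeholder emails and the exact cutoffs are made up, and dropping entries from `word_index` is just one way to skip the most frequent words):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

SKIP_TOP = 100     # drop the ~100 most frequent words
NUM_WORDS = 10000  # keep the next 10k by frequency

# Placeholder emails; substitute your own list of raw strings.
train_texts = ["free money now", "meeting moved to noon", "lunch tomorrow?"]

tok = Tokenizer(num_words=SKIP_TOP + NUM_WORDS)
tok.fit_on_texts(train_texts)

# word_index ranks words by frequency (1 = most frequent), so removing
# the lowest ranks drops the most common words. Their columns simply
# stay zero in the output matrix, which is wasteful but harmless here.
tok.word_index = {w: i for w, i in tok.word_index.items() if i > SKIP_TOP}

# One TF-IDF-weighted bag-of-words row per email.
x_train = tok.texts_to_matrix(train_texts, mode="tfidf")
```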
Anyway, this is a cool ML starter project. Keras makes it really easy to do this sort of fast experimentation with a range of different neural network models.
As a side note, in the first version of this project I tried something significantly fancier: taking the most frequent words in each category of the training set that also accounted for a large fraction of that word's occurrences across all the emails (i.e., high-count words for which at least 10-25% of occurrences came from a single category). It seems like something like this should beat just taking the most common words, but either the parameters I chose or my code itself was no good. Then again, taking a lot of the most frequent words and running feature selection on them probably yields fairly similar features...
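For the curious, here's a rough sketch of that heuristic (the thresholds and the naive whitespace tokenization are illustrative, not what the original project used):

```python
from collections import Counter

def category_indicative_words(texts, labels, min_count=50, min_share=0.25):
    """Pick words that are frequent overall and concentrated in one category.

    `min_count` and `min_share` are made-up thresholds for illustration.
    """
    total = Counter()
    per_category = {}
    for text, label in zip(texts, labels):
        words = text.lower().split()  # naive whitespace tokenization
        total.update(words)
        per_category.setdefault(label, Counter()).update(words)

    selected = set()
    for counts in per_category.values():
        for word, n in counts.items():
            # Keep high-count words where this category accounts for a
            # large share of all occurrences of the word.
            if total[word] >= min_count and n / total[word] >= min_share:
                selected.add(word)
    return selected
```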