> Is it common practice to use the most frequent words as features ? It looks like they don't carry much information, by definition.
The common practice with a small-ish dataset is to use, e.g., the top 10k or 20k most frequent words, but to filter out the 50-100 or so most frequent ones, since those indeed carry little information. A commonly used weighting scheme is TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf), which comes included in Keras.
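A minimal sketch of what that looks like with the Keras `Tokenizer` (the placeholder emails and the exact cutoffs are made up, and dropping entries from `word_index` is just one way to skip the most frequent words):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

SKIP_TOP = 100     # drop the ~100 most frequent words
NUM_WORDS = 10000  # keep the next 10k by frequency

# Placeholder emails; substitute your own list of raw strings.
train_texts = ["free money now", "meeting moved to noon", "lunch tomorrow?"]

tok = Tokenizer(num_words=SKIP_TOP + NUM_WORDS)
tok.fit_on_texts(train_texts)

# word_index ranks words by frequency (1 = most frequent), so removing
# the lowest ranks drops the most common words. Their columns simply
# stay zero in the output matrix, which is wasteful but harmless here.
tok.word_index = {w: i for w, i in tok.word_index.items() if i > SKIP_TOP}

# One TF-IDF-weighted bag-of-words row per email.
x_train = tok.texts_to_matrix(train_texts, mode="tfidf")
```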
Anyway, this is a cool ML starter project. Keras makes it really easy to do this sort of fast experimentation with a range of different neural network models.
As a side note, in the first version of this project I tried something significantly fancier: taking the most frequent words in each category of the training set that also accounted for a large fraction of that word's occurrences across all the emails (i.e., high-count words for which at least 10-25% of occurrences came from a single category). It seems like something like this should beat just taking the most common words, but either the parameters I chose or my code itself was no good. Then again, taking a lot of the most frequent words and running feature selection on them probably yields fairly similar features...
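For the curious, here's a rough sketch of that heuristic (the thresholds and the naive whitespace tokenization are illustrative, not what the original project used):

```python
from collections import Counter

def category_indicative_words(texts, labels, min_count=50, min_share=0.25):
    """Pick words that are frequent overall and concentrated in one category.

    `min_count` and `min_share` are made-up thresholds for illustration.
    """
    total = Counter()
    per_category = {}
    for text, label in zip(texts, labels):
        words = text.lower().split()  # naive whitespace tokenization
        total.update(words)
        per_category.setdefault(label, Counter()).update(words)

    selected = set()
    for counts in per_category.values():
        for word, n in counts.items():
            # Keep high-count words where this category accounts for a
            # large share of all occurrences of the word.
            if total[word] >= min_count and n / total[word] >= min_share:
                selected.add(word)
    return selected
```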