Organizing My Emails with a Neural Net (andreykurenkov.com)
147 points by andreyk on Jan 23, 2016 | 32 comments



POPFile is worth looking at if you found this interesting.

It is a general-purpose naïve Bayesian email classifier that you can integrate with almost any email system.

They took some of the concepts in the article mentioned here and expanded on them a bit.

For example, they have the idea of "pseudowords"[1] so that you're working with more than just the words in the email. Take html:td, for example: it expands to the number of HTML table cells in an email, which might help with choosing a bucket.

[1] http://getpopfile.org/docs/faq:pseudowords
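As a rough illustration of the concept (just a sketch, not POPFile's actual Perl implementation; the function and regexes here are made up):

  import re

  def tokenize_with_pseudowords(raw_email):
      # Ordinary word tokens
      tokens = re.findall(r'[a-z]+', raw_email.lower())
      # Add one html:td pseudoword per HTML table cell in the message
      tokens += ['html:td'] * len(re.findall(r'<td\b', raw_email, re.IGNORECASE))
      return tokens

  print(tokenize_with_pseudowords('Hi <table><tr><td>1</td><td>2</td></tr></table>'))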


Thanks for mentioning POPFile, good to hear my old code mentioned on Hacker News. I wrote the original code for POPFile in Visual Basic(!) in 2001 (it was called AutoFile back then), as I wrote it initially for my own use with Microsoft Outlook. I got the idea of using Naive Bayes for my email classification after seeing the ifile extension to exmh(1); ifile dates to 1996.

I rewrote POPFile in Perl and made it work with generic POP3 servers (hence the name) and later with IMAP and more. It is still actively maintained by a group of people I've never met(!) with the last release in December: http://getpopfile.org/

(1) http://qwone.com/~jason/papers/ifile00.pdf


Nice, thanks for the link. I did find a few research papers and the like on this topic, but none of those seemed to have very interesting approaches/results. This one looks quite interesting, though. It would be fun to compare POPFile to my simple approach on a publicly available dataset (such as Enron), perhaps.


CRM114 is an interesting mail-sorting tool, although I don't think it has been updated in years:

http://crm114.sourceforge.net/


Orthogonal sparse bigrams are another interesting tokenization method; the authors of this paper tested them against CRM114 with good results:

http://www.siefkes.net/ie/winnow-spam.pdf


Orthogonal sparse bigrams (OSBs) have been implemented in CRM114 since the publication of that paper.


This is really awesome-- I work at Nylas and would love to turn this into a plugin for N1 (https://nylas.com/n1). If the author's hanging out in this thread, please email me :)


As a Nylas user...please do!


Done! :)


Is it common practice to use the most frequent words as features? It looks like they don't carry much information, by definition. As a first naive approach, I'd rank the words by the inverse of how many categories they appear in, factor in the overall frequency with a weight or something more clever, and then take the top N.

Well, thinking more about it leads me to tf-idf and naive Bayes (of course), at which point you pretty much already have a classifier. So it seems feature selection is learning in itself and defines the maximum accuracy you'll be able to reach? This is borderline philosophical, but I'd love to read more about these matters. Pointers welcome!
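For what it's worth, a rough sketch of that ranking on toy data (the emails and category labels here are made up):

  from collections import Counter, defaultdict

  # Toy labelled corpus: (category, email text) pairs
  emails = [('financial', 'your statement is ready'),
            ('financial', 'payment received for your invoice'),
            ('personal',  'are you free for dinner on friday'),
            ('academic',  'the paper review is due friday')]

  freq = Counter()
  cats = defaultdict(set)
  for cat, text in emails:
      for word in text.split():
          freq[word] += 1
          cats[word].add(cat)

  # Overall frequency weighted by the inverse of the number of categories
  # a word appears in; keep the top N as features
  N = 5
  score = {w: freq[w] / len(cats[w]) for w in freq}
  print(sorted(score, key=score.get, reverse=True)[:N])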


> Is it common practice to use the most frequent words as features? It looks like they don't carry much information, by definition.

The common practice with a small-ish dataset is to use e.g. the top 10k or 20k most frequent words, but filter out the top 50-100 or so most frequent words, as those indeed do not carry much information. A commonly used weighting scheme is TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf), which comes included in Keras.
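A minimal sketch of that setup with the Keras Tokenizer (the two emails are placeholders, and zeroing out the most frequent columns is just one way to drop them):

  from keras.preprocessing.text import Tokenizer

  texts = ['are we still on for lunch friday?', 'your invoice is attached']

  # Keep the 20k most frequent words, weighted by TF-IDF
  tokenizer = Tokenizer(num_words=20000)
  tokenizer.fit_on_texts(texts)
  X = tokenizer.texts_to_matrix(texts, mode='tfidf')

  # The Tokenizer indexes words by frequency rank (1 = most frequent),
  # so zeroing columns 1-100 drops the ~100 most common words
  X[:, 1:101] = 0.0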

Anyway, this is a cool ML starter project. Keras makes it really easy to do this sort of fast experimentation with a range of different neural network models.


As a side note, early on (in the first version of this project) I tried doing something significantly fancier: taking the most frequent words in each category of the training set that also accounted for a large fraction of that word's occurrences across all the emails (so words with high counts, of which at least 10-25% came from a single category). It seems like something like this should be better than just taking the most common words, but either the parameters I chose or my code itself was no good. Just taking a lot of the most frequent words and doing feature selection probably results in fairly similar features, though...


This is very impressive in itself, but what is the practical use? I mean, you probably could get (almost) the same result with much less work. For instance, "Academic" is every mail sent from an .edu domain, "financial" is everything that comes from your bank, "personal" mails come from a sender that is in your personal address book, and so on.

In short, there are "good enough" rules that require much less processing.
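Something along these lines, say (the domains and address book here are made up):

  def rule_label(sender, address_book):
      domain = sender.split('@')[-1].lower()
      if domain.endswith('.edu'):
          return 'Academic'
      if domain in {'mybank.com', 'paypal.com'}:   # whatever your bank uses
          return 'Financial'
      if sender in address_book:
          return 'Personal'
      return 'Unsorted'

  print(rule_label('prof@example.edu', {'friend@gmail.com'}))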


> I would really like it if gmail indeed had such a machine-learned approach to suggesting a category for each email for one-click email organizing

Yes, but I guess you need to train it first. And if you have been bad at categorizing in the first place, you will start with bad training data.


That confusion matrix makes Edward Tufte so very, very sad.

How on earth am I to tell from that visualization how often it mislabels financial emails as personal? Eyedropper the colors and hope the values in the shades of blue follow a linear scale?


As someone who regularly encounters these types of diagrams, I guess that I'm too close to the issue, because I'm not seeing the problem. To answer your question, financial e-mails are more likely to be mislabelled as personal than as professional, but not as often as it labels them correctly.

A cursory glance at the central diagonal tells me that Finance, Personal/Programming, Professional/EPFL, and Group work are the categories of e-mail most likely to be categorized incorrectly by the software. Looking at the columns tells me that Academic and Personal are going to have the most misfiled messages in them.


> To answer your question, financial e-mails are more likely to be mislabelled as personal than as professional, but not as often as it labels them correctly.

Yes, but how likely is each mislabelling? There is no scale to indicate how the colors map onto probabilities.

As well as providing a scale, it would be helpful to make the heatmap an annotated heatmap, in which each square is labelled with the corresponding value (perhaps with values below some threshold omitted to reduce clutter).

Example: https://web.stanford.edu/~mwaskom/software/seaborn/_images/s...

(from the Seaborn docs: https://web.stanford.edu/~mwaskom/software/seaborn/generated...)
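Something as simple as this would already answer those questions (the numbers here are made up, obviously):

  import numpy as np
  import seaborn as sns
  import matplotlib.pyplot as plt

  # Hypothetical 3-class confusion matrix, just to show the annotations
  cm = np.array([[120, 10,  5],
                 [ 15, 90,  8],
                 [  3,  7, 60]])
  labels = ['Financial', 'Personal', 'Professional']

  sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
              xticklabels=labels, yticklabels=labels)
  plt.xlabel('Predicted label')
  plt.ylabel('True label')
  plt.show()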

Edit: Consider another question: if an email has the true label 'Financial', is it more likely to be mislabelled or correctly labelled? I can guess, but without more knowledge of the scale I can't be certain.


If this were a formal paper, and if the code weren't freely available, I'd agree with you.

As it is, the specifics of how it mislabels his e-mail are not all that interesting to me, as I don't have access to his e-mail, so the specific numbers are pretty much irrelevant.


I agree with you that it would be saner to have a scale, but as others commented, I included it more to give a rough picture of the results than as something you can draw a lot of specific conclusions from. To be honest, for some reason the scale showed up incorrectly, right in the middle of the matrix, so I just decided to omit it and leave it intuitively clear that dark blue = a lot.


>I would really like it if gmail indeed had such a machine-learned approach to suggesting a category for each email for one-click email organizing

I don't think you can create custom categories, but Google's inbox.google.com does this.


Why not just use all the words?


There are two main reasons to try to pare it down: processing time, and the fact that if you manage to pare it down to the words that are actually relevant, you reduce the chance of over-fitting to features that are actually irrelevant (e.g. differences in the frequency of words like "is" are quite likely to be irrelevant; but of course you do this kind of filtering at your peril - what might seem irrelevant could also turn out to be highly significant in context, so it's hard to get right).


In my experience it's usually best to start with all the words. If you use a decent implementation that supports sparse vectors it's no problem, certainly not for these sorts of data sizes.

You'll usually end up with a frequency threshold, but it's best to trim at the very low end - like words occurring 5 or fewer times. Further over-fitting can be controlled with regularization and parameter averaging.
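With scikit-learn, for example, that kind of cutoff is a single parameter (toy emails below; with real mail you'd set min_df to 5 or so):

  from sklearn.feature_extraction.text import CountVectorizer

  texts = ['lunch on friday?', 'lunch next friday works',
           'your invoice is attached', 'invoice for last month attached']

  # min_df counts documents, so this drops words seen in only one email;
  # the result is a SciPy sparse matrix, cheap even for huge vocabularies
  vectorizer = CountVectorizer(min_df=2)
  X = vectorizer.fit_transform(texts)
  print(sorted(vectorizer.vocabulary_))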


This was my reasoning as well - at some point using more words does not help (but takes more time), and with the long-tail distribution of words you could potentially go up to hundreds of thousands of features and overfit badly.


Glad you came to the conclusion that I always seem to come to: it's all about the features. Engineer great features and you'll get great results; use bad features and you'll get bad results.


I see his conclusion is that deep learning doesn't work here, but it might be interesting to incorporate word2vec features into this and see if the performance doesn't increase (or at least hover around 94%).


I ran one experiment with an Embedding layer (the second one in the Deep Learning Is No Good Here section), which I assumed to be analogous to word2vec. I hoped it would help as well, so I was rather disappointed that it did not seem to. Perhaps I should have tried a few more configurations, though.


An Embedding layer by itself will try to learn word vectors from the data you train it on (the error gradient propagates back to the embedding layer and updates the vector weights). Word vectors are only really useful when you have tons of training data to learn good embeddings from. I believe the OP is referring to using Google's pre-trained word vectors (trained on a massive amount of text [100 billion words]). They can be found here: https://code.google.com/p/word2vec/

This is pretty straightforward to implement in Keras; you just need to supply pre-trained word-vector weights to your embedding layer.

  Embedding(vocab_size, 300, weights=[word2vec_weights])
Where 'word2vec_weights' is a numpy matrix with shape (vocab_size, 300).
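A fuller (if rough) sketch, assuming gensim for loading the vectors and a word_index from a fitted Tokenizer (the two-word vocabulary here is just a stand-in):

  import numpy as np
  from gensim.models import KeyedVectors
  from keras.layers import Embedding

  # Google's pre-trained 300-dim News vectors (path is illustrative)
  w2v = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                          binary=True)

  word_index = {'invoice': 1, 'friday': 2}   # normally from your Tokenizer
  vocab_size = len(word_index) + 1

  word2vec_weights = np.zeros((vocab_size, 300))
  for word, i in word_index.items():
      if word in w2v:
          word2vec_weights[i] = w2v[word]

  embedding = Embedding(vocab_size, 300, weights=[word2vec_weights])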


Ah, I honestly assumed it would use an existing set of weights, such as from word2vec. Though I now realize that does not make sense, since you can also embed images or really any real-valued inputs. I will give this a try.


Cool! Let me know if you run into any issues.


Cool project. Another experiment that'd be useful is auto-filling responses to emails based on historical data (Inbox on my phone does an okay job of this). I think it's a similar challenge.


Have you tried random forests?



