Organizing My Emails with a Neural Net (andreykurenkov.com)
147 points by andreyk on Jan 23, 2016 | 32 comments



POPFile is worth looking at if you found this interesting.

It is a general-purpose naïve Bayesian email classifier that you can integrate with almost any email system.

They took some of the concepts in the article mentioned here and expanded on them a bit.

For example, they have the idea of "pseudowords"[1] so that you're working with more than just the words in the email. Take html:td, for example: it expands to the number of HTML table cells in an email, which might help with choosing a bucket.

[1] http://getpopfile.org/docs/faq:pseudowords
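As a rough illustration of the concept (just a sketch, not POPFile's actual Perl implementation; the function and regexes here are made up):

  import re

  def tokenize_with_pseudowords(raw_email):
      # Ordinary word tokens
      tokens = re.findall(r'[a-z]+', raw_email.lower())
      # Add one html:td pseudoword per HTML table cell in the message
      tokens += ['html:td'] * len(re.findall(r'<td\b', raw_email, re.IGNORECASE))
      return tokens

  print(tokenize_with_pseudowords('Hi <table><tr><td>1</td><td>2</td></tr></table>'))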


Thanks for mentioning POPFile, good to hear my old code mentioned on Hacker News. I wrote the original code for POPFile in Visual Basic(!) in 2001 (it was called AutoFile back then), as I wrote it initially for my own use with Microsoft Outlook. I got the idea of using Naive Bayes for my email classification after seeing the ifile extension to exmh(1); ifile dates to 1996.

I rewrote POPFile in Perl and made it work with generic POP3 servers (hence the name) and later with IMAP and more. It is still actively maintained by a group of people I've never met(!) with the last release in December: http://getpopfile.org/

(1) http://qwone.com/~jason/papers/ifile00.pdf


Nice, thanks for the link. I did find a few research papers and the like on this topic, but none of those seemed to have very interesting approaches/results. This one looks quite interesting, though. It would be fun to compare POPFile to my simple approach on a publicly available dataset (such as Enron), perhaps.


CRM114 is an interesting mail-sorting tool, although I don't think it has been updated in years:

http://crm114.sourceforge.net/


Orthogonal sparse bigrams are another interesting tokenization method; the authors of this paper tested them against CRM114 with good results:

http://www.siefkes.net/ie/winnow-spam.pdf


Orthogonal sparse bigrams (OSBs) have been implemented in CRM114 since the publication of that paper.


This is really awesome-- I work at Nylas and would love to turn this into a plugin for N1 (https://nylas.com/n1). If the author's hanging out in this thread, please email me :)


As a Nylas user...please do!


Done! :)


Is it common practice to use the most frequent words as features? It looks like they don't carry much information, by definition. As a first naive approach, I'd rank the words by the inverse of how many categories they appear in, factor in the overall frequency with a weight or something more clever, and then take the top N.

Well, thinking more about it leads me to tf-idf and naive Bayes (of course), at which point you pretty much already have a classifier. So it seems feature selection is learning in itself and defines the maximum accuracy you'll be able to reach? This is borderline philosophical, but I'd love to read more about these matters. Pointers welcome!
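For what it's worth, a rough sketch of that ranking on toy data (the emails and category labels here are made up):

  from collections import Counter, defaultdict

  # Toy labelled corpus: (category, email text) pairs
  emails = [('financial', 'your statement is ready'),
            ('financial', 'payment received for your invoice'),
            ('personal',  'are you free for dinner on friday'),
            ('academic',  'the paper review is due friday')]

  freq = Counter()
  cats = defaultdict(set)
  for cat, text in emails:
      for word in text.split():
          freq[word] += 1
          cats[word].add(cat)

  # Overall frequency weighted by the inverse of the number of categories
  # a word appears in; keep the top N as features
  N = 5
  score = {w: freq[w] / len(cats[w]) for w in freq}
  print(sorted(score, key=score.get, reverse=True)[:N])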


> Is it common practice to use the most frequent words as features? It looks like they don't carry much information, by definition.

The common practice with a small-ish dataset is to use e.g. the top 10k or 20k most frequent words, but filter out the top 50-100 or so most frequent words, as those indeed do not carry much information. A commonly used weighting scheme is TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf), which comes included in Keras.
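A minimal sketch of that setup with the Keras Tokenizer (the two emails are placeholders, and zeroing out the most frequent columns is just one way to drop them):

  from keras.preprocessing.text import Tokenizer

  texts = ['are we still on for lunch friday?', 'your invoice is attached']

  # Keep the 20k most frequent words, weighted by TF-IDF
  tokenizer = Tokenizer(num_words=20000)
  tokenizer.fit_on_texts(texts)
  X = tokenizer.texts_to_matrix(texts, mode='tfidf')

  # The Tokenizer indexes words by frequency rank (1 = most frequent),
  # so zeroing columns 1-100 drops the ~100 most common words
  X[:, 1:101] = 0.0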

Anyway, this is a cool ML starter project. Keras makes it really easy to do this sort of fast experimentation with a range of different neural network models.


As a side note, early on (in the first version of this project) I tried doing something significantly fancier: taking the most frequent words in each category of the training set that also accounted for a large fraction of that word's occurrences across all the emails (so words with high counts, of which at least 10-25% came from a single category). It seems like something like this should be better than just taking the most common words, but either the parameters I chose or my code itself was no good. Just taking a lot of the most frequent words and doing feature selection probably results in fairly similar features, though...


This is very impressive in itself, but what is the practical use? I mean, you probably could get (almost) the same result with much less work. For instance, "Academic" is every mail sent from an .edu domain, "financial" is everything that comes from your bank, "personal" mails come from a sender that is in your personal address book, and so on.

In short, there are "good enough" rules that require much less processing.
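Something along these lines, say (the domains and address book here are made up):

  def rule_label(sender, address_book):
      domain = sender.split('@')[-1].lower()
      if domain.endswith('.edu'):
          return 'Academic'
      if domain in {'mybank.com', 'paypal.com'}:   # whatever your bank uses
          return 'Financial'
      if sender in address_book:
          return 'Personal'
      return 'Unsorted'

  print(rule_label('prof@example.edu', {'friend@gmail.com'}))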


> I would really like it if gmail indeed had such a machine-learned approach to suggesting a category for each email for one-click email organizing

Yes, but I guess you need to train it first. And if you have been bad at categorizing in the first place, you will start with bad training data.


That confusion matrix makes Edward Tufte so very, very sad.

How on earth am I to tell from that visualization how often it mislabels financial emails as personal? Eyedropper the colors and hope the values in the shades of blue follow a linear scale?


As someone who regularly encounters these types of diagrams, I guess that I'm too close to the issue, because I'm not seeing the problem. To answer your question, financial e-mails are more likely to be mislabelled as personal than as professional, but not as often as it labels them correctly.

A cursory glance at the central diagonal tells me that Finance, Personal/Programming, Professional/EPFL, and Group work are the categories of e-mail most likely to be categorized incorrectly by the software. Looking at the columns tells me that Academic and Personal are going to have the most misfiled messages in them.


> To answer your question, financial e-mails are more likely to be mislabelled as personal than as professional, but not as often as it labels them correctly.

Yes, but how likely is each mislabelling? There is no scale to indicate how the colors map onto probabilities.

As well as providing a scale, it would be helpful to make the heatmap an annotated heatmap, in which each square is labelled with the corresponding value (perhaps with values below some threshold omitted to reduce clutter).

Example: https://web.stanford.edu/~mwaskom/software/seaborn/_images/s...

(from the Seaborn docs: https://web.stanford.edu/~mwaskom/software/seaborn/generated...)
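Something as simple as this would already answer those questions (the numbers here are made up, obviously):

  import numpy as np
  import seaborn as sns
  import matplotlib.pyplot as plt

  # Hypothetical 3-class confusion matrix, just to show the annotations
  cm = np.array([[120, 10,  5],
                 [ 15, 90,  8],
                 [  3,  7, 60]])
  labels = ['Financial', 'Personal', 'Professional']

  sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
              xticklabels=labels, yticklabels=labels)
  plt.xlabel('Predicted label')
  plt.ylabel('True label')
  plt.show()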

Edit: Consider another question: if an email has the true label 'Financial', is it more likely to be mislabelled or correctly labelled? I can guess, but without more knowledge of the scale I can't be certain.


If this were a formal paper, and if the code weren't freely available, I'd agree with you.

As it is, the specifics of how it mislabels his e-mail are not all that interesting to me, as I don't have access to his e-mail, so the specific numbers are pretty much irrelevant.


I agree with you that it would be saner to have a scale, but as others commented, I included it more to give a rough picture of the results than as something you can draw a lot of specific conclusions from. To be honest, for some reason the scale showed up incorrectly, right in the middle of the matrix, so I just decided to omit it and leave it intuitively clear that dark blue = a lot.


>I would really like it if gmail indeed had such a machine-learned approach to suggesting a category for each email for one-click email organizing

I don't think you can create custom categories, but Google's inbox.google.com does this.


Why not just use all the words?


There are two main reasons to try to pare it down: processing time, and the fact that if you manage to pare it down to the words that are actually relevant, you reduce the chance of over-fitting to features that are actually irrelevant (e.g. differences in the frequency of words like "is" are quite likely to be irrelevant; but of course you do this kind of filtering at your peril - what might seem irrelevant could also turn out to be highly significant in context, so it's hard to get right).


In my experience it's usually best to start with all the words. If you use a decent implementation that supports sparse vectors it's no problem, certainly not for these sorts of data sizes.

You'll usually end up with a frequency threshold, but it's best to trim at the very low end - like words occurring 5 or fewer times. Further over-fitting can be controlled with regularization and parameter averaging.
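With scikit-learn, for example, that kind of cutoff is a single parameter (toy emails below; with real mail you'd set min_df to 5 or so):

  from sklearn.feature_extraction.text import CountVectorizer

  texts = ['lunch on friday?', 'lunch next friday works',
           'your invoice is attached', 'invoice for last month attached']

  # min_df counts documents, so this drops words seen in only one email;
  # the result is a SciPy sparse matrix, cheap even for huge vocabularies
  vectorizer = CountVectorizer(min_df=2)
  X = vectorizer.fit_transform(texts)
  print(sorted(vectorizer.vocabulary_))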


This was my reasoning as well - at some point using more words does not help (but takes more time), and with the long-tail distribution of words you could potentially go up to hundreds of thousands of features and overfit badly.


Glad you came to the conclusion that I always seem to come to: it's all about the features. Engineer great features and you'll get great results; use bad features and you'll get bad results.


I see his conclusion is that deep learning doesn't work here, but it might be interesting to incorporate word2vec features into this and see if the performance doesn't increase (or at least hover around 94%).


I ran one experiment with an Embedding layer (the second one in the Deep Learning Is No Good Here section), which I assumed to be analogous to word2vec. I hoped it would help as well, so I was rather disappointed that it did not seem to. Perhaps I should have tried a few more configurations, though.


An Embedding layer by itself will try to learn word vectors from the data you train it on (the error gradient propagates back to the embedding layer and updates the vector weights). Word vectors are only really useful when you have tons of training data to learn good embeddings from. I believe the OP is referring to using Google's pre-trained word vectors (trained on a massive amount of text [100 billion words]). They can be found here: https://code.google.com/p/word2vec/

This is pretty straightforward to implement in Keras; you just need to supply pre-trained word-vector weights to your embedding layer.

  Embedding(vocab_size, 300, weights=[word2vec_weights])
Where 'word2vec_weights' is a numpy matrix with shape (vocab_size, 300).
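A fuller (if rough) sketch, assuming gensim for loading the vectors and a word_index from a fitted Tokenizer (the two-word vocabulary here is just a stand-in):

  import numpy as np
  from gensim.models import KeyedVectors
  from keras.layers import Embedding

  # Google's pre-trained 300-dim News vectors (path is illustrative)
  w2v = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                          binary=True)

  word_index = {'invoice': 1, 'friday': 2}   # normally from your Tokenizer
  vocab_size = len(word_index) + 1

  word2vec_weights = np.zeros((vocab_size, 300))
  for word, i in word_index.items():
      if word in w2v:
          word2vec_weights[i] = w2v[word]

  embedding = Embedding(vocab_size, 300, weights=[word2vec_weights])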


Ah, I honestly assumed it would use an existing set of weights, such as from word2vec. Though I now realize that does not make sense, since you can also embed images or really any real-valued inputs. I will give this a try.


Cool! Let me know if you run into any issues.


Cool project. Another experiment that'd be useful is auto-filling responses to emails based on historical data (Inbox on my phone does an okay job of this). I think it's a similar challenge.


Have you tried random forests?



