I worked as an analyst for an email security company, stopping scams, malware, phishing, etc. that come through email. I liked security and had a machine learning/stats background, so I was eager to apply. What I learned was that, basically, ML and stats are not applicable to this particular problem domain because there is no level of acceptable error. No matter how well my systems performed, whenever a vice president, C-level exec, or any pinhead got a single FP or FN, I had to grovel at their feet and spend weeks explaining to people what happened. I once had 10 months of work go down the drain because the system let through a single phishing email to some secretary (it had blocked millions of malicious messages prior).
I wonder how easily you could reproduce the system you created (how much do you remember? what sorts of resources did it require?) - and how your old employer would react/respond if you did so.
If it really did catch millions of emails successfully then such a system would be highly appreciated by the majority of the market.
It's not your fault if where you worked didn't see the value in what you'd created.
I think there aren't many startups/smaller companies trying to tackle the problem because: 1) there's not much money in it. Email is supposed to be "free", so is security. 2) mail systems, esp at larger/older orgs, tend to be extremely baroque and you'd need full time integration engineers before making a penny. 3) Privacy laws make things difficult. 4) You need data and a wide view of the email landscape, and you only get that at large enterprises, and 5) at those orgs, the mail server operators are just waiting for the day they can retire and leave behind the mess they made.
> I think there aren't many startups/smaller companies trying to tackle the problem
I'd agree there too - in those two specific cases.
> because: 1) there's not much money in it. Email is supposed to be "free", so is security.
Right.
> 2) mail systems, esp at larger/older orgs, tend to be extremely baroque and you'd need full time integration engineers before making a penny.
No... that sounds intimidating, daunting and disillusioning.
> 3) Privacy laws make things difficult.
One word: Gmail.
See next point.
> 4) You need data and a wide view of the email landscape, and you only get that at large enterprises, and
Also one word: Gmail.
They've set what amounts to a revolutionary precedent on perception of privacy and necessary ubiquitous data access.
> 5) at those orgs, the mail server operators are just waiting for the day they can retire and leave behind the mess they made.
See answer for (2).
Two alternative providers immediately come to mind: fastmail and protonmail. (Literally, these are what I remember right now.) The first has a large enough userbase they have interesting problems. The other is security-focused, and probably has adequate raw data to usefully tune an NN.
While poking around for fastmail stories I was sure I'd seen on here (it might've been another provider, but I think it was fastmail), dropbox caught my eye - they would of course need (and have) similar systems to this sort of thing too. This reminded me of sendgrid, and every other email mass-delivery provider that doesn't want to turn into a spam-farm.
Some extra consideration made me think of another idea, though.
I have a small trickle of false positives that land in my gmail spam folder (and get deleted) on a regular basis. This self-perpetuating cycle exists because I simply don't keep on top of what's in there and "not spam" everything that is technically not spam. So, my lack of interception is treated (passive-aggressively?) as agreement, and it happily chews away deleting all my old (almost-literally-spam-but-still) newsletter subscriptions and whatnot. The occasional activation email gets buried in there as well.
The key part here is that I don't feel like keeping on top of the spam thing, which is because I don't really _value_ what's in there. I trust my $mail_provider's spam system way too much to "do the right thing" (the provider is agnostic here) - and I don't mean doing the right thing morally, I mean it in the sense of "do what I mean". Which it doesn't, even though I wish it did.
I wonder how many other people feel the same [ambivalent] way about spam... and wouldn't mind simply just shoveling _everything_ in their spam folder to a "verification" service? The idea would be, you have a custom client that connects to and only reads from the spam folder, and then it un-trashes things it decides are not spam.
Gmail seems to provide API integrations to fetch only spam (pre-filled fields: https://developers.google.com/gmail/api/v1/reference/users/m... - I tried selecting for the "SPAM" and "TRASH" tags, but it seems to want a query of "in:spam" instead) and then trash/untrash messages (see bottom of opened node in list at left of that page). Naturally you can do this via POP/IMAP too, but then it really is your word against your code as to what you're accessing (the query above currently only returns 15 results for me, which I just checked and is correct).
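A rough sketch of that list-then-rescue loop in Python, under some assumptions: `service` is taken to be a Gmail API client built with google-api-python-client (`build("gmail", "v1", credentials=...)`), `looks_legitimate` is a hypothetical callback standing in for whatever classifier you trust, and pulling a message out of spam by removing the `SPAM` label through `messages.modify` is my guess at the right call rather than something verified in this thread.

```python
from typing import Callable, Iterator, List


def iter_spam_ids(service, user_id: str = "me") -> Iterator[str]:
    """Yield the ID of every message matching the search query "in:spam"."""
    page_token = None
    while True:
        params = {"userId": user_id, "q": "in:spam"}
        if page_token:
            params["pageToken"] = page_token  # continue from the previous page
        resp = service.users().messages().list(**params).execute()
        for msg in resp.get("messages", []):
            yield msg["id"]
        page_token = resp.get("nextPageToken")
        if not page_token:
            break


def rescue_false_positives(
    service,
    looks_legitimate: Callable[[str], bool],  # hypothetical classifier hook
    user_id: str = "me",
) -> List[str]:
    """Move messages the callback approves from the spam folder to the inbox."""
    rescued = []
    # Materialise the IDs first so relabelling does not disturb pagination.
    for msg_id in list(iter_spam_ids(service, user_id)):
        if looks_legitimate(msg_id):
            service.users().messages().modify(
                userId=user_id,
                id=msg_id,
                body={"removeLabelIds": ["SPAM"], "addLabelIds": ["INBOX"]},
            ).execute()
            rescued.append(msg_id)
    return rescued
```

In practice the callback would fetch the message body before deciding; that part is left out here because it depends entirely on the classifier.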
As per https://developers.google.com/gmail/api/v1/reference/quota you can perform a cumulative total of 200,000,000 list+trash+untrash operations a day; each of them uses 5 of the 1 billion quota units available daily. The main concern, if you scaled enough to approach this limit, is that asking for higher quota might provoke a "yes" in the form of a purchase offer :P which would have both pros and cons (stares really hard at the second word).
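The arithmetic behind that figure, as a tiny Python sketch. The two constants are just the numbers quoted above from the quota page; the per-user workload in the usage line is a made-up illustration, not a measurement.

```python
DAILY_QUOTA_UNITS = 1_000_000_000  # daily quota units, as quoted above
UNITS_PER_CALL = 5                 # cost quoted above for list, trash and untrash


def max_daily_calls(daily_units: int = DAILY_QUOTA_UNITS,
                    units_per_call: int = UNITS_PER_CALL) -> int:
    """How many calls of a given cost fit in one day's quota."""
    return daily_units // units_per_call


def max_users(calls_per_user_per_day: int) -> int:
    """How many users fit in the quota for a hypothetical per-user workload."""
    return max_daily_calls() // calls_per_user_per_day


print(max_daily_calls())  # 200000000
print(max_users(100))     # 2000000, assuming 100 calls per user per day
```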
But besides Gmail, verification is kind of cross-platform and cross-vendor - and anything with scoped-list and untrash APIs could be integrated into this... in theory. I trust your mentioning of in-house mail being baroque, and I'm curious to hear how terrible this is in practice.
Perhaps you could have a cool UI that lets power users describe why the email is not spam - break the ranking metrics out into the UI in some way and let people provide feedback in useful (machine-readable!) terms that can (properly aggregated) be directly folded back into the network to train it.
I know Gmail has a plugin/extension API now, although I don't know how useful it would be to implement a UI for this sort of thing.
The hacky high-maintenance route would be Chrome/FF extensions.
Of course neither would work on mobile; to create a smooth experience there you'd basically have to implement a from-scratch email client with the custom spam views baked in.
I agree it's not a 'silver bullet', though I don't know what you mean exactly by first phase. That's why I stopped using any trendy terms like machine learning or convolutional neural networks (which is exactly what I was doing), because our sales engineers will sell it as if it's magic. It was easier to claim that it's a regular expression change. Of course, this meant I had to use my personal GPUs for training because I couldn't get funding.
The malware example would be you have limited sandbox space, so you have some first phase detections before doing the more expensive dynamic analysis.
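One way to picture that first phase, as a hedged Python sketch: score every sample with something cheap, then spend the limited sandbox slots on the highest scorers. The names `cheap_score`, `sandbox_slots` and `min_score` are all hypothetical, and the scoring function is whatever inexpensive detector you already have.

```python
import heapq
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def triage_for_sandbox(
    samples: Iterable[T],
    cheap_score: Callable[[T], float],  # hypothetical cheap first-phase detector
    sandbox_slots: int,                 # how many samples the sandbox can take
    min_score: float = 0.5,             # illustrative cut-off, not a tuned value
) -> List[T]:
    """Return the most suspicious samples, at most `sandbox_slots` of them."""
    scored = [(cheap_score(s), s) for s in samples]
    suspicious = [pair for pair in scored if pair[0] >= min_score]
    top = heapq.nlargest(sandbox_slots, suspicious, key=lambda pair: pair[0])
    return [sample for _, sample in top]
```

Everything below `min_score` skips the expensive dynamic analysis entirely, which is the point of having a first phase.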
For mail classification you could use ML on non-spam messages and flag them as suspicious. Then that signal could be corroborated by volume, being sent by multiple senders in volume, or a user manually classifying content matching that as spam. While ML can't give perfect true positive and false positive rates, it can be combined with other signals.
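A minimal Python sketch of that corroboration idea, assuming the ML score only ever flags a message and a second signal is needed before anything is blocked. The field names and thresholds are invented for illustration and would need tuning against real traffic.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    ml_score: float        # classifier output in [0, 1]
    copies_seen: int       # near-identical messages observed so far
    distinct_senders: int  # how many senders those copies came from
    user_reports: int      # users who manually marked matching content as spam


def verdict(
    e: Evidence,
    ml_threshold: float = 0.8,    # hypothetical
    volume_threshold: int = 100,  # hypothetical
    sender_threshold: int = 5,    # hypothetical
) -> str:
    """Return "deliver", "flag" (suspicious only) or "block" (corroborated)."""
    if e.ml_score < ml_threshold:
        return "deliver"
    in_volume = (
        e.copies_seen >= volume_threshold
        and e.distinct_senders >= sender_threshold
    )
    corroborated = in_volume or e.user_reports > 0
    return "block" if corroborated else "flag"
```

With these rules a high-scoring but one-off message stays at "flag" rather than "block", since nothing else backs up the classifier.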
ML is generally not used for serious AV systems; they are signature/pattern based.
> Then that signal could be corroborated by volume
The more serious attacks are usually very low volume, sometimes unique to the victim.
> user manually classifying content matching that as spam
Manual classification is generally not possible because of data privacy issues and sheer volume, unless you mean by the end-user... but that's a hard sell (what are you selling at that point?)