
Since neural nets are winning at the moment, it's easy to see SVMs as an underdog, being ignored due to deep learning hype and PR. This is kind of true, but it's worth noting that 10-15 years ago we had the exact opposite situation. Neural nets were a once promising technique that had stagnated/hit their limits, while SVMs were the new state of the art.

People were coming up with dozens of unnecessary variations on them, everybody in the world was trying to shoehorn the word "kernel" into their paper titles, using some kind of kernel method was a surefire way to get published.

I wish machine learning research didn't respond so strongly to trends and hype, and I also wish the economics of academic research didn't force people into cliques fighting over scarce resources.

I'm still wondering what, if anything, is going to supplant deep learning. It's probably an existing technique that will suddenly become much more usable due to some small improvement.




This is true, but only in the academic research world. SVMs had relatively little success on practical problems and in industry, so they never built up the kind of standing that neural networks did. Even in 2003-2005 - arguably the peak time for SVMs - neural networks were much better known to almost everyone (industry practitioners, researchers, and laypeople) than SVMs.

What frustrates me is that people who are starting out in machine learning often never learn that linear/logistic regression dominates the practical applications of ML. I've spoken to people who know the ins and outs of various deep network architectures but don't even know how to start building a baseline logistic regression model.
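For what it's worth, a baseline is only a few lines with scikit-learn - a minimal sketch, with a bundled toy dataset standing in for real data:

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Baseline: plain logistic regression before reaching for anything deeper.
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
    print("baseline accuracy:", clf.score(X_test, y_test))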


What prevented SVMs from catching on in industry?


The TL;DR answer: SVMs are theoretically powerful, but in practice either impractical (non-linear kernels) or pointless (linear kernels).

The longer answer:

SVMs come with two theoretical benefits:

1. A guaranteed (globally) optimal solution, since the training problem is convex. You also get this with simpler techniques like logistic regression.

2. The ability to use non-linear kernels, which can offer much more power than logistic regression (on par with neural networks).

So, at first, it seemed like SVMs were the best of all worlds, and a lot of people got excited. In practice, though, non-linear kernels slowed training down to the point of being impractical. Linear kernels were fast enough, but removed the second benefit, so most people would prefer to use established linear techniques like logistic regression.
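To make the speed gap concrete, here's a rough scikit-learn sketch on synthetic data (sizes are arbitrary); kernel SVM training time grows roughly quadratically or worse with the number of samples, while the linear models stay cheap:

    import time
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC, LinearSVC

    X, y = make_classification(n_samples=20_000, n_features=50, random_state=0)

    for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                      ("linear SVM", LinearSVC()),
                      ("RBF-kernel SVM", SVC(kernel="rbf"))]:
        start = time.time()
        clf.fit(X, y)
        print(f"{name}: {time.time() - start:.1f}s to train")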


I've been in the industry for quite a while, and I think one of the often-overlooked non-technical reasons is that SVMs are hard to get an intuitive feel for, compared to ANNs. This might seem irrelevant, but think about it - someone without a degree or rigorous training in ML (and this population is big in the industry: people looking to get into ML from, say, analytics or software dev) will try out the things they can identify with. ANNs are positioned exactly right for this - they're sufficiently sophisticated, you can quickly get a high-level idea, and there's the attractive comparison to how our minds work. So that's what people try out first among the advanced algorithms. The industry doesn't give you a lot of time to explore, so once you've invested the time to pick up ANNs, you tend to hold on to that knowledge.

I've also conducted multiple training sessions/discussions with small groups on ML, and that experience supports what I've said. SVMs are hard to explain to a general crowd; with ANNs I have enough visual cues to get people started. Sure, they might not get the math right away, but they understand enough to be comfortable using a library.

As an aside, a comment here mentions that logistic/linear regression dominates the industry. I think that's for a similar reason: they're simple to understand and try out. In my experience, that doesn't make them good models on a bunch of real-world problems.

Now if you ask me about the technical cons of SVMs, I'd say: the scalability of non-linear kernels, and the fact that I have to cherry-pick kernels. Linear and RBF kernels work well on most problems, but even then, for a bunch of problems where RBF seems to work well, the number of support vectors stored by the model can be massive. Even if I weren't pedantic about the kernel seeming to "memorize" more than "learn", such a model is still a beast to run in real time. nu-SVMs address this issue to an extent, but then we're back to picking the right kernel for the task. This is one thing I love about ANNs - the kernel (or what essentially is the kernel) is learned.
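For what it's worth, the support-vector blow-up is easy to see with scikit-learn; a quick sketch on synthetic data (counts will vary):

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC, NuSVC

    X, y = make_classification(n_samples=5_000, n_features=20, flip_y=0.1, random_state=0)

    rbf = SVC(kernel="rbf").fit(X, y)
    print("RBF SVM support vectors:", rbf.n_support_.sum(), "of", len(X))

    # nu trades margin errors against the fraction of support vectors kept
    nu = NuSVC(nu=0.1, kernel="rbf").fit(X, y)
    print("nu-SVM support vectors:", nu.n_support_.sum(), "of", len(X))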


Also, hyperparameter optimization is a bit painful. Zoubin Ghahramani of Cambridge is a champion of Gaussian process models, which are as flexible as SVMs/SVCs but take a more disciplined approach: well-chosen priors (Bayesian structure) and single-step hyperparameter optimization.
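A rough illustration of the "single-step" part: scikit-learn's GP classifier tunes its kernel hyperparameters by maximizing the log marginal likelihood inside fit(), rather than via an outer grid search. A minimal sketch on toy data:

    from sklearn.datasets import make_moons
    from sklearn.gaussian_process import GaussianProcessClassifier
    from sklearn.gaussian_process.kernels import RBF

    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

    # The kernel length scale here is only a starting point; fit() optimizes it
    # against the log marginal likelihood.
    gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0)).fit(X, y)
    print("optimized kernel:", gpc.kernel_)
    print("log marginal likelihood:", gpc.log_marginal_likelihood_value_)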


1. SVMs are not interpretable, just like DL.

2. Hard to parallelize if you're using kernels other than the linear one.

3. So-so performance.


They tend to create difficult-to-interpret models that don't perform as well as other "black box" modeling methods (GBMs, neural nets, etc.).


This is not really true. Aside from ensemble models, they tend to perform pretty much on par or better. Here's an extensive comparison by Rich Caruana [1]

[1] https://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.ic...


Was this true, or perceived as true in 2003? My understanding was that people did not see them as performing worse than NN back then.


Definitely true. I worked for a company that was generating millions of dollars a year from neural networks in the mid 90s (edit: to be clear I didn't work there in the 90s, I joined years after their initial buildouts). The Unreasonable Effectiveness™ of neural networks has been true for a long time. When I worked there I tried switching out some models with SVMs and they were less accurate and took 1-2 orders of magnitude more time to train.


Really useful to hear, thanks! I know psychology was gaining a lot of headway with NN models in the 90s, but had little sense for what was going on in industry.


Weren't SVMs used for the NetFlix recommendation engine? (i.e. the NetFlix prize?)


I think it was gradient boosted decision trees, used to combine the different models.


No, it was an ensemble of collaborative filtering via matrix factorization (SVD) and an RBM.


err.. I think it is an evolution. Deep architectures allow for a more efficient function approximation from fewer examples than shallow architectures do.

Deep (neural) networks are really just a generalization of machine learning (on graphs). The key is that we learn the similarity function used to discriminate between examples instead of specifying it a priori. You can also build classifiers in other ways: linear/logistic regression based on feature vectors, or by providing some similarity metric (SVMs). But in those cases you are providing the discriminator function yourself.

For example, in SVMs you have to provide a similarity measure in the form of the "kernel", which maps your feature vectors into a higher-dimensional space where the examples can be separated.
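The widely used RBF kernel, for instance, is just such a hand-picked similarity function (a tiny sketch; gamma is an arbitrary choice):

    import numpy as np

    def rbf_kernel(x, y, gamma=0.5):
        # similarity specified a priori: exp(-gamma * ||x - y||^2)
        x, y = np.asarray(x), np.asarray(y)
        return np.exp(-gamma * np.sum((x - y) ** 2))

    print(rbf_kernel([1.0, 2.0], [1.1, 1.9]))   # close to 1: "similar"
    print(rbf_kernel([1.0, 2.0], [5.0, -3.0]))  # close to 0: "dissimilar"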

In deep neural networks we don't really know (or care, to some extent) what the optimal feature vectors are or what the correct similarity metric is. We only care that at the end we've encoded it correctly (e.g. having chosen enough parameters/layers, etc.) after training. Again, the NN is just some general function mapping inputs to outputs, built from layers of learned transformations (with a softmax at the output).

Yann LeCun has a great paper (2007) explaining this:

http://yann.lecun.com/exdb/publis/pdf/bengio-lecun-07.pdf.


Hell, you can (efficiently) solve a lot of the problems people are throwing dedicated hardware + TensorFlow at with plain KNN. But that's not cool. I've learned that doing the "cool stuff" in the face of practicality is rampant, because it's part of the self-fulfilling cycle of hiring people who do the cool stuff and of people doing the cool stuff to get hired.
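To be concrete about the KNN point: a plain nearest-neighbour baseline is a handful of lines and needs no special hardware. A sketch on a bundled toy dataset:

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    print("KNN accuracy on digits:", knn.score(X_test, y_test))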

There's a whole world beyond neural networks and it seems like it's mostly all on the backburner now. Which I understand, the resurgence in NN algorithms, approaches and hardware in the last 10 years has been exciting but it does feel like tunnel vision sometimes.


Fashion statements come and go in academia. There are academics doing good research in unfashionable areas and doing just fine. Plus, it's hard to see where the latest trends will go. Who thought Software Defined Networks would be a thing in the 1980s?


> I wish machine learning research didn't respond so strongly to trends and hype,

It's really because nobody actually understands what's going on inside an ML algorithm. When you give it a ginormous dataset, what data is it really using to make its determination of

[0.0000999192346 , .91128756789 , 0 , .62819364 , 32.8172]

Because what I do for ML is a supervised fit, then use a held-out set to test and confirm fitness, then unleash it on unseen data and check the results. But I have no real understanding of what those numbers actually represent. I mean, does .91128756789 represent the curve around the nose, or is it skin color, or is it a facial encoding of 3D shape?

> I'm still wondering what, if anything, is going to supplant deep learning.

I think it'll be a slow climb to actual understanding. Right now, we have object identifiers built on NNs. They work, after TBs of images and petaflops of CPU/GPU time. It's only brute force with "magic black boxes" - and that provides results but no understanding. The next steps are actually deciphering what the understanding is, or making straight-up algorithms that can differentiate between things.


Shouldn't it be possible to backpropagate those categorical outputs all the way back to the inputs/features (NOT the weights) after a forward pass, to localize their sensitivity with respect to the actual pixels of a prediction? I imagine that would have to give at least some insight.
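That's essentially gradient-based saliency: backpropagate the predicted class score to the input and look at the magnitude of the gradient per pixel. A rough PyTorch sketch, with a toy CNN and random input standing in for a real model and image:

    import torch
    import torch.nn as nn

    # Stand-ins; in practice use a trained classifier and a real image tensor.
    model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
    x = torch.rand(1, 3, 32, 32, requires_grad=True)

    logits = model(x)
    logits[0, logits.argmax()].backward()     # backprop the top class score to the input
    saliency = x.grad.abs().max(dim=1)[0]     # per-pixel sensitivity map, shape [1, 32, 32]
    print(saliency.shape)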

Beyond that, the repeated convolution/max-pool steps can be understood as applying something akin to a multi-level wavelet decomposition, which is pretty well understood. That's how classical matched filtering, Haar cascades, and a wide variety of preceding image-classification methods operated in their first steps, too.
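For reference, a single level of that kind of decomposition is tiny to write down; a sketch of one 2-D Haar step (normalization conventions vary):

    import numpy as np

    def haar_step(img):
        # One level of a 2-D Haar decomposition: a coarse average plus
        # horizontal, vertical, and diagonal detail bands.
        a, b = img[0::2, 0::2], img[0::2, 1::2]
        c, d = img[1::2, 0::2], img[1::2, 1::2]
        avg   = (a + b + c + d) / 4
        horiz = (a - b + c - d) / 4   # differences across columns (vertical edges)
        vert  = (a + b - c - d) / 4   # differences across rows (horizontal edges)
        diag  = (a - b - c + d) / 4
        return avg, horiz, vert, diag

    bands = haar_step(np.random.rand(64, 64))
    print([band.shape for band in bands])   # four 32x32 bands; recurse on avg for more levels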

CNNs/deep learning really don't seem like a black box at all when examined in sequence. To me, at least, randomized ensemble methods (random forests, etc.) are actually a bit more mysterious in how well they perform out of the box, with little tuning.


I'm in no way a researcher or even an enthusiast of machine learning, but I'm pretty sure that I came across a paper posted on HN a few days ago that did exactly what you and the parent poster are describing, figuring out what pixels contributed most to some machine learning algorithm. I'll try and see if I can find it.

Edit: yep, found it.

SmoothGrad: removing noise by adding noise, https://arxiv.org/abs/1706.03825

Web page with explanations and examples

https://tensorflow.github.io/saliency/

I couldn't find the HN thread, but there was no discussion as far as I remember.


Bagging and bootstrap ensemble methods aren't really that confusing. Just think of it as stochastic gradient descent on a much larger hypothetical data set.

The effect is the same one that occurs when you get a group of people together to estimate the number of jelly beans in a jar. All the estimators are biased, but if that bias is drawn from a zero-mean distribution, the deviation of the averaged estimate goes down as the number of estimators increases.
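A quick numerical illustration of that effect (made-up numbers: each guess is the true count plus zero-mean noise):

    import numpy as np

    rng = np.random.default_rng(0)
    true_count = 1000

    for n in (1, 10, 100, 1000):
        # 10,000 trials, each averaging the guesses of n noisy estimators
        guesses = true_count + rng.normal(0, 150, size=(10_000, n))
        spread = guesses.mean(axis=1).std()
        print(n, "estimators -> spread of the average:", round(spread, 1))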


I think you might be on to something, but the big problem here is that the input is hundreds of GB or TBs. It's hard to understand what a feature is, or even why it's selected.

I can certainly observe what's being selected once the state machine is generated, but I have no clue how it was constructed to make the features. To determine that, I'd have to watch the state of the machine as it "grows" to the final result.


>People were coming up with dozens of unnecessary variations on them, everybody in the world was trying to shoehorn the word "kernel" into their paper titles, using some kind of kernel method was a surefire way to get published.

I took Andrej Karpathy's tutorial on neural nets and learned that an SVM is basically one neuron.

http://karpathy.github.io/neuralnets/
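Concretely, a single linear unit trained with hinge loss plus L2 regularization is a linear SVM. A toy sketch with made-up data and an arbitrary learning rate:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)    # labels in {-1, +1}

    w, b = np.zeros(2), 0.0
    lr, lam = 0.1, 0.01
    for _ in range(200):
        margins = y * (X @ w + b)
        viol = margins < 1                             # points violating the margin
        # subgradient of mean hinge loss + (lam / 2) * ||w||^2
        grad_w = lam * w - (y[viol][:, None] * X[viol]).sum(axis=0) / len(X)
        grad_b = -y[viol].sum() / len(X)
        w, b = w - lr * grad_w, b - lr * grad_b

    print("training accuracy:", (np.sign(X @ w + b) == y).mean())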


Would you say machine learning teens to overfit?


Ugh too late to edit, but that was supposed to be tends.



