From Linear Models to Machine Learning (draft) [pdf] (ucdavis.edu)
151 points by sonabinu on Aug 6, 2016 | 23 comments



Thanks for sharing. I also recommend "An Introduction to Statistical Learning - with Applications in R": http://www-bcf.usc.edu/~gareth/ISL/


A direct link to the PDF for ISL is here:

https://web.stanford.edu/~hastie/local.ftp/Springer/ISLR_pri...

The grownup version of that, ESL, is also available free:

https://web.stanford.edu/~hastie/local.ftp/Springer/ESLII_pr...

And for people who are genuinely curious about how this segues into graphical models, NNs, and the autoencoder (maybe the most interesting part of modern NNs), there's

https://web.stanford.edu/~hastie/StatLearnSparsity_files/SLS...

The more curious or research oriented may appreciate

https://web.stanford.edu/~hastie/local.ftp/hastie_glmnet.pdf

I doubt Gareth or Daniela (the primary authors of ISL) would mind my pointing you towards Hastie's archives since both of them were advised by Trevor Hastie during their PhDs.

Matloff is a great guy. The chapters on shrinkage and dimension reduction aren't yet written in his book, and since these are important topics, you should consider reading the other texts above for them. Those topics are mostly of interest to people who want to draw inference about the underlying processes that may be generating the observed outcomes. If all you care about is prediction, fit a Random Forest, an xgboost GBM, or a DNN and be done with it. But if you're actually curious about how complex descriptions of rare events can be thoughtfully analyzed, this is the standard progression.
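For the pure-prediction route, a minimal Python sketch with scikit-learn (the file and column names here are made up):

    # "Fit a forest or a GBM and be done with it": a prediction-only sketch.
    # The CSV path and the 'outcome' target column are hypothetical.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    df = pd.read_csv("events.csv")
    X = df.drop(columns=["outcome"])
    y = df["outcome"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    for model in (RandomForestClassifier(n_estimators=500, random_state=0),
                  GradientBoostingClassifier(random_state=0)):
        model.fit(X_train, y_train)
        auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        print(type(model).__name__, round(auc, 3))

Nothing here tells you anything about the underlying process; it just predicts, which is exactly the distinction being drawn.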

Matloff's book is a great introduction. I particularly like the example on page 204. /ducks


Has anyone come across something similar using Python?


Here you go: https://github.com/JWarmenhoven/ISLR-python

R is a popular language for stats and learning-theory work in research and academia. Production-wise, it's not as popular.


It's pretty popular.


I've put substantial amounts of R code into production, and it's a nightmare, both for development and operationally. I think 2-3 years ago R was still a superior language for ML/data science dev work, but Python's library support has really caught up and is now mature. The policy I put in place on my current team is to minimize R in production, with Python and Scala preferred. R in some cases still has the best machine learning libraries, which is really the only reason I've found to use it in the production stack. Even then, I prefer to keep it to just a few lines of R code (load the data, build the model, handle errors, export the model).
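One way to keep that R footprint small is to isolate it behind a single script and drive it from the rest of the pipeline. A rough sketch, assuming a hypothetical train_model.R script and file paths:

    # Rough sketch: confine R to "load data, fit model, export model" behind one
    # script, driven from Python. Script name, arguments, and paths are hypothetical.
    import subprocess
    import sys

    result = subprocess.run(
        ["Rscript", "train_model.R", "--input", "train.csv", "--output", "model.rds"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        # Surface R's stderr so the surrounding pipeline sees the failure.
        sys.exit("R training step failed:\n" + result.stderr)
    print("model written to model.rds")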

For analyzing data I love R and almost always prefer it to Python.


> R in some cases still has the best machine learning libraries

Would you mind naming some specific machine learning techniques that are still better in R? I've been studying machine learning and linear algebra the past few months, and I'd love to have a try at implementing one myself in Python, as a learning exercise.


Glmnet and Cox proportional hazards regression (survival analysis) are two recent examples I came across that were missing Python implementations.


Did you look in statsmodels? I appreciate the suggestions, and for a moment I was hopeful about the need for survival analysis models, but it looks like both that and GLM are well covered in the latest version of statsmodels. Don't be misled by the old SourceForge site; there's been a huge flurry of recent activity in statsmodels, with hundreds of new PRs merged. Look at GitHub and the docs site linked from that repo: http://www.statsmodels.org/stable/

There's even a Jupyter notebook comparing the R, Stata (that takes me back; I used Stata in a survival analysis class 10 years ago), and Python versions of proportional hazards regression: http://nbviewer.jupyter.org/urls/umich.box.com/shared/static...
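For reference, a minimal sketch of a Cox fit with statsmodels' PHReg (the data frame and column names are made up; note that matching R's coxph usually requires ties="efron", since statsmodels defaults to Breslow tie handling):

    # Minimal Cox proportional hazards sketch with statsmodels.
    # The CSV and the time/event/age/treatment columns are hypothetical.
    import pandas as pd
    from statsmodels.duration.hazard_regression import PHReg

    df = pd.read_csv("survival.csv")

    # 'time' = follow-up time, 'event' = 1 if the event was observed, 0 if censored.
    model = PHReg.from_formula(
        "time ~ age + treatment",
        data=df,
        status=df["event"].values,
        ties="efron",   # R's coxph() default; statsmodels defaults to Breslow
    )
    print(model.fit().summary())

Differences in tie handling are one common reason the R and Python coefficients don't line up.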


Glmnet has quite a bit of functionality that is lacking in the Python elastic net implementations. The most notable is fitting the model along a whole sequence of regularization parameter values (alpha in statsmodels, lambda in glmnet), which works remarkably well and can be orders of magnitude faster than a traditional grid search.
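The closest Python analogue to glmnet's lambda path is scikit-learn's enet_path / ElasticNetCV, which also use warm-started coordinate descent down a penalty sequence rather than refitting from scratch at each value. A small sketch on synthetic data (note the terminology flip: scikit-learn's alpha is glmnet's lambda, and l1_ratio plays the role of glmnet's alpha):

    # Sketch: elastic net over a full regularization path, in the spirit of
    # glmnet's lambda sequence, using scikit-learn on synthetic data.
    from sklearn.datasets import make_regression
    from sklearn.linear_model import ElasticNetCV, enet_path

    X, y = make_regression(n_samples=200, n_features=50, noise=5.0, random_state=0)

    # enet_path runs warm-started coordinate descent down a decreasing alpha grid,
    # far cheaper than fitting each penalty value independently.
    alphas, coefs, _ = enet_path(X, y, l1_ratio=0.5, n_alphas=100)
    print(coefs.shape)  # (n_features, n_alphas)

    # ElasticNetCV is the rough analogue of cv.glmnet: path fitting plus CV.
    cv_model = ElasticNetCV(l1_ratio=0.5, n_alphas=100, cv=5).fit(X, y)
    print("selected alpha:", cv_model.alpha_)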


Last time we tried Cox PH regression from statsmodels it gave (very) different results than R's, and we weren't comfortable with the lack of tests. We'll give it another try.


Not the OP, but glmnet is still the standard for lasso'ing, and I'm not aware of a Python implementation of post-selection inference and statistical testing yet.

That said, the secret sauce in all of those is FORTRAN.


Thanks for the suggestions, I appreciate it. Alas, at least between scikit-learn and this project there are Python implementations of all of those: https://github.com/selective-inference/Python-software. statsmodels has GLM capabilities, and there are even Python bindings for R's glmnet FORTRAN library.
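As a reference point for statsmodels' GLM coverage, a minimal sketch using the formula interface (the data and column names are made up):

    # Minimal GLM sketch in statsmodels: logistic regression via the formula API,
    # similar in spirit to R's glm(y ~ x1 + x2, family = binomial).
    # The CSV and column names are hypothetical.
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    df = pd.read_csv("claims.csv")
    model = smf.glm("defaulted ~ income + age", data=df, family=sm.families.Binomial())
    print(model.fit().summary())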

I'm not sure how well those compare to the R implementations, but they look well-built at first glance.

Any other ideas out there?


Python does not lack for models. It has broad enough coverage and good enough infrastructure to construct the rest if you need it. Where it falls short is everything else you need in day-to-day modeling: DoE, graphics, utility functions, inference, etc.


Interesting! Great find on the selective inference, I should have known to look. Tucking this away for when it is needed.


I haven't seen that, but maybe it's just me. Which companies do you know of?


eBay, Electronic Arts, Google, Microsoft, and many financial institutions.


Are these production systems? I'm a bit blown away that this can work with R.

http://adv-r.had.co.nz/Performance.html


R is a glue language. The fast bits are always written in C++ or FORTRAN, or wrap BLAS/LAPACK.

R-vs-Python is almost never the problem in production. Interpreted-vs-compiled is almost always the issue. (I'm aware of Numba and similar efforts; last time I tried it, it fell flat. And Theano is a rather specialized tool that most people don't actually need.)

JMHO. But I've never seen anyone dealing with truly huge data and inference problems that had the low-level bits in anything other than C++ or FORTRAN. I could imagine that Scala can do a pretty good job now, especially if you use Spark a lot. But R vs Python seems like a really stupid question. Use the one that has the libraries you need.
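To make the interpreted-vs-compiled point concrete, here's a toy comparison: the same dot product as a Python-level loop versus one NumPy call that dispatches to compiled BLAS (exact timings will of course vary by machine):

    # Illustration of the "glue language" point: the loop runs in the interpreter,
    # while numpy.dot hands the whole computation to compiled BLAS code.
    import time
    import numpy as np

    x = np.random.rand(1_000_000)
    y = np.random.rand(1_000_000)

    t0 = time.perf_counter()
    total = 0.0
    for a, b in zip(x, y):       # interpreted loop over a million elements
        total += a * b
    t1 = time.perf_counter()

    t2 = time.perf_counter()
    total_blas = np.dot(x, y)    # one call into compiled code
    t3 = time.perf_counter()

    print("python loop: %.3fs, numpy dot: %.6fs" % (t1 - t0, t3 - t2))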


This. R vs. Python is a very stupid debate, especially for speed. For example, pandas is slower than data.table. There are plenty of ways to have performant models in R. If you're worried about performance in deployment, run H2O, for example; there are plenty of others. There's also flashr, if you want to write your own algorithms, which swaps out R's basic operators and data management for C++.

Also, Google is doing to R what they did to JavaScript with V8. Expect GA next year.


What is GA?

I think the right answer is both.

And probably Javascript, too.


Yes, those are production systems, like I said. eBay scores their search results. EA scores customer lifetime value, churn, and marketing communications.


The use of non-monospaced fonts for code fragments in LaTeX-composed books must stop.



