And for people who are genuinely curious about how this segues into graphical models, NNs, and the autoencoder (maybe the most interesting part of modern NNs), there's ESL (The Elements of Statistical Learning).
I doubt Gareth or Daniela (the primary authors of ISL) would mind my pointing you towards Hastie's archives since both of them were advised by Trevor Hastie during their PhDs.
Matloff is a great guy. The chapters on shrinkage and dimension reduction aren't yet written in his book, and since these are important topics, you should consider reading the other books as well. These topics are mostly of interest to people who want to draw inferences about the underlying processes that may be generating the observed outcomes. If all you care about is prediction, fit a Random Forest or an xgboost GBM or a DNN and be done with it. But if you're actually curious about how complex descriptions of rare events can be thoughtfully analyzed, this is the standard progression.
Matloff's book is a great introduction. I particularly like the example on page 204. /ducks
I've put substantial amounts of R code into production - it's a nightmare, both for development and operationally. I think 2-3 years ago R was still a superior language for ML/data science dev work, but Python's library support has really caught up and is now mature. The policy I put in place on my current team is to minimize R in production, with Python and Scala preferred. R in some cases still has the best machine learning libraries, which is really the only reason I've found to use it in the production stack. Even then, I prefer to keep it to a few lines of R code (load the data, build the model, handle errors, export the model - sketched below).
For analyzing data I love R and almost always prefer it to Python.
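For the curious, here's a minimal sketch of that thin production-script pattern, written in Python for consistency with the rest of the thread; the file names, column names, and model choice are all illustrative, not from the comment above:

    # Thin production modeling script: load the data, build the model,
    # handle errors, export the model (illustrative names throughout)
    import pickle
    import sys

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    try:
        df = pd.read_csv("training_data.csv")           # load the data
        X, y = df.drop(columns=["target"]), df["target"]
        model = RandomForestRegressor(n_estimators=200)
        model.fit(X, y)                                 # build the model
        with open("model.pkl", "wb") as f:
            pickle.dump(model, f)                       # export the model
    except Exception as exc:                            # handle errors
        print("training failed: %s" % exc, file=sys.stderr)
        sys.exit(1)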
> R in some cases still has the best machine learning libraries
Would you mind naming some specific machine learning techniques that are still better in R? I've been studying machine learning and linear algebra the past few months, and I'd love to have a try at implementing one myself in Python, as a learning exercise.
Did you look in statsmodels? I appreciate the suggestions, and for a moment I was hopeful there was a real need for survival analysis models, but it looks like both that and GLMs are well covered in the latest version of statsmodels. Don't be misled by the old SourceForge site; there's been a huge flurry of recent activity in statsmodels, with hundreds of new PRs merged. Look at GitHub and the docs site linked from that repo: http://www.statsmodels.org/stable/
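For anyone who wants to verify the coverage themselves, here's a minimal GLM sketch with statsmodels; the toy data and coefficients below are mine, purely for illustration:

    # Poisson GLM with statsmodels, comparable to R's glm(..., family = poisson)
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = sm.add_constant(rng.normal(size=(100, 2)))      # intercept + 2 predictors
    y = rng.poisson(lam=np.exp(X @ np.array([0.5, 0.3, -0.2])))

    result = sm.GLM(y, X, family=sm.families.Poisson()).fit()
    print(result.summary())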
Glmnet has quite a bit of functionality that is lacking in the Python elastic net implementations. The most notable is the search over the regularization parameter sequence (alpha in statsmodels, lambda in glmnet), which glmnet fits with warm starts along the path; it works remarkably well and can be orders of magnitude faster than a traditional grid search.
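The closest Python analogue I know of is scikit-learn's ElasticNetCV, which also fits along an automatically generated alpha sequence with warm starts rather than refitting from scratch at every grid point. A minimal sketch on toy data (everything here is illustrative):

    # Elastic net along a regularization path with scikit-learn
    import numpy as np
    from sklearn.linear_model import ElasticNetCV

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 50))
    beta = np.zeros(50)
    beta[:5] = 1.0                       # sparse true signal
    y = X @ beta + rng.normal(scale=0.5, size=200)

    # n_alphas sets the length of the generated alpha sequence (glmnet's
    # lambda path); warm starts make fitting the whole path cheap.
    model = ElasticNetCV(l1_ratio=0.5, n_alphas=100, cv=5)
    model.fit(X, y)
    print(model.alpha_, model.coef_[:5])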
Last time we tried Cox PH regression from statsmodels it gave a (very) different result than R's, and we weren't comfortable with the lack of tests. Will give it another try.
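If anyone wants to rerun that comparison, the shape of the check is roughly this: fit statsmodels' PHReg and compare the coefficients against R's survival::coxph on the same data. The toy data below is made up for illustration:

    # Cox PH fit with statsmodels' PHReg; compare params against R's coxph
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 500
    x = rng.normal(size=(n, 2))
    times = rng.exponential(scale=np.exp(-(x @ np.array([0.7, -0.3]))))
    status = (rng.uniform(size=n) > 0.2).astype(int)    # 1 = event, 0 = censored

    result = sm.PHReg(times, x, status=status).fit()
    print(result.summary())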
Not the OP, but glmnet is still the standard for lassoing, and I'm not aware of a Python implementation of post-selection inference and statistical testing yet.
That said, the secret sauce in all of those is FORTRAN.
Thanks for the suggestions, I appreciate it. Alas, there are already Python implementations of all of those between scikit-learn and this project: https://github.com/selective-inference/Python-software. statsmodels has GLM capabilities, and there are even Python bindings for R's glmnet FORTRAN library.
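For the glmnet bindings, here's a sketch assuming the civisanalytics python-glmnet package, which wraps the same FORTRAN core behind a scikit-learn-style API; I'm writing the attribute names from memory, so treat them as assumptions and check the package docs:

    # Calling glmnet's FORTRAN core from Python (API assumed, verify locally)
    import numpy as np
    from glmnet import ElasticNet       # pip install glmnet

    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 20))
    y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=100)

    m = ElasticNet(alpha=0.5)           # alpha = glmnet's l1/l2 mixing parameter
    m.fit(X, y)                         # lambda path and CV handled internally
    print(m.lambda_best_, m.coef_)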
I'm not sure how well those compare to the R implementations, but they look well-built at first glance.
Python does not lack in models. It has broad enough coverage and good enough infrastructure to construct the rest if you need it. Where it lacks is everything else you need in day-to-day modeling: DoE (design of experiments), graphics, utility functions, inference, etc.
R is a glue language. The fast bits are always written in C++ or FORTRAN, or wrap BLAS/LAPACK routines.
R-vs-Python is almost never the problem in production. Interpreted-vs-compiled is almost always the issue. (I'm aware of Numba and similar efforts. Last time I tried it, it sucked on NUTS, the No-U-Turn sampler. And Theano is a rather specialized tool that most people don't actually need.)
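For context, the kind of thing people reach for Numba for is JIT-compiling a hot numerical loop; the gap between the interpreted and compiled versions is exactly the interpreted-vs-compiled issue being described. A minimal sketch, my example rather than the commenter's:

    # JIT-compiling a hot loop with Numba
    import numpy as np
    from numba import njit

    @njit
    def pairwise_sum_of_squares(x):
        total = 0.0
        for i in range(x.shape[0]):
            for j in range(i + 1, x.shape[0]):
                d = x[i] - x[j]
                total += d * d
        return total

    x = np.random.rand(2000)
    print(pairwise_sum_of_squares(x))   # first call compiles; later calls are fast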
JMHO. But I've never seen anyone dealing with truly huge data and inference problems that had the low-level bits in anything other than C++ or FORTRAN. I could imagine that Scala can do a pretty good job now, especially if you use Spark a lot. But R vs Python seems like a really stupid question. Use the one that has the libraries you need.
This. R vs. Python is a very stupid debate, especially for speed. For example, pandas is slower than data.table. There are plenty of ways to have performant models in R. If you're worried about performance in deployment, run H2O, for example; there are plenty of others. There's also flashr, if you want to write your own algorithms, which swaps R's basic operators and data management for C++.
Also, Google is doing to R what they did to JavaScript with V8. Expect GA next year.