Hacker News
Ask HN: Statistics for hackers?
99 points by haliax on Oct 25, 2009 | 31 comments
Hi again HN,

I've been trying to learn more about statistics of late, motivated by some really fantastic applications I've seen, like automated composition of music, medical models, and stock market tools.

At the moment I've been going through the book Elements of Statistical Learning, which I got from the front page a few days ago. But it's kind of slow going: without really knowing how things relate to each other, all I can do is work through it sequentially. What I want is to jump in with both feet and start writing cool code.

Does anyone know of good books or articles for someone in my situation? Or can you give me sort of a minimal spanning roadmap for what I need before I can start having some fun?

I know about basic probability theory, Bayesian text classification and hidden Markov models, but that's about all.




I can't speak highly enough of "Programming Collective Intelligence" by Toby Segaran. It isn't everything; you'd need other books, but it covers "fantastic applications" of techniques. It has really clear explanations of real-world scenarios, followed by extremely clear Python code, with a second explanation (code and maths) of each technique in the appendix. Check out the contents on Amazon.


some other helpful books:

- Data Mining, by Witten and Frank; describes the basics with rigor, including how to use Weka, which they wrote

http://www.amazon.com/Data-Mining-Practical-Techniques-Manag...

a couple java-based books from Manning:

- Collective Intelligence in Action (by Satnam Alag) and

- Algorithms of the Intelligent Web (Marmanis, Babenko)



Sounds like you're more into probability modelling and machine learning than statistics in the traditional hypothesis testing sense. Besides ESL, a book I'd recommend is Bishop's Pattern Recognition and Machine Learning. It starts from the beginning of probability theory applied to computer science problems, and covers every modern topic.

videolectures.net is filled with lectures on CS-flavored probability modelling and machine learning topics. The best bet is the multi-hour "tutorial" lecture series and minicourses; it may take a while to choose the right starting point.

For serious stats and probability without the CS flavoring (not useful for the quick-road-to-hacking-power agenda):

For classical deep stats theory, everyone I know begins with Casella and Berger's Statistical Inference. Don't expect algorithms in this, though.

On the probability side: Feller's An Introduction to Probability Theory and Its Applications. Deep, readable, sometimes funny, full of "whoa" insights. Would be hard to actually grok every chapter in both volumes, but you read it for insight into the power of probability and then use it as a reference.

(grad student in stats, among other things)


Try this:

http://www.bmj.com/collections/statsbk/index.dtl

The examples you gave make me think you've done some applied things with those specific techniques, but haven't covered the theory and related areas in depth. That's fine; the Square One series is simpler but comprehensive, so you'll be in good shape after that to investigate more on your own.


Thanks! I'll give it a shot!


Go download David MacKay's Information Theory, Inference and Learning Algorithms (free book). Go through the part on Bayes and the part on Neural Nets (and the info. theory part if you want to, which is fascinating but not as directly relevant), which is a total of roughly 20-30 chapters, some very short. Do as many exercises as you can do (i.e. try them all, fail and come back later if necessary), and try implementing those algorithms. That will get you boned up on this stuff generally.

From there:

Standard references are Hastie and Tibshirani (which you already have), Pattern Classification by Duda, Hart and Stork, and PRML by Chris Bishop (though I found it boring - too many unmotivated equations). All of Statistics and especially All of Nonparametric Statistics by Wasserman are both excellent books which will fairly rapidly get you introduced to large swaths of statistical models. Papoulis (1993) is quite a good reference on statistics in general, and Cover & Thomas is the usual reference of choice for information theory (which is very relevant to what you're interested in), but neither of those is much fun to actually read.

You seem less interested in classification/ML problems and more interested in straight-up stats and/or timeseries stuff. So some slightly deeper references:

- Given your interests you might absolutely love Kevin Murphy's PhD thesis on Dynamic Bayes Nets, which are excellent for describing phenomena in all three fields you mentioned.

- Check out Geoff Hinton's work, especially on deep belief nets (there's a Google tech talk and a lot of papers).

- Hinton and Ghahramani have a tutorial called "Parameter Estimation for Linear Dynamical Systems", which could be directly applicable to the models you're talking about

- If you're interested in these dynamic, causal models you'll want to learn about EM (which you should know already since you know HMMs), and its generalization Variational Bayes. MacKay has a terse chapter on variational inference; http://www.variational-bayes.org/vbpapers.html has more. One of those is an introductory paper by Ghahramani and some others, which is nice.

- Pretty much everything on http://videolectures.net will excite you.

Some of those references (esp. the VB stuff) can get slightly hairy in terms of the maths level required (depending on your background). Bayesian Computation with R (by Jim Albert), or Crawley's R book (for a more frequentist approach), can get you started using R, which saves you from needing to implement all this stuff yourself, as much of it is already implemented. This might be your fastest route to writing code that does cool stuff: understand what the algorithm is, use somebody else's implementation, apply it to your own problem.
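To make EM a bit more concrete before you hit the VB literature: here's a toy sketch of my own (not from any of the references above) fitting a two-component 1-D Gaussian mixture by EM in plain Python. All the names are mine; a real implementation would use better initialization and a convergence check.

```python
import math

def gauss(x, m, v):
    # Gaussian density with mean m and variance v
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def em_gmm(xs, iters=50):
    # Crude initialization: put the two means at the data extremes
    m1, m2 = min(xs), max(xs)
    v1 = v2 = 1.0
    p = 0.5  # mixing weight of component 1
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point
        r = []
        for x in xs:
            a = p * gauss(x, m1, v1)
            b = (1 - p) * gauss(x, m2, v2)
            r.append(a / (a + b))
        # M-step: re-estimate parameters from the responsibilities
        n1 = sum(r)
        n2 = len(xs) - n1
        m1 = sum(ri * x for ri, x in zip(r, xs)) / n1
        m2 = sum((1 - ri) * x for ri, x in zip(r, xs)) / n2
        v1 = max(sum(ri * (x - m1) ** 2 for ri, x in zip(r, xs)) / n1, 1e-6)
        v2 = max(sum((1 - ri) * (x - m2) ** 2 for ri, x in zip(r, xs)) / n2, 1e-6)
        p = n1 / len(xs)
    return m1, m2
```

Run it on data drawn from two well-separated clumps and the two means snap onto the clump centers within a few iterations; that alternating soft-assign / re-estimate loop is exactly the pattern Baum-Welch uses for HMMs.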


Here (perhaps for onlookers more than for your exact case) are two more favorite recommendations for free Web-based resources on what statistics is as a discipline, both of which recommend good textbooks for follow-up study:

"Advice to Mathematics Teachers on Evaluating Introductory Statistics Textbooks" by Robert W. Hayden

http://statland.org/MyPapers/MAAFIXED.PDF

"The Introductory Statistics Course: A Ptolemaic Curriculum?" by George W. Cobb

http://repositories.cdlib.org/cgi/viewcontent.cgi?article=10...

Both are excellent introductions to what statistics is as a discipline and how it is related to, but distinct from, mathematics.

A very good list of statistics textbooks appears here:

http://web.mac.com/mrmathman/MrMathMan/New_Teacher_Resources...


This isn't directed at the author of the top post, but _The Cartoon Guide to Statistics_ is actually pretty good. It's a quick read (cartoons, hey), but works well as a quick refresher, or would be enough of an intro to pick up terminology for more pointed questions. (It's also fairly cheap, libraries might have it, etc.)


If it helps, the sort of questions that interest me are: what types of sounds are pleasing to the ear, or, what rules do I have to constrain randomness within before I can generate music that sounds like _____, or (presumably with a factor model) what team is most likely to win the World Cup =P


You could try looking at existing generative music schemes. This one is classic[al]:

http://en.wikipedia.org/wiki/Musikalisches_Würfelspiel
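The whole dice-game idea fits in a few lines of Python, by the way. A sketch of my own (the note fragments below are placeholders I made up, not bars from any actual Würfelspiel): each position in the piece has a small table of interchangeable precomposed fragments, and a random roll picks one per bar.

```python
import random

# For each of 8 bars, a handful of interchangeable precomposed
# fragments (made-up note names, not from a real dice game)
bar_choices = [
    ["C4 E4 G4", "E4 G4 C5", "G4 C5 E5"],
    ["F4 A4 C5", "A4 C5 F5"],
] * 4  # 8 bars total

def roll_piece(rng=random):
    # Roll once per bar and join the chosen fragments
    return " | ".join(rng.choice(options) for options in bar_choices)

print(roll_piece())
```

Every output is "in style" by construction, because the randomness is constrained to choices the composer already vetted; that's the same move as constraining a Markov model's transitions.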


I wouldn't spend too much time on the world cup problem :-).

http://arxiv.org/abs/0909.4555


I've been pretty impressed by the O'Reilly book on statistics: http://www.amazon.com/Statistics-Nutshell-Desktop-Reference-...


I am pretty ambivalent about that book. It covers a lot of territory with pointers to deeper literature, with a clear emphasis on using these techniques with a statistics program like SPSS (or R), which is nice, but it still wastes a lot of space printing intermediate tables for simple examples that don't do much except fill pages.

But the real bummer is the editorial quality. There aren't three consecutive pages without major typographical or editorial errors, like missing parentheses in complex formulas, or cases where they obviously replaced examples with simpler ones but forgot to change the illustrations along with the text.


This is a site that's really good for beginners:

http://www.statisticshowto.com

And it's really relevant here because the approach to everything is step by step -- the author of the site probably doesn't realize it, but the tutorial steps practically read like pseudocode... seems like it could really help you.

Also, there are some calculators on there, and I've seen the code, which isn't bad, and it's not obfuscated so if you want to get an idea of how to implement something, you can just look at the source directly.


One of the things EoSL emphasizes is that simple methods can often give very good results. If you have been through the first few chapters of the book, you should be able to do some cool stuff with nearest neighbors or linear classifiers. (The reality is that on most problems, fancy methods give only a slight improvement over simple ones, especially when the simple methods are implemented by someone really skilled at getting good performance out of them.)

Nearest neighbors methods can be implemented in something like 3 lines, so you have no excuse!
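To back up the "3 lines" claim, here's a 1-nearest-neighbor classifier in plain Python (my own sketch, not code from the book; `nearest_neighbor` is just a name I picked):

```python
def nearest_neighbor(train, x):
    # train: list of (point, label) pairs; x: query point (tuple of floats)
    # Squared Euclidean distance is enough, since we only compare distances
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(train, key=lambda pl: dist(pl[0], x))[1]
```

That's the whole classifier: no training step at all, just "find the closest labeled point and copy its label." Extending it to k-NN is a `sorted(...)[:k]` plus a majority vote.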


Hidden Markov models are serious stuff. If you already understand them, leverage that power and go build an application! No point in learning the theory if you aren't going to apply it.


I've built some text generators, and a POS tagger. But for much of the more interesting stuff I want to do, it seems like I need things like time-series or regression models -- which elude me entirely at the moment.


The trick for me is to learn a good foundation in statistics but to know that you don't need to learn everything.

A friend who founded a startup that makes heavy use of statistics likes to trawl academic papers for algorithms that help his business.

Think of research papers a bit like a well-documented private object/library: you know what data it accepts, you know what it returns, but you don't need to know how it works.

Just make sure your code reflects exactly the formulae/model documented in the paper and you're good.
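As an example of how directly a textbook formula translates: the ordinary least squares fit for a line y = a + b*x has a closed form, and the code is nearly a transcription of it (a plain-Python sketch of my own; `fit_line` is just a name I picked):

```python
def fit_line(xs, ys):
    # Closed-form OLS for y = a + b*x:
    #   b = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
    #   a = mean_y - b * mean_x
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b
```

If your code matches the formula term by term like this, checking it against the paper is mechanical; the part that requires judgment is whether your data satisfies the paper's assumptions.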


And the assumptions in the paper.


I'm in a similar situation, as I'm trying to decide whether or not to minor in statistics on top of computer engineering. From a hacker's perspective, I've found that playing around with the language R is the best way to relate to the field. Check out the tutorial R for Programmers linked from here: http://heather.cs.ucdavis.edu/~matloff/r.html



As far as application of statistics to automated composition goes, definitely check out the EMI project over at UC Santa Cruz.


iTunesU has some courses for stats, but they may not be suitable for learning in a hurry.

-jc


You should write a book, statistics for hackers. I would buy it. Make sure it explains things really well. The best person to teach someone is a beginner, because they understand the beginner's perspective. So you are in a unique position to create this.

O'Reilly's Statistics in a Nutshell is a good reference* book, but not quite a textbook. Here you go. Including my refid. http://www.amazon.com/gp/product/0596510497?ie=UTF8&tag=...

*True masters have beginner perspectives, so they are good teachers as well.


I'm tempted. Though might a blog be better? Also, it might be nice to have someone to do it with, as then I could get explanations for things that befoggle me.


I've enjoyed blogging about statistical programming (http://incanter-blog.org), it's been a great way to dig into subjects that interest me.


I'd like to learn the same things as you - interested in doing so collaboratively? I am also interested in exploring online collaborative learning methods, so this appeals to me for both reasons. Find my email in my profile and get in touch if you're interested, I have a few ideas on how this could benefit us both.


It'd be better to have it with a table of contents. You could probably get sufficient explanations through a combination of research on wikipedia and then probing #math on freenode.


Just out of interest, how long does it generally take for a question to get answered on #math? Does it vary quite a lot, or is it fairly stable (assuming someone can actually answer it)?


I typically get answers immediately, but I've only asked questions up to college algebra.


That's a lot, actually.



