Hacker News
Ask HN: Statistics for hackers?
99 points by haliax on Oct 25, 2009 | 31 comments
Hi again HN,

I've been trying to learn more about statistics of late, motivated by some really fantastic applications I've seen, like automated composition of music, medical models, and stock market tools.

At the moment I've been going through the book Elements of Statistical Learning, which I got from the front page a few days ago. But it's kind of slow going: without really knowing how things relate to each other, all I can do is work through it sequentially. What I want is to jump in with both feet and start writing cool code.

Does anyone know of good books or articles for someone in my situation? Or can you give me sort of a minimal spanning roadmap for what I need before I can start having some fun?

I know about basic probability theory, Bayesian text classification and hidden Markov models, but that's about all.




I can't speak highly enough of "Programming Collective Intelligence" by Toby Segaran. It isn't everything; you'd need other books, but it covers "fantastic applications" of techniques. It has really clear explanations of real-world scenarios, followed by extremely clear Python code, with a second explanation (code and maths) of each technique in the appendix. Check out the contents on Amazon.


some other helpful books:

- Data Mining, by Witten and Frank; describes the basics with rigor, including how to use Weka, which they wrote

http://www.amazon.com/Data-Mining-Practical-Techniques-Manag...

a couple java-based books from Manning:

- Collective Intelligence in Action (by Satnam Alag) and

- Algorithms of the Intelligent Web (Marmanis, Babenko)



Sounds like you're more into probability modelling and machine learning than statistics in the traditional hypothesis testing sense. Besides ESL, a book I'd recommend is Bishop's Pattern Recognition and Machine Learning. It starts from the beginning of probability theory applied to computer science problems, and covers every modern topic.

videolectures.net is filled with lectures on CS-flavored probability modelling and machine learning topics. The best bet is the multi-hour "tutorial" lecture series and minicourses; it may take a while to choose the right starting point.

For serious stats and probability without the CS flavoring (not useful for the quick-road-to-hacking-power agenda):

For classical deep stats theory, everyone I know begins with Casella and Berger's Statistical Inference. Don't expect algorithms in this, though.

On the probability side: Feller's An Introduction to Probability Theory and Its Applications. Deep, readable, sometimes funny, full of "whoa" insights. Would be hard to actually grok every chapter in both volumes, but you read it for insight into the power of probability and then use it as a reference.

(grad student in stats, among other things)


Try this:

http://www.bmj.com/collections/statsbk/index.dtl

The examples you gave make me think you've done some applied things with those specific techniques, but haven't covered the theory and related areas in depth. That's fine; the Square One series is simpler but comprehensive, so you'll be in good shape after that to investigate more on your own.


Thanks! I'll give it a shot!


Go download David MacKay's Information Theory, Inference and Learning Algorithms (free book). Go through the part on Bayes and the part on Neural Nets (and the info. theory part if you want to, which is fascinating but not as directly relevant), which is a total of roughly 20-30 chapters, some very short. Do as many exercises as you can do (i.e. try them all, fail and come back later if necessary), and try implementing those algorithms. That will get you boned up on this stuff generally.

From there:

Standard references are Hastie and Tibshirani (which you already have), Pattern Classification by Duda, Hart and Stork, and PRML by Chris Bishop (though I found it boring - too many unmotivated equations). All of Statistics and especially All of Nonparametric Statistics by Wasserman are both excellent books which will fairly rapidly get you introduced to large swaths of statistical models. Papoulis (1993) is quite a good reference on statistics in general, and Cover & Thomas is the usual reference of choice for information theory (which is very relevant to what you're interested in), but neither of those is much fun to actually read.

You seem less interested in classification/ML problems and more interested in straight-up stats and/or timeseries stuff. So some slightly deeper references:

- Given your interests you might absolutely love Kevin Murphy's PhD thesis on Dynamic Bayes Nets, which are excellent for describing phenomena in all three fields you mentioned.

- Check out Geoff Hinton's work, especially on deep belief nets (there's a Google tech talk and a lot of papers).

- Hinton and Ghahramani have a tutorial called "Parameter Estimation for Linear Dynamical Systems", which could be directly applicable to the models you're talking about

- If you're interested in these dynamic, causal models you'll want to learn about EM (which you should know already since you know HMMs), and its generalization Variational Bayes. MacKay has a terse chapter on variational inference; http://www.variational-bayes.org/vbpapers.html has more. One of those is an introductory paper by Ghahramani and some others, which is nice.

- Pretty much everything on http://videolectures.net will excite you.

Some of those references (esp. the VB stuff) can get slightly hairy in terms of the maths level required (depending on your background). Bayesian Computation with R (by Jim Albert), or Crawley's R book (for a more frequentist approach), can get you started using R, which saves you from needing to implement all this stuff yourself, as much of it is already implemented. This might be your fastest route to writing code that does cool stuff: understand what the algorithm is, use somebody else's implementation, apply it to your own problem.
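To make EM a bit more concrete before you hit the VB literature: here's a toy sketch of my own (not from any of the references above) fitting a two-component 1-D Gaussian mixture by EM in plain Python. All the names are mine; a real implementation would use better initialization and a convergence check.

```python
import math

def gauss(x, m, v):
    # Gaussian density with mean m and variance v
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def em_gmm(xs, iters=50):
    # Crude initialization: put the two means at the data extremes
    m1, m2 = min(xs), max(xs)
    v1 = v2 = 1.0
    p = 0.5  # mixing weight of component 1
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point
        r = []
        for x in xs:
            a = p * gauss(x, m1, v1)
            b = (1 - p) * gauss(x, m2, v2)
            r.append(a / (a + b))
        # M-step: re-estimate parameters from the responsibilities
        n1 = sum(r)
        n2 = len(xs) - n1
        m1 = sum(ri * x for ri, x in zip(r, xs)) / n1
        m2 = sum((1 - ri) * x for ri, x in zip(r, xs)) / n2
        v1 = max(sum(ri * (x - m1) ** 2 for ri, x in zip(r, xs)) / n1, 1e-6)
        v2 = max(sum((1 - ri) * (x - m2) ** 2 for ri, x in zip(r, xs)) / n2, 1e-6)
        p = n1 / len(xs)
    return m1, m2
```

Run it on data drawn from two well-separated clumps and the two means snap onto the clump centers within a few iterations; that alternating soft-assign / re-estimate loop is exactly the pattern Baum-Welch uses for HMMs.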


Here (perhaps for onlookers more than for your exact case) are two more favorite recommendations for free Web-based resources on what statistics is as a discipline, both of which recommend good textbooks for follow-up study:

"Advice to Mathematics Teachers on Evaluating Introductory Statistics Textbooks" by Robert W. Hayden

http://statland.org/MyPapers/MAAFIXED.PDF

"The Introductory Statistics Course: A Ptolemaic Curriculum?" by George W. Cobb

http://repositories.cdlib.org/cgi/viewcontent.cgi?article=10...

Both are excellent introductions to what statistics is as a discipline and how it is related to, but distinct from, mathematics.

A very good list of statistics textbooks appears here:

http://web.mac.com/mrmathman/MrMathMan/New_Teacher_Resources...


This isn't directed at the author of the top post, but _The Cartoon Guide to Statistics_ is actually pretty good. It's a quick read (cartoons, hey), but works well as a quick refresher, or would be enough of an intro to pick up terminology for more pointed questions. (It's also fairly cheap, libraries might have it, etc.)


If it helps, the sort of questions that interest me are: what types of sounds are pleasing to the ear, or, what rules do I have to constrain randomness within before I can generate music that sounds like _____, or (presumably with a factor model) what team is most likely to win the World Cup =P


You could try looking at existing generative music schemes. This one is classic[al]:

http://en.wikipedia.org/wiki/Musikalisches_Würfelspiel
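The whole dice-game idea fits in a few lines of Python, by the way. A sketch of my own (the note fragments below are placeholders I made up, not bars from any actual Würfelspiel): each position in the piece has a small table of interchangeable precomposed fragments, and a random roll picks one per bar.

```python
import random

# For each of 8 bars, a handful of interchangeable precomposed
# fragments (made-up note names, not from a real dice game)
bar_choices = [
    ["C4 E4 G4", "E4 G4 C5", "G4 C5 E5"],
    ["F4 A4 C5", "A4 C5 F5"],
] * 4  # 8 bars total

def roll_piece(rng=random):
    # Roll once per bar and join the chosen fragments
    return " | ".join(rng.choice(options) for options in bar_choices)

print(roll_piece())
```

Every output is "in style" by construction, because the randomness is constrained to choices the composer already vetted; that's the same move as constraining a Markov model's transitions.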


I wouldn't spend too much time on the world cup problem :-).

http://arxiv.org/abs/0909.4555


I've been pretty impressed by the O'Reilly book on statistics: http://www.amazon.com/Statistics-Nutshell-Desktop-Reference-...


I am pretty ambivalent about that book. It covers a lot of territory with pointers to deeper literature, with a clear emphasis on using these techniques with a statistics program like SPSS (or R), which is nice, but it still wastes a lot of space printing intermediate tables for simple examples that don't do much except fill pages.

But the real bummer is the editorial quality. There aren't three consecutive pages without major typographical or editorial errors, like missing parentheses in complex formulas, or cases where they obviously replaced examples with simpler ones but forgot to change the illustrations along with the text.


This is a site that's really good for beginners:

http://www.statisticshowto.com

And it's really relevant here because the approach to everything is step by step -- the author of the site probably doesn't realize it, but the tutorial steps practically read like pseudocode... seems like it could really help you.

Also, there are some calculators on there, and I've seen the code, which isn't bad, and it's not obfuscated so if you want to get an idea of how to implement something, you can just look at the source directly.


One of the things EoSL emphasizes is that simple methods can often give very good results. If you have been through the first few chapters of the book, you should be able to do some cool stuff with nearest neighbors or linear classifiers. (The reality is that on most problems, fancy methods give only a slight improvement over simple ones, especially when the simple methods are implemented by someone really skilled at getting good performance out of them.)

Nearest neighbors methods can be implemented in something like 3 lines, so you have no excuse!
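To back up the "3 lines" claim, here's a 1-nearest-neighbor classifier in plain Python (my own sketch, not code from the book; `nearest_neighbor` is just a name I picked):

```python
def nearest_neighbor(train, x):
    # train: list of (point, label) pairs; x: query point (tuple of floats)
    # Squared Euclidean distance is enough, since we only compare distances
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(train, key=lambda pl: dist(pl[0], x))[1]
```

That's the whole classifier: no training step at all, just "find the closest labeled point and copy its label." Extending it to k-NN is a `sorted(...)[:k]` plus a majority vote.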


Hidden Markov models are serious stuff. If you already understand them, leverage that power and go build an application! No point in learning the theory if you aren't going to apply it.


I've built some text generators, and a POS tagger. But for much of the more interesting stuff I want to do, it seems like I need things like time-series or regression models -- which elude me entirely at the moment.


The trick for me is to learn a good foundation in statistics but to know that you don't need to learn everything.

A friend who founded a startup that makes heavy use of statistics likes to trawl academic papers for algorithms that help his business.

Think of research papers a bit like a well-documented private object/library: you know what data it accepts, you know what it returns, but you don't need to know how it works.

Just make sure your code reflects exactly the formulae/model documented in the paper and you're good.
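As an example of how directly a textbook formula translates: the ordinary least squares fit for a line y = a + b*x has a closed form, and the code is nearly a transcription of it (a plain-Python sketch of my own; `fit_line` is just a name I picked):

```python
def fit_line(xs, ys):
    # Closed-form OLS for y = a + b*x:
    #   b = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
    #   a = mean_y - b * mean_x
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b
```

If your code matches the formula term by term like this, checking it against the paper is mechanical; the part that requires judgment is whether your data satisfies the paper's assumptions.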


And the assumptions in the paper.


I'm in a similar situation, as I'm trying to decide whether or not to minor in statistics on top of computer engineering. From a hacker's perspective, I've found that playing around with the language R is the best way to relate to the field. Check out the tutorial R for Programmers linked from here: http://heather.cs.ucdavis.edu/~matloff/r.html



As far as application of statistics to automated composition goes, definitely check out the EMI project over at UC Santa Cruz.


iTunesU has some courses for stats, but they may not be suitable for learning in a hurry.

-jc


You should write a book, statistics for hackers. I would buy it. Make sure it explains things really well. The best person to teach someone is a beginner, because they understand the beginner's perspective. So you are in a unique position to create this.

O'Reilly's Statistics in a Nutshell is a good reference* book, but not quite a textbook. Here you go. Including my refid. http://www.amazon.com/gp/product/0596510497?ie=UTF8&tag=...

*True masters have beginner perspectives, so they are good teachers as well.


I'm tempted. Though might a blog be better? Also, it might be nice to have someone to do it with, as then I could get explanations for things that befoggle me.


I've enjoyed blogging about statistical programming (http://incanter-blog.org), it's been a great way to dig into subjects that interest me.


I'd like to learn the same things as you - interested in doing so collaboratively? I am also interested in exploring online collaborative learning methods, so this appeals to me for both reasons. Find my email in my profile and get in touch if you're interested, I have a few ideas on how this could benefit us both.


It'd be better to have it with a table of contents. You could probably get sufficient explanations through a combination of research on wikipedia and then probing #math on freenode.


Just out of interest, how long does it generally take for a question to get answered on #math? Does it vary quite a lot, or is it fairly stable (assuming someone can actually answer it)?


I typically get answers immediately, but I've only asked questions up to college algebra.


That's a lot, actually.



