I've heard this too and it's a great way to demonstrate you don't really know what statistics is :)

Statistics is not (just) opinion polling, there's a lot more to it than estimating observable properties of a population.

If you're trying to make decisions, predictions or estimates that involve any uncertainty at all (and in my experience, work with big data almost always does), then it's definitely within the purview of statistics, even if you have data for the whole population.

Sources of uncertainty include trying to say anything at all about the future (do you have data on the future population? No, didn't think so...), trying to make predictions that generalise to new data, and trying to uncover underlying trends or patterns behind the data you see which aren't directly or fully observed.

Often people expect big data to be able to answer big numbers of questions, estimate big numbers of quantities, or fit big, powerful predictive models with lots of parameters. In these cases statistics can be particularly important, amongst other reasons, to avoid reporting false positives and to make sure you can quantify how certain you are about your results and your predictions.
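
To put rough numbers on the false-positive point, here's a minimal sketch. It assumes the tests are independent and that every null hypothesis is actually true, so any "significant" result is a false positive. The function names are made up for illustration, and the Bonferroni correction is just one conservative option I've picked to show the contrast, not the only way to handle this.

```python
import random


def chance_of_any_false_positive(num_tests, alpha):
    """P(at least one false positive) across independent tests of true nulls."""
    return 1 - (1 - alpha) ** num_tests


def simulate_false_positives(num_tests, alpha, seed=0):
    """Count 'significant' results when every null hypothesis is true.

    Assumption: under a true null a well-calibrated p-value is uniform
    on [0, 1], so we draw p-values directly instead of running real tests.
    """
    rng = random.Random(seed)
    p_values = [rng.random() for _ in range(num_tests)]
    # Naive: compare every p-value to alpha with no correction.
    naive = sum(p < alpha for p in p_values)
    # Bonferroni: divide alpha by the number of tests.
    bonferroni = sum(p < alpha / num_tests for p in p_values)
    return naive, bonferroni


if __name__ == "__main__":
    # With 100 tests at alpha = 0.05 this is about 0.994.
    print(chance_of_any_false_positive(100, 0.05))
    # Roughly 5% of 1000 true-null tests look "significant" without correction.
    print(simulate_false_positives(1000, 0.05))
```

So if you ask a big dataset a hundred questions at the usual 5% threshold, you should expect to "find" something even when there is nothing there.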




Not to mention: having all the data, and comprehending all the rows on an individual level, are two very different things. Doubly so if the data is irregular (I'm currently doing fuzzy matching on really mangled street address data. ICK).

Once you hit millions of rows, it's not humanly possible to survey the data. All you can do is make assertions about the data's structure or the buckets it will fall into. You then try to disprove that assertion, or establish error bounds on it. You will never see all the data, only the results of assumptions you've made about it.
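
A rough sketch of what "establish error bounds" can look like in practice: check the assertion on a random sample of rows and put a confidence interval on the violation rate. Everything here is illustrative. The helper names are hypothetical, `holds` stands in for whatever assertion you're making about a row, and the Wilson score interval with z = 1.96 (roughly 95%) is my choice of interval, one of several reasonable ones.

```python
import math
import random


def wilson_interval(failures, n, z=1.96):
    """Approximate 95% Wilson score interval for a failure proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


def estimate_violation_rate(rows, holds, sample_size, seed=0):
    """Check an assertion on a random sample and bound how often it fails.

    rows: any sequence (a list, or a range of row ids).
    holds: hypothetical predicate, True when a row satisfies the assertion.
    Returns (observed rate, lower bound, upper bound).
    """
    rng = random.Random(seed)
    sample = rng.sample(rows, sample_size)
    failures = sum(1 for row in sample if not holds(row))
    low, high = wilson_interval(failures, sample_size)
    return failures / sample_size, low, high


if __name__ == "__main__":
    # Toy stand-in for a big table: row ids, with every tenth row "bad".
    row_ids = range(1_000_000)
    print(estimate_violation_rate(row_ids, lambda r: r % 10 != 0, 1000))
```

Note that even zero violations in the sample doesn't mean zero in the data: with 100 clean sampled rows, the upper bound on the violation rate is still around 3.7%.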



