That's the point of the comment. Modern data science tutorials assume the source...

alexhutcheson · on Dec 20, 2018

Hadley Wickham's book "R for Data Science"[1] does a good job emphasizing the data cleaning and reshaping steps in the analysis process.

[1] https://r4ds.had.co.nz/

tfehring · on Dec 20, 2018

>Modern data science tutorials assume the source data is a nice and tidy CSV

For what it's worth, I don't really see this as a problem in and of itself. Of course, if you want to do data science in the real world, you need to learn about data cleaning, manipulation, and warehousing in addition to the "pure" data science process that begins with tidy data. But it's good that tutorials segment these out, since not everyone needs to learn both at the same time. Anecdotally, I had plenty of real-world experience dealing with messy data by the time I started learning data science.

As an aside, my impression is that technical assessments for data science positions tend to underemphasize data cleaning and manipulation. Granted, they're still really time-consuming as it is, but there's probably some room for optimization there.

minimaxir · on Dec 20, 2018

> Of course, if you want to do data science in the real world, you need to learn about data cleaning, manipulation, and warehousing in addition to the "pure" data science process that begins with tidy data.

True, although ideally the data is cleaned/structured before it hits a data warehouse and the data scientist starts working with it. It's an iterative process (DS finds messy data, flags the problem upstream, repeat).

I do wish there were more tutorials with messy data; in the ones I make, I deliberately try to use complex, highly relational datasets, although as a result the tutorials are more complicated and more imposing for beginners.

> As an aside, my impression is that technical assessments for data science positions tend to underemphasize data cleaning and manipulation.

When I was interviewing for DS jobs, I got a lot of implement-binary-search-on-a-whiteboard questions, which annoyed the hell out of me. But the take-home assignments required some data ETL, which I felt was more representative. (Some assignments had deliberately flawed data points that the candidate was expected to identify and remove without being explicitly told to do so.)
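
To make the "deliberately flawed data points" idea concrete, here is a minimal sketch of that kind of check using only the Python standard library. The column names, the sample rows, and the validity rules (age between 0 and 120, ISO-formatted dates) are all made up for illustration; they are not taken from any actual assignment.

```python
import csv
import io
from datetime import datetime

# Hypothetical take-home data: columns and values are invented for this sketch.
RAW = """user_id,age,signup_date
1,34,2018-03-01
2,-5,2018-03-02
3,,2018-03-04
4,29,2018-13-40
5,41,2018-03-07
"""


def is_valid(row):
    """Return True if the row passes basic sanity checks (assumed rules)."""
    try:
        age = int(row["age"])  # fails on missing or non-numeric ages
        datetime.strptime(row["signup_date"], "%Y-%m-%d")  # fails on impossible dates
    except ValueError:
        return False
    return 0 <= age <= 120  # assumed plausible age range


def clean(text):
    """Split parsed CSV rows into (kept, dropped) lists."""
    rows = list(csv.DictReader(io.StringIO(text)))
    kept = [r for r in rows if is_valid(r)]
    dropped = [r for r in rows if not is_valid(r)]
    return kept, dropped


kept, dropped = clean(RAW)
print("kept:", [r["user_id"] for r in kept])
print("dropped:", [r["user_id"] for r in dropped])
```

Returning the dropped rows alongside the kept ones is a deliberate choice: in a take-home, being able to say which rows were removed and why is usually as important as the cleaned result.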