He kind of already did that -- designed and wrote git in a month.

tester34 · on July 1, 2021

I don't think I'd call git a "data (science) engineering tool".

test_epsilon · on July 1, 2021

Okay, but I didn't say git was a data engineering tool. Designing git was very much an exercise in data engineering, one that pushed the boundaries of existing concepts so far that it turned into a multi-billion-dollar industry within a decade and has displaced almost all competitors (certainly by volume), many of which had been entrenched for decades.

From Wikipedia:

"data engineering, is a software engineering approach to designing and developing information systems."

"An information system (IS) is a formal, sociotechnical, organizational system designed to collect, process, store, and distribute information."

So what would you call it?

tester34 · on July 1, 2021

Disclaimer: I have no experience at all with data-ish stuff.

>"data engineering, is a software engineering approach to designing and developing information systems."

This description is so broad that every SE would be a data engineer now?

For me git is just text comparison, management + some kind of visualisation tool (so it kinda sounds like data X stuff, doesn't it? I'll get back to it below).

Meanwhile, data X jobs seem to be more focused on:

- understanding the business domain
- using e.g. statistical methods
- working with huge data collections with various tools like Hadoop, Spark and fancy keywords like data lake, warehouse

>Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data,[1][2] and apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to data mining, machine learning and big data.

>Data science is a "concept to unify statistics, data analysis, informatics, and their related methods" in order to "understand and analyze actual phenomena" with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge.

I feel like git doesn't fit this description; I'd say it's more like a handy tool that handles snapshots of some text.

_______

Or maybe I'm misunderstanding the relation between "data engineer" and "data science"?

test_epsilon · on July 2, 2021

> for me git is just text comparison, management + some kind of visualisation tool

That's not just what git is.
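
To make that a bit more concrete, here is a minimal sketch of one piece of git's storage model: each file version is stored as a "blob" object whose ID is a hash of a small header plus the content, so identical content gets the same ID no matter where or when it is committed. This assumes the default SHA-1 object format (repositories can also be created with SHA-256), and the function name `git_blob_id` is made up for illustration; it is not part of git or of any library.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object ID git assigns to a blob with this content.

    Assumes the default SHA-1 object format. The hashed bytes are the
    header "blob <size>\\0" followed by the raw file content.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # Identical content maps to the same ID, so it only needs storing once.
    first = git_blob_id(b"same text\n")
    second = git_blob_id(b"same text\n")
    print(first == second)

    # Any change to the content produces a different ID.
    print(git_blob_id(b"same text\n") != git_blob_id(b"other text\n"))
```

If you have git installed, `git hash-object <file>` should print the same ID for a file with the same bytes. Trees and commits are hashed objects too, which is how a commit ends up identifying a whole snapshot rather than a pile of text diffs.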

> meanwhile data X jobs seem to be more focused on:

> understanding business domain

git is absolutely about that. The technical aspects of git (which themselves are revolutionary data engineering innovations) are entirely driven by the problem domain: the people and their processes. It wasn't just developed out of thin air, with people deciding by pure coincidence that they liked it and starting to use it!

> working with huge data collections with various tools like hadoop, spark and fancy keywords like data lake, warehouse [etc]

That's not an adequate definition of a data engineer.

> [data science]

Yeah, we're talking about data engineering. Although you could say there was most probably data science work behind git, to find its performance and storage requirements and optimize for them, just not so much of the buzzword type of data science.