Author here: yeah, I ended up leaning a bit too heavily on dbt as a basis, although the method I'm describing in the article can definitely be applied to any kind of modeling framework (or lack of one).
Anyway, please let me know what you think of the general method and the bottom-up VS top-down approach ;-)
Good detailed breakdown of how to use dbt, but I'm not sure about the data modelling part (the specific approach suggested), as it seems possibly contradictory:
> In contrast, the approach I want to outline here is primarily bottom-up, in order to:
- deliver value and insights as quickly as possible
- not pile up too much (data modeling) technical debt along the way.
> The goal will be to aim towards either:
- a dimensional model/star schema
- or a limited collection of "One Big Tables"
(or any other top-down theory you’re most comfortable with...)
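For concreteness, the two target shapes quoted above can be sketched side by side. This is a hedged, minimal illustration using SQLite and made-up table names (`dim_customer`, `fct_orders`, `obt_orders` are invented for the example, not taken from the article): a star schema is a narrow fact table surrounded by descriptive dimensions, while a "One Big Table" is the same information denormalized into a single wide table, typically by joining the star flat.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Star schema: a narrow fact table plus descriptive dimension tables.
cur.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fct_orders   (order_id INTEGER PRIMARY KEY,
                           customer_id INTEGER REFERENCES dim_customer(customer_id),
                           product_id  INTEGER REFERENCES dim_product(product_id),
                           amount REAL);
INSERT INTO dim_customer VALUES (1, 'Acme', 'FR');
INSERT INTO dim_product  VALUES (1, 'Widget', 'Hardware');
INSERT INTO fct_orders   VALUES (100, 1, 1, 42.0);
""")

# "One Big Table": the same facts denormalized into one wide table,
# produced here simply by joining the star schema flat.
cur.execute("""
CREATE TABLE obt_orders AS
SELECT o.order_id, o.amount,
       c.name AS customer_name, c.country,
       p.name AS product_name, p.category
FROM fct_orders o
JOIN dim_customer c USING (customer_id)
JOIN dim_product  p USING (product_id)
""")

row = cur.execute(
    "SELECT customer_name, product_name, amount FROM obt_orders"
).fetchone()
print(row)  # -> ('Acme', 'Widget', 42.0)
```

Either shape answers the same questions; the trade-off the thread is debating is upkeep (many small joins vs. one wide, brittle table), not query capability.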
I get (and agree) that you want to start off organically and get things moving with some integrations. But I quite strongly feel that with the suggested approach (which feels to me like: just start with the dbt part, don't worry about the final design part), especially if you don't decide _up front_ what capabilities you, your team, and your client have, and choose an approach based on that, you'll likely end up with a technical-debt mess.
My suggestion would be, unless otherwise advised, to just adopt the dimensional model, put that in the hands of your users early, and start getting feedback. In the meantime you can do exactly what is described, but with a feedback loop kicked off much earlier than you would have otherwise.
edit - maybe contradictory is not correct, the suggestion stands though :)
I have nothing against dimensional modeling per se, just the upfront effort (and initial feedback lag) that it generally brings with it in the beginning.
Another issue I've observed with teams trying to formalize their modeling "too" soon is confusion in the models they create. I find that it's sometimes difficult for teams to understand what should go in each model, how to separate the different dimensions, etc. Hence my emphasis on starting naively in the beginning and then aiming for that dimensional model once things start becoming clearer...
The outcomes are a mixed bag, but my biggest failure was on a project that started naively with OBT while waiting for things to become clearer. It did indeed soon become clear that we should have taken a more methodical approach, as we had the dual problem of huge dependencies and a very brittle data model!
(I don't think there is an easy answer here btw, I am hoping for someone to describe a detailed and relatively generalisable playbook, hopefully soon!)
Meh, depends on the data teams (and their chosen suite of tools)... Or on the Excel users
Both sides can either hack together horrible work-arounds (a matter of "when you have a hammer, everything looks like a nail...") as well as brilliantly thought through solutions.
Each tool should be used for its best use cases, not bent into what it wasn't designed for!
IMHO spreadsheets excel at letting you intuitively manipulate data ON the data itself.
While "modern data" tools (especially dbt) try to convert data teams to developer best practices... at the expense of less intuitive/direct manipulation of the data.
That being said, I think there are also things we could explore in that space: how to make the modern data stack more intuitive?
> how to make the modern data stack more intuitive?
I'd start with getting people to learn relational data modeling and SQL [1][2] at a deep level. Stop reaching for python and pandas/spark for every basic data manipulation task or query. Stop adding in layers of Airflow/Dagster/Prefect when a simple cron would work. Stop adding in Kubernetes/GKE/Fargate to manage the aforementioned. Stop moving data between systems constantly (meltano, airbyte, fivetran) when you already have it in a perfectly good place. Stop with the toxic positivity that's completely overflowing the modern-data world and all these bullshit VC-infused startups who are convinced they need every single element and more.
99% of business needs can be satisfied by a single Postgres/Mysql installation and a halfway-competent person armed with SQL and an understanding of normalization. Reach for Excel when you need to do more "hands-on" analysis, business modeling, charting, basic forecasting, and presentation for non-technical users.
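The "simple cron would work" point above can be as small as one line. A hedged sketch (the paths, database name, and script are made up for illustration): a single linear nightly refresh driven by a SQL file needs no orchestrator at all.

```shell
# Hypothetical crontab entry: rebuild the reporting tables in Postgres
# every night at 02:00, logging output; no Airflow/Dagster/Prefect needed
# for one linear job with no cross-task dependencies.
0 2 * * * psql -d warehouse -f /opt/etl/refresh_reporting.sql >> /var/log/etl.log 2>&1
```

Once you genuinely need retries, backfills, or a dependency graph between dozens of jobs, that's the point where an orchestrator starts paying for its complexity.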
I definitely agree with the over-hype making simple mundane tasks way harder than they should be.
So yeah, do NOT over-engineer!!
But, on the other hand, doing everything with a single Postgres and spreadsheets seems to go with the hammer-to-nail adage.
And all too often, you end up with unmaintainable duct-taped hack-arounds...
Which is clearly NOT better (nor necessarily worse) than the over-engineered solution.
In some cases (maybe not 1% but clearly not the majority either), it does make sense to look at other tools that might be available.
That being said, there are waaaay too many options to filter through, because of that darn hype bubble.