There is none. The industry is being flooded with DS and "AI" majors (and other generally non-technical people) who have zero historical context on storage and database systems - and so everything needs to be reinvented (but in Python this time) and rebranded. At the end of the day you're simply looking at different mixtures of relational databases, key-value stores, graph databases, caches, time-series databases, column stores, etc. The same stuff we've had for 50+ years.
Two main differences - ability to time travel for training data generation and the ability to push compute to the write side of the view rather than the read side for low latency feature serving.
There's a lot more to it than snapshots or timestamped columns when it comes to ML training data generation. We often have windowed aggregations that need to be computed as of precise intra-day timestamps in order to achieve parity between training data (backfilled in batch) and the data that is being served online in realtime (with streaming aggregations being computed in realtime).
Standard OLAP solutions right now are really good at "What's the X day sum of this column as of this timestamp", but when every row of your training data has a precise intra-day timestamp that you need windowed aggregations to be accurate as-of, this is a different challenge.
And when you have many people sharing these aggregations, but with potentially different timestamps/timelines, you also want them sharing partial aggregations where possible for efficiency.
All of this is well beyond the scope that is addressed by standard OLAP data solutions.
Not to mention the fact that the offline computation needs to translate seamlessly to power online serving (i.e. seeding feature values, and combining with streaming realtime aggregations), and the need for online/offline consistency measurement.
That's why a lot of teams don't even bother with this, and basically just log their feature values from online to offline. But this limits what kind of data they can use, and also how quickly they can iterate on new features (need to wait for enough log data to accumulate before you can train).
> Standard OLAP solutions right now are really good at "What's the X day sum of this column as of this timestamp", but when every row of your training data has a precise intra-day timestamp that you need windowed aggregations to be accurate as-of, this is a different challenge.
As long as your OLAP table/projection/materialized view is sorted/clustered by that timestamp, it will be able to efficiently pick only the data in that interval for your query, regardless of the precision you need.
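The pruning described here can be sketched in miniature. With timestamps kept in sorted order, an as-of window query only touches a contiguous slice of rows, found by binary search - the in-memory analogue of an OLAP engine skipping blocks of a table clustered by timestamp. This is a hypothetical helper, not any particular engine's API:

```python
import bisect

def rows_in_window(sorted_ts, as_of, window):
    """Return the (lo, hi) index slice of rows whose timestamp falls in
    (as_of - window, as_of], using binary search on the sort order.
    Mimics partition/block pruning on a timestamp-clustered table."""
    lo = bisect.bisect_right(sorted_ts, as_of - window)
    hi = bisect.bisect_right(sorted_ts, as_of)
    return lo, hi
```

Only the `hi - lo` rows in the slice are ever read, regardless of how fine-grained the as-of timestamp is.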
> And when you have many people sharing these aggregations, but with potentially different timestamps/timelines, you also want them sharing partial aggregations where possibly for efficiency.
> All of this is well beyond the scope that is addressed by standard OLAP data solutions.
I think the StarRocks open-source OLAP DB supports this via a query-rewrite mechanism that optimizes performance by using data from materialized views. It can build UNION queries to handle date ranges [1].
I’m still not seeing how this is a novel problem. You just apply a filter to your timestamp column and re-run the window function. It will give you the same value down to the resolution of the timestamp every time.
Let's try an example: `average page views in the last 1, 7, 30, 60, 180 days`
You need these values accurate as of ~500k timestamps for 10k different page ids, with significant skew for some page ids.
So you have a "left" table with 500k rows, each with a page id and timestamp. Then you have a `page_views` table with many millions/billions/whatever rows that need to be aggregated.
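To make the point-in-time join semantics concrete, here is a brute-force sketch (illustrative names, not Chronon's implementation): for each left-table row, average the event values for that page id that fall inside the window ending at that row's own timestamp.

```python
from collections import defaultdict

DAY_MS = 24 * 60 * 60 * 1000

def point_in_time_avg(left_rows, events, window_days):
    """For each (page_id, ts) in left_rows, average the values of all
    events for that page_id with event_ts in (ts - window, ts].
    Brute force for clarity; a real backfill would sort and scan."""
    by_page = defaultdict(list)
    for page_id, event_ts, value in events:
        by_page[page_id].append((event_ts, value))
    out = []
    for page_id, ts in left_rows:
        window_start = ts - window_days * DAY_MS
        vals = [v for et, v in by_page[page_id] if window_start < et <= ts]
        out.append(sum(vals) / len(vals) if vals else None)
    return out
```

The key property is that two left rows for the same page id but different timestamps get different answers - which is exactly what makes this harder than a single as-of snapshot query.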
Sure, you could do this backfill with SQL and fancy window functions. But let's just look at what you would need to do to actually make this work, assuming you want it served online with realtime updates (from a page_views kafka topic that is the source of the page views table):
For online serving:
1. Decompose the batch computation to SUM and COUNT and seed the values in your KV store
2. Write the streaming job that does realtime updates to your SUMs/COUNTs.
3. Have an API for fetching and finalizing the AVERAGE value.
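The three serving steps above, in miniature (a toy sketch - a dict stands in for the KV store, and all names are illustrative): AVERAGE is decomposed into (SUM, COUNT) so that batch seeding and streaming updates compose, and the serving API finalizes on read.

```python
class AvgFeatureStore:
    """Toy model of the decomposed online-serving path."""

    def __init__(self):
        self.kv = {}  # key -> (sum, count), standing in for the KV store

    def seed(self, key, total, count):
        # Step 1: batch job writes the decomposed intermediate result.
        self.kv[key] = (total, count)

    def on_event(self, key, value):
        # Step 2: streaming job increments the partial aggregates.
        s, c = self.kv.get(key, (0.0, 0))
        self.kv[key] = (s + value, c + 1)

    def fetch_avg(self, key):
        # Step 3: serving API finalizes SUM/COUNT into AVERAGE.
        s, c = self.kv.get(key, (0.0, 0))
        return s / c if c else None
```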
For Backfilling:
1. Write your verbose query with windowed aggregations (I encourage you to actually try it).
2. Often you also want a daily front-fill job for scheduled retraining. Now you're also thinking about how to reuse previous values. Maybe you reuse your decomposed SUMs/COUNTs above, but if so you're now orchestrating these pipelines.
For making sure you didn't mess it up:
1. Compare logs of fetched features to backfilled values to make sure that they're temporally consistent.
For sharing:
1. Let's say other ML practitioners are also playing around with this feature, but with different timelines (i.e. different timestamps). Are they redoing all of the computation? Or are you orchestrating caching and reusing partial windows?
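The partial-window reuse is easiest to see with per-day partial aggregates (an illustrative sketch, not Chronon's internals): each day's (sum, count) is computed once and cached, and any user's window is then a cheap merge of cached days, so overlapping timelines only pay for the days they don't share.

```python
def merge_daily_partials(daily, start_day, end_day):
    """Combine cached per-day (sum, count) partials into one window
    aggregate over days [start_day, end_day). Two users with
    overlapping timelines re-pay only their non-shared head/tail days."""
    total, count = 0.0, 0
    for day in range(start_day, end_day):
        s, c = daily.get(day, (0.0, 0))
        total, count = total + s, count + c
    return total, count
```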
So you can do all that, or you can write a few lines of Python in Chronon.
Now let's say you want to add a window. Or say you want to change it so it's aggregated by `user_id` rather than `page_id`. Or say you want to add aggregations other than AVERAGE. You can redo all of that again, or change a few lines of Python.
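A rough sketch of what those few lines look like, modeled on the examples in Chronon's public README (exact module paths and parameter names may differ between versions, and `page_view_source` is assumed to be defined elsewhere):

```python
from ai.chronon.group_by import (
    GroupBy, Aggregation, Operation, Window, TimeUnit,
)

page_view_avgs = GroupBy(
    sources=[page_view_source],  # assumed: an event source over the page_views topic/table
    keys=["page_id"],            # switching to ["user_id"] re-keys the whole pipeline
    aggregations=[
        Aggregation(
            input_column="page_view",
            operation=Operation.AVERAGE,  # adding another Aggregation adds another operation
            windows=[Window(length=d, timeUnit=TimeUnit.DAYS)
                     for d in [1, 7, 30, 60, 180]],  # adding a window is one more entry
        ),
    ],
)
```

The batch backfill, streaming updates, seeding, and serving described above are all derived from this one definition.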
I admit this is a bit outside my wheelhouse so I’m probably still missing something.
Isn’t this just a table with 5bn rows of timestamp, page_type, page_views_t1d, page_views_t7d, page_views_t30d, page_views_t60d, and page_views_t180d? You can even compute this incrementally or in parallel by timestamp and/or page_type.
For offline computation, a table with 5bn rows is fine. But for online serving, it would be really challenging to serve the features within a few milliseconds.
But even for offline computation, the same computation logic ends up duplicated in lots of places - we have observed ML practitioners copying SQL queries all over. In the end, debugging, feature interpretability, and lineage become impossible.
Chronon abstracts all those away so that ML practitioners can focus on the core problems they are dealing with, rather than spending time on the ML Ops.
For an extreme use case, one user defined 1000 features with 250 lines of code, which would be wildly impractical with raw SQL queries, not to mention the extra work to serve those features.
How does Chronon do this faster than the precomputed table? And in a single docker container? Is it doing logically similar operations but just automating the creation and orchestration of the aggregation tasks? How does it work?
We use a lambda architecture, which also incorporates precomputed tables. Those tables store intermediate representations of the final results and can provide snapshot or daily-accuracy features. However, for real-time features that require point-in-time correctness, precomputed tables alone present challenges.
For the offline computations, we reuse those intermediate results to avoid recomputing everything from scratch, so the engine can actually scale sub-linearly.
> So you can do all that, or you can write a few lines of python in Chronon.
It all seems a bit hand-wavy here. Will Chronon work as well as the SQL version, or be correct? I vote for an LLM tool to help you write those queries. Or is that effectively what Chronon is doing?
For correctness, yes, it works as well as the SQL version. And the aggregations are easily extensible to other operations. For example, we have a `last` operation, which is not even available in standard SQL.
I’ll stop short of calling comparisons to standard SQL disingenuous but it’s definitely unrealistic because no standard SQL implementation exists.
What does this “last” operation do? There’s definitely a LAST_VALUE() window function in the databases I use. It is available in Postgres, Redshift, EMR, Oracle, MySQL, MSSQL, Bigquery, and certainly others I am not aware of.
Actually, `last` is usually called `last_k(n)`, so that you can specify the number of values in the result array. For example, if the input column is page_view_id and n = 300, it will return the last 300 page_view_id values as an array. If a window is used, for example 7d, it will truncate the results to the past 7d. LAST_VALUE() seems to return just the last value from an ordered set. Hope that helps. Thanks for your interest.
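In toy form, the described semantics look like this (illustrative sketch only - argument names are not Chronon's):

```python
def last_k(events, n, as_of, window=None):
    """Return the most recent n values as of `as_of`, optionally
    truncated to a trailing window. `events` is a list of (ts, value)."""
    lo = as_of - window if window is not None else float("-inf")
    eligible = [v for ts, v in sorted(events) if lo < ts <= as_of]
    return eligible[-n:]
```

Unlike LAST_VALUE(), which yields a single value per frame, this returns an array and respects both the count limit and the window.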
What's with the dismissiveness? The author is a senior staff engineer at a huge company & has worked in this space for years. I'd suspect they've done their diligence...
https://en.wikipedia.org/wiki/Sixth_normal_form
Basically we've had time travel (via triggers, built-in temporal tables, or just writing the data) for a long time; it's just expensive to have it all for an OLTP database.
We've also had slowly changing dimensions to solve this type of problem for a decent amount of time for the labels that sit on top of everything, though really these are just fact tables with a similar historical approach.
6NF works well for some temporal data, but I haven't seen it work well for windowed aggregations because the start/end time format of saving values doesn't handle events "falling out of the window" too well. At least the examples I've seen have values change due to explicit mutation events.
Agree - you don't really want to pre-aggregate your temporal data, or it will effectively only aggregate at each row-time boundary, and that delivers less value than just keeping the individual calculations.
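The "falling out of the window" problem in one picture (hypothetical toy): with start/end-validity rows, nothing changes between two as-of times unless an explicit mutation happens, yet a windowed aggregate changes simply because time passes and old events expire.

```python
def window_sum(events, as_of, window):
    """Sum of event values with ts in (as_of - window, as_of]."""
    return sum(v for ts, v in events if as_of - window < ts <= as_of)
```

With events at t=1 and t=5 and a 5-unit window, the sum at t=5 and the sum at t=7 differ even though no event occurred in between - the t=1 event silently fell out of the window, which a validity-interval (6NF-style) representation has no mutation to hang that change on.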
We've not been developing v2 with ML feature serving in mind so far, but I would love to speak with anyone interested in this use case and figure out where the gaps are.
Snapshots don’t have to be at regular intervals and can be at whatever resolution you choose. You could snapshot as the first step of training then keep that snapshot for the life of the resulting model. Or you could use some other time travel methodology. Snapshots are only one of many options.
Pardon the jargon. But it is a necessary addition to the vocabulary.
To evaluate if a feature is valuable, you could attach the value of the feature to past inferences and retrain a new model to check for improvement in performance.
But this “attach”-ing needs the feature value to be as of the time of the past inference.
That’s the point of this subthread though. What’s the new thing Chronon is doing? It can’t just be point in time features because that’s already a thing.