Let's try an example: `average page views in the last 1, 7, 30, 60, 180 days` Yo...

mulmen · on April 9, 2024

I admit this is a bit outside my wheelhouse so I’m probably still missing something.

Isn’t this just a table with 5bn rows of timestamp, page_type, page_views_t1d, page_views_t7d, page_views_t30d, page_views_t60d, and page_views_t180d? You can even compute this incrementally or in parallel by timestamp and/or page_type.

What’s the magic Chronon is doing?

better365 · on April 10, 2024

For offline computation, it is okay with the table with 5bn rows. But for online serving, it would be really challenging to serve the features at a few milliseconds.

But even for offline computation, for the same computation logic, the code will be duplicated in lots of places. we have observed the ML practitioners copied sql queries all over. In the end, it is not possible for debugging, feature interpretability and lineage.

Chronon abstracts all those away so that ML practitioners can focus on the core problems they are dealing with, rather than spending time on the ML Ops.

For an extreme use case, one user defined 1000 features with 250 lines of code, which is definitely impossible with SQL queries, not to even mention the extra work to serve those features.

mulmen · on April 10, 2024

How does Chronon do this faster than the precomputed table? And in a single docker container? Is it doing logically similar operations but just automating the creation and orchestration of the aggregation tasks? How does it work?

better365 · on April 11, 2024

We utilize a lambda architecture, which incorporates the concept of precomputed tables as well. Those precomputed tables store intermediate representation of the final results. These precomputed tables are capable of providing snapshot or daily accuracy features. However, when it comes to real-time features that require point-in-time correctness, using precomputed tables may present challenges.

For the offline computations, we will reuse those intermediate results to avoid calculation from the beginning again. So the engine can actually scale sub-linearly.

mulmen · on April 11, 2024

Thanks. How does Chronon serve the real-time features without precomputed tables?

throwaway2037 · on April 10, 2024

This is good post. You had me until this part:

    > So you can do all that, or you can write a few lines of python in Chronon.

It all seems a bit handvwavy here. Will Chronon work as well as the SQL version or be correct? I vote for an LLM tool to help you write those queries. Or is that effectively what Chronon is doing?

better365 · on April 10, 2024

For correctness, yes, it works as well as the SQL version. And the aggregation can be extensible for other operations easily. For example, we have an operation of last, which is not even available in standard SQL.

mulmen · on April 10, 2024

I’ll stop short of calling comparisons to standard SQL disingenuous but it’s definitely unrealistic because no standard SQL implementation exists.

What does this “last” operation do? There’s definitely a LAST_VALUE() window function in the databases I use. It is available in Postgres, Redshift, EMR, Oracle, MySQL, MSSQL, Bigquery, and certainly others I am not aware of.

better365 · on April 11, 2024

That's fair.

Actually, Last is usually called last_k(n), so that you can specify the number of the values in the result array. For example, if the input column is page_view_id and n = 300, it will return the last 300 page_view_id as an array. If a window is used, for example, 7d, it will truncate the results to the past 7d. The LAST_VALUE() seems to return the last value from an ordered set. Hope that helps. Thanks for your interests.

mulmen · on April 11, 2024

In SQL we do that with a RANK window function then apply a filter to that rank. It can also be done with a correlated subquery.