You may want to take a look at StarRocks [1]. It is an open-source DB [2] that competes with ClickHouse [3] and claims to scale well – even with joins – to handle use cases like real-time and user-facing analytics, where most queries should run in a fraction of a second.
Hey! I work on the ML Feature Infra at Netflix, operating a system similar to Chronon but with some crucial differences. What alternatives aside from StarRocks did you evaluate as potential replacements prior to building Chronon? Curious whether you got to try Tecton or Materialize.com.
We haven’t tried Materialize - IIUC, Materialize is pure kappa. Since we need to correct upstream data errors and selectively forget data (GDPR) automatically, we need a lambda system.
Tecton we did evaluate, but we decided that its time-travel strategy wasn’t scalable for our needs at the time.
A philosophical difference with Tecton is that we believe the compute primitives (aggregation and enrichment) need to be composable. For that reason we don’t have a FeatureSet or a TrainingSet - we instead have GroupBy and Join.
This enables chaining or composition to handle normalization (think 3NF) / star schemas in the warehouse.
A side benefit is that non-ML use cases are able to leverage functionality within Chronon.
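To make the composability point concrete, here is a toy sketch in plain Python (not Chronon’s actual API - table names and helpers are made up): with just two primitives, a join for enrichment and a group-by for aggregation, you can chain them to flatten a normalized / star-schema warehouse into per-entity features.

```python
# Toy illustration of composable compute primitives:
# enrich a normalized fact table with a dimension table (Join),
# then aggregate the enriched rows per user (GroupBy).
from collections import defaultdict

# Fact table: one row per transaction (3NF: only the merchant_id foreign key).
transactions = [
    {"user_id": "u1", "merchant_id": "m1", "amount": 20.0},
    {"user_id": "u1", "merchant_id": "m2", "amount": 40.0},
    {"user_id": "u2", "merchant_id": "m1", "amount": 15.0},
]

# Dimension table: merchant attributes live in their own table.
merchants = {"m1": {"category": "grocery"}, "m2": {"category": "travel"}}

def join(facts, dims, key):
    """Enrichment primitive: attach dimension columns to each fact row."""
    return [{**row, **dims[row[key]]} for row in facts]

def group_by(rows, keys, agg_column):
    """Aggregation primitive: sum/count/avg of one column per key tuple."""
    acc = defaultdict(lambda: [0.0, 0])
    for row in rows:
        k = tuple(row[c] for c in keys)
        acc[k][0] += row[agg_column]
        acc[k][1] += 1
    return {k: {"sum": s, "count": c, "avg": s / c} for k, (s, c) in acc.items()}

# Chaining Join -> GroupBy yields "spend per user per merchant category",
# a feature neither primitive could produce on its own.
enriched = join(transactions, merchants, key="merchant_id")
features = group_by(enriched, keys=["user_id", "category"], agg_column="amount")
print(features)
# {('u1', 'grocery'): {...}, ('u1', 'travel'): {...}, ('u2', 'grocery'): {...}}
```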
FeatureSets are mutable data and TrainingSets are consistent snapshots of feature data (from FeatureSets). I fail to see what that has to do with composability.
Join is still available for FeatureSets to enable composable feature views - a join is reuse of feature data. A GroupBy is just an aggregation in a feature pipeline, so I’m not sure what your point is here.
You can still do star schema (and even snowflake schema if you have the right abstractions).
Normalization is a model-dependent transformation and happens after the feature store - it needs to be consistent between the training and inference pipelines.
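A minimal sketch of that consistency requirement (names are illustrative, not tied to any particular feature store): the scaling statistics are fit once on the training data and persisted with the model, so inference applies exactly the same transform to raw values read from the feature store.

```python
def fit_stats(values: list[float]) -> tuple[float, float]:
    """Compute mean/std on the training data only."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, max(var ** 0.5, 1e-8)  # guard against zero std

def normalize(value: float, stats: tuple[float, float]) -> float:
    """Apply the *same* persisted stats in both training and inference."""
    mean, std = stats
    return (value - mean) / std

# Training pipeline: raw feature values come out of the offline store.
train_values = [10.0, 20.0, 30.0]
stats = fit_stats(train_values)               # persisted next to the model
train_inputs = [normalize(v, stats) for v in train_values]

# Inference pipeline: the same stats are applied to the online feature value.
serving_input = normalize(25.0, stats)
```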
That evaluation would be an amazing addendum or engineering blog post! I know it’s not as sexy as announcing a product, but from an engineering perspective the process matters as much as the outcome :)
Let’s say you want to compute the average transaction value of a user over the last 90 days. You could pull the individual transactions and average them at request time - or you could pre-compute partial aggregates and re-aggregate them on read.
OLAP systems are fundamentally designed to scale the read path - the former approach. Feature serving needs the latter.
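Here is a small sketch of the two strategies (not Chronon’s actual implementation; the data layout and names are illustrative): averaging raw rows on every request vs. maintaining daily (sum, count) partials ahead of time and merging at most ~90 of them on read.

```python
from collections import defaultdict
from datetime import date, timedelta

# Raw events: (user_id, day, amount)
transactions = [
    ("u1", date(2024, 1, 1), 20.0),
    ("u1", date(2024, 1, 1), 40.0),
    ("u1", date(2024, 2, 10), 30.0),
]

# --- Read-path heavy (OLAP-style): scan raw rows on every request ----------
def avg_from_raw(user_id: str, as_of: date, window_days: int = 90) -> float:
    cutoff = as_of - timedelta(days=window_days)
    amounts = [amt for uid, day, amt in transactions
               if uid == user_id and cutoff < day <= as_of]
    return sum(amounts) / len(amounts) if amounts else 0.0

# --- Write-path heavy (feature-serving style): keep daily partials ---------
# Batch/streaming jobs maintain (sum, count) per user per day ahead of time.
daily_partials = defaultdict(lambda: (0.0, 0))
for uid, day, amt in transactions:
    s, c = daily_partials[(uid, day)]
    daily_partials[(uid, day)] = (s + amt, c + 1)

def avg_from_partials(user_id: str, as_of: date, window_days: int = 90) -> float:
    cutoff = as_of - timedelta(days=window_days)
    total, count = 0.0, 0
    # Re-aggregate at most ~90 daily partials instead of every raw transaction.
    for (uid, day), (s, c) in daily_partials.items():
        if uid == user_id and cutoff < day <= as_of:
            total, count = total + s, count + c
    return total / count if count else 0.0

assert avg_from_raw("u1", date(2024, 3, 1)) == avg_from_partials("u1", date(2024, 3, 1))
```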
[1] https://www.starrocks.io/
[2] https://github.com/StarRocks/starrocks
[3] https://www.starrocks.io/blog/starrocks-vs-clickhouse-the-qu...