You may want to take a look at StarRocks [1]. It is an open-source DB [2] that competes with ClickHouse [3] and claims to scale well – even with joins – to handle use cases like real-time and user-facing analytics, where most queries should run in a fraction of a second.
Hey! I work on the ML Feature Infra at Netflix, operating a system similar to Chronon but with some crucial differences. What alternatives aside from StarRocks did you evaluate as potential replacements prior to building Chronon? Curious whether you got to try Tecton or Materialize.com.
We haven’t tried Materialize - IIUC, Materialize is pure kappa. Since we need to correct upstream data errors and selectively forget data (GDPR) automatically, we need a lambda system.
Tecton we did evaluate, but we decided that its time-travel strategy wasn’t scalable for our needs at the time.
A philosophical difference with Tecton is that we believe the compute primitives (aggregation and enrichment) need to be composable. For that reason we don’t have a FeatureSet or a TrainingSet - we instead have GroupBy and Join.
This enables chaining or composition to handle normalization (think 3NF) / star schemas in the warehouse.
A side benefit is that non-ML use cases are able to leverage functionality within Chronon.
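To make the composability point concrete, here is a toy sketch in plain Python (not Chronon’s actual API - table names and helpers are made up): with just two primitives, a join for enrichment and a group-by for aggregation, you can chain them to flatten a normalized / star-schema warehouse into per-entity features.

```python
# Toy illustration of composable compute primitives:
# enrich a normalized fact table with a dimension table (Join),
# then aggregate the enriched rows per user (GroupBy).
from collections import defaultdict

# Fact table: one row per transaction (3NF: only the merchant_id foreign key).
transactions = [
    {"user_id": "u1", "merchant_id": "m1", "amount": 20.0},
    {"user_id": "u1", "merchant_id": "m2", "amount": 40.0},
    {"user_id": "u2", "merchant_id": "m1", "amount": 15.0},
]

# Dimension table: merchant attributes live in their own table.
merchants = {"m1": {"category": "grocery"}, "m2": {"category": "travel"}}

def join(facts, dims, key):
    """Enrichment primitive: attach dimension columns to each fact row."""
    return [{**row, **dims[row[key]]} for row in facts]

def group_by(rows, keys, agg_column):
    """Aggregation primitive: sum/count/avg of one column per key tuple."""
    acc = defaultdict(lambda: [0.0, 0])
    for row in rows:
        k = tuple(row[c] for c in keys)
        acc[k][0] += row[agg_column]
        acc[k][1] += 1
    return {k: {"sum": s, "count": c, "avg": s / c} for k, (s, c) in acc.items()}

# Chaining Join -> GroupBy yields "spend per user per merchant category",
# a feature neither primitive could produce on its own.
enriched = join(transactions, merchants, key="merchant_id")
features = group_by(enriched, keys=["user_id", "category"], agg_column="amount")
print(features)
# {('u1', 'grocery'): {...}, ('u1', 'travel'): {...}, ('u2', 'grocery'): {...}}
```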
FeatureSets are mutable data and TrainingSets are consistent snapshots of feature data (from FeatureSets). I fail to see what that has to do with composability.
Join is still available for FeatureSets to enable composable feature views - a join is reuse of feature data. A GroupBy is just an aggregation in a feature pipeline, so I’m not sure what your point is here.
You can still do star schema (and even snowflake schema if you have the right abstractions).
Normalization is a model-dependent transformation and happens after the feature store - it needs to be consistent between the training and inference pipelines.
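A minimal sketch of that consistency requirement (names are illustrative, not tied to any particular feature store): the scaling statistics are fit once on the training data and persisted with the model, so inference applies exactly the same transform to raw values read from the feature store.

```python
def fit_stats(values: list[float]) -> tuple[float, float]:
    """Compute mean/std on the training data only."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, max(var ** 0.5, 1e-8)  # guard against zero std

def normalize(value: float, stats: tuple[float, float]) -> float:
    """Apply the *same* persisted stats in both training and inference."""
    mean, std = stats
    return (value - mean) / std

# Training pipeline: raw feature values come out of the offline store.
train_values = [10.0, 20.0, 30.0]
stats = fit_stats(train_values)               # persisted next to the model
train_inputs = [normalize(v, stats) for v in train_values]

# Inference pipeline: the same stats are applied to the online feature value.
serving_input = normalize(25.0, stats)
```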
That evaluation would be an amazing addendum or engineering blog post! I know it’s not as sexy as announcing a product, but from an engineering perspective the process matters as much as the outcome :)
Let’s say you want to compute the average transaction value of a user over the last 90 days. You could pull the individual transactions and average them at request time - or you could pre-compute partial aggregates and re-aggregate them on read.
OLAP systems are fundamentally designed to scale the read path - the former approach. Feature serving needs the latter.
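Here is a small sketch of the two strategies (not Chronon’s actual implementation; the data layout and names are illustrative): averaging raw rows on every request vs. maintaining daily (sum, count) partials ahead of time and merging at most ~90 of them on read.

```python
from collections import defaultdict
from datetime import date, timedelta

# Raw events: (user_id, day, amount)
transactions = [
    ("u1", date(2024, 1, 1), 20.0),
    ("u1", date(2024, 1, 1), 40.0),
    ("u1", date(2024, 2, 10), 30.0),
]

# --- Read-path heavy (OLAP-style): scan raw rows on every request ----------
def avg_from_raw(user_id: str, as_of: date, window_days: int = 90) -> float:
    cutoff = as_of - timedelta(days=window_days)
    amounts = [amt for uid, day, amt in transactions
               if uid == user_id and cutoff < day <= as_of]
    return sum(amounts) / len(amounts) if amounts else 0.0

# --- Write-path heavy (feature-serving style): keep daily partials ---------
# Batch/streaming jobs maintain (sum, count) per user per day ahead of time.
daily_partials = defaultdict(lambda: (0.0, 0))
for uid, day, amt in transactions:
    s, c = daily_partials[(uid, day)]
    daily_partials[(uid, day)] = (s + amt, c + 1)

def avg_from_partials(user_id: str, as_of: date, window_days: int = 90) -> float:
    cutoff = as_of - timedelta(days=window_days)
    total, count = 0.0, 0
    # Re-aggregate at most ~90 daily partials instead of every raw transaction.
    for (uid, day), (s, c) in daily_partials.items():
        if uid == user_id and cutoff < day <= as_of:
            total, count = total + s, count + c
    return total / count if count else 0.0

assert avg_from_raw("u1", date(2024, 3, 1)) == avg_from_partials("u1", date(2024, 3, 1))
```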
[1] https://www.starrocks.io/
[2] https://github.com/StarRocks/starrocks
[3] https://www.starrocks.io/blog/starrocks-vs-clickhouse-the-qu...