- "Depots" are event streams (for event sourced data repositories)
- ETLs read one or more streams and project them into indexable read models...
- Those read models are called "PStates" and represent nested combinations of indices like hashtables, b-trees, linked lists and so on. The point is that they hold the data in a form that is fast to query.
- And you have a query engine which splits a query into 1+ index sub-queries and then aggregates the results.
Am I missing something? This seems like a relatively standard event-sourced / CQRS-like architecture, but streamlined to avoid redundancy and reimplementation of common abstractions.
It would've helped if the terms were less obscure than "depots" and "PStates".
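To make the four bullets above concrete, here is a minimal single-process sketch in plain Java. Every name in it (`FollowEvent`, `EventSourcingSketch`, `followersSince`) is made up for illustration; this is not Rama's API, just the generic event log, projection and nested-index pattern being described.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical event type; not part of any real library.
final class FollowEvent {
    final String follower;
    final String followee;
    final long timestamp;

    FollowEvent(String follower, String followee, long timestamp) {
        this.follower = follower;
        this.followee = followee;
        this.timestamp = timestamp;
    }
}

// Hypothetical illustration of the general pattern; NOT Rama's API.
final class EventSourcingSketch {
    // "Depot": an append-only log of events.
    private final List<FollowEvent> depot = new ArrayList<>();

    // "PState": followee -> (timestamp -> follower),
    // i.e. a hash map whose values are sorted maps.
    private final Map<String, TreeMap<Long, String>> followersByTime = new HashMap<>();

    // Offset of the next depot entry the ETL has not yet processed.
    private int processed = 0;

    void append(FollowEvent event) {
        depot.add(event);
    }

    // "ETL": fold every unprocessed event into the read model.
    void runEtl() {
        while (processed < depot.size()) {
            FollowEvent e = depot.get(processed++);
            followersByTime
                .computeIfAbsent(e.followee, k -> new TreeMap<>())
                .put(e.timestamp, e.follower);
        }
    }

    // Query: followers gained at or after the given timestamp, oldest first.
    List<String> followersSince(String followee, long since) {
        TreeMap<Long, String> index = followersByTime.get(followee);
        if (index == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(index.tailMap(since, true).values());
    }
}
```

Note that the query only sees what the ETL has already folded in, which is exactly where the consistency questions start.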
Individually, none of these concepts are new. I’m sure you’ve seen them all before. You may be tempted to dismiss Rama’s programming model as just a combination of event sourcing and materialized views. But what Rama does is integrate and generalize these concepts to such an extent that you can build entire backends end-to-end without any of the impedance mismatches or complexity that characterize and overwhelm existing systems.
You have the general model correct, but here are a few clarifications:
- PStates are partitioned, durable, replicated indexes that are represented as arbitrary combinations of data structures. A PState can be as simple as an integer per partition, or it can be complex like a map of lists of maps of sets. PStates allow you to shape your indexes to perfectly match your application's use cases.
- I wouldn't call Rama queries an "engine", as it's considerably more straightforward in how it works than something like SQL. The base query API is called "paths", which are an imperative way to concisely reach into one partition of one PState to fetch or aggregate values. There's also "query topologies" which are predefined, on-demand distributed computations that can fetch and aggregate data from many partitions of many PStates.
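As a rough illustration of the difference between reaching into one partition and aggregating across many, here is a plain Java sketch. It is not Rama code and does not use paths or query topologies; `PartitionedIndexSketch`, its hash partitioning scheme and its method names are all assumptions made up for this example, and durability and replication are left out entirely.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for a partitioned nested index; NOT Rama's API.
final class PartitionedIndexSketch {
    // One nested structure per partition: user -> (hashtag -> set of post ids).
    private final List<Map<String, Map<String, Set<Long>>>> partitions = new ArrayList<>();

    PartitionedIndexSketch(int numPartitions) {
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new HashMap<>());
        }
    }

    // Assumed partitioning scheme: hash of the user name.
    private Map<String, Map<String, Set<Long>>> partitionFor(String user) {
        return partitions.get(Math.floorMod(user.hashCode(), partitions.size()));
    }

    void index(String user, String hashtag, long postId) {
        partitionFor(user)
            .computeIfAbsent(user, k -> new HashMap<>())
            .computeIfAbsent(hashtag, k -> new HashSet<>())
            .add(postId);
    }

    // Single-partition lookup: navigate user -> hashtag -> set, then aggregate.
    int countPosts(String user, String hashtag) {
        return partitionFor(user)
            .getOrDefault(user, Collections.emptyMap())
            .getOrDefault(hashtag, Collections.emptySet())
            .size();
    }

    // Cross-partition computation: visit every partition and combine the results.
    int countPostsEverywhere(String hashtag) {
        int total = 0;
        for (Map<String, Map<String, Set<Long>>> partition : partitions) {
            for (Map<String, Set<Long>> byHashtag : partition.values()) {
                total += byHashtag.getOrDefault(hashtag, Collections.emptySet()).size();
            }
        }
        return total;
    }
}
```

The first query touches exactly one partition; the second has to fan out, which in a real cluster means network hops rather than a nested loop.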
Thanks, I will read more soon! I'm curious... how do you resolve the "impedance mismatch" between the "canonical" models that business decisions are based upon, which need to be synchronous with the depots (and mutually synchronous with other models sharing fragments of the same data), and the eventually consistent read models, which have a more lax constraint on how up to date they are?
How do you ensure consistency here? How do you organize it in the data flow?
Say I update a user because that user still seems to be there in the query results/indexes, when actually an event deleting this user happened some time ago?
I suppose this can also happen if the depots themselves run queries on PStates in order to determine whether a certain event is valid at all, and how exactly to carry it out.
The impedance mismatches you're used to from using databases are gone because:
- You can finely tune your indexes to be exactly the optimal shape (data structure) for your application. You can see this in our Mastodon implementation with the big variety of data structures we used for all the use cases.
- You're generally just using regular Java objects everywhere: appending to depots, during ETL processing, and stored in indexes.
How you coordinate data creation with view updates is a deeper topic, so I'll just summarize one of the basic mechanisms Rama provides for coordinating this. Depot appends can have an "ack level" that determines which conditions must be met before Rama tells you that the depot append has completed. The default level is "full ack", which includes all streaming topologies colocated with that depot fully processing that record. With this level, when the depot append completes you know that all associated indexes (PStates) have been updated.
There's also "append ack", which only waits for the depot append to be replicated on the depot, and "no ack", which is fire and forget. These all have their uses depending on the specific needs of an application.
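The three ack levels can be modeled in plain Java with `CompletableFuture`. This is only a toy model of the semantics described above, not Rama's client API: `AckLevelSketch`, `AckingDepotSketch` and `register` are hypothetical names, replication is not modeled, and the "colocated topologies" are just callbacks run on the common pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// Hypothetical names; a toy model of the three ack levels, NOT Rama's API.
enum AckLevelSketch { FULL, APPEND, NONE }

final class AckingDepotSketch {
    private final List<String> log = new ArrayList<>();
    private final List<Consumer<String>> colocatedEtls = new ArrayList<>();

    // Register all ETLs before appending; this sketch does not guard later changes.
    void register(Consumer<String> etl) {
        colocatedEtls.add(etl);
    }

    // Returns a future that completes at the point chosen by the ack level.
    CompletableFuture<Void> append(String record, AckLevelSketch level) {
        // Stage 1: the record lands in the log (replication is not modeled).
        CompletableFuture<Void> logged = CompletableFuture.runAsync(() -> {
            synchronized (log) {
                log.add(record);
            }
        });
        // Stage 2: every colocated ETL has processed the record.
        CompletableFuture<Void> processed =
            logged.thenRun(() -> colocatedEtls.forEach(etl -> etl.accept(record)));

        switch (level) {
            case FULL:
                return processed;   // views are updated when this completes
            case APPEND:
                return logged;      // record is stored; views may lag behind
            default:
                return CompletableFuture.completedFuture(null); // fire and forget
        }
    }
}
```

With `FULL`, calling `join()` on the returned future and then reading the view is a read-your-writes pattern; with the other two levels the caller has to tolerate stale reads.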
Thanks! So we can see these ACKs as "wait and synchronize" signals, I suppose? But how can we ensure "all or nothing" between all parties trying to ACK conditions they're mutually dependent on? I.e. transactionality or atomicity?
Systems that promise "free linear scaling" without qualifiers either withhold or have not analyzed/realized their bottlenecks yet. Say there is eventual consistency: maybe the "eventuality" becomes so long that the service fails at its purpose. Or the communication link bandwidth between key business logic (mutation event generating) services is exhausted, and so on.
The only systems that scale linearly are stateless systems. Mastodon is not stateless. And even stateless systems hit some bottlenecks eventually, as they exist and run in a scale-variant Universe.
So this claim by itself doesn't immediately impress me; it just turns my red lights on, awaiting further investigation. But we can of course discuss why this claim is made and how it is supported. The article is long so I've not had the chance to read it entirely yet.
But we have X number of event streams mapped through Y number of ETLs to produce Z number of read model indices, in a shape that seems to form a highly interlinked DAG, which eventually loops back on itself in terms of message flow. Just the increased cross-chatter here as we introduce more features suggests non-linear scaling.
(For example, it can scale the way persistent data structures scale, which is to say "O(1) within target operational bounds" despite technically being log-n with a high branch factor.)
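The arithmetic behind that parenthetical is easy to check. The helper below is a hypothetical illustration, and the branch factor of 32 is an assumption (it is a common choice for persistent tries, not a number taken from this thread).

```java
// Hypothetical helper: how many tree levels are needed to address `entries`
// items when every node has `branchFactor` children.
final class TrieDepthSketch {
    // Assumes entries is far below Long.MAX_VALUE so capacity cannot overflow.
    static int levelsNeeded(long entries, int branchFactor) {
        int levels = 0;
        long capacity = 1;
        while (capacity < entries) {
            capacity *= branchFactor;
            levels++;
        }
        return levels;
    }
}
```

With a branch factor of 32, a million entries need 4 levels and a billion need 6, whereas a binary tree over a billion entries needs 30. So the cost is logarithmic on paper but nearly flat over any realistic size range.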