I like the idea of using “fat” events as a means of data integration between systems.
Calling them RESTful events is a good term, as I’ve pitched it in much the same fashion. Every time a system responds to a create/update/execute request, drop an equivalent message into an event stream. Initially these can form the basis of a streaming ETL to whatever systems you use for analytics, but as new internal systems are brought online (e.g. a new CRM system to help sales manage all the leads signing up for your successful product) the integration is already there, waiting. Just add a new consumer to the existing stream.
Over time, you end up with a “data mesh”. If you look past the marketing of the term, it basically just means that each application in your company should publish both sync (for real-time usage) and async (for data sharing between systems of record) interfaces.
The question at hand will always reduce down to the "Two Generals problem" [0]. The outbox pattern is a nice separation of concerns and gives at-least-once delivery, as long as the forwarding happens before the event is marked as processed in the outbox. Having both side effects happen atomically through all failure cases is impossible.
To solve that issue you need a cooperating destination for your writes which handles de-duplication. Then you get into the weeds of causality, ordering and idempotence. Ugh.
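To make the outbox half concrete, here's a minimal sketch (sqlite3 for brevity; the table and column names are made up, not from any particular framework): the business row and the event row commit in one transaction, and a separate forwarder later publishes the event and marks the row processed.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL)")
conn.execute("""CREATE TABLE outbox (
    message_id TEXT PRIMARY KEY,   -- consumers de-duplicate on this
    payload    TEXT NOT NULL,
    processed  INTEGER NOT NULL DEFAULT 0)""")

def create_order(order_id: str, total: float) -> str:
    """Write the business row and the event atomically: both commit or neither."""
    message_id = str(uuid.uuid4())
    with conn:  # one transaction
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox (message_id, payload) VALUES (?, ?)",
            (message_id, json.dumps({"type": "OrderCreated", "order_id": order_id})),
        )
    return message_id

create_order("o-1", 99.5)
# A separate forwarder polls the outbox, publishes, then marks rows processed.
# If it crashes between publish and mark, the row is re-sent: at-least-once.
row = conn.execute("SELECT payload FROM outbox WHERE processed = 0").fetchone()
```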
For some real world examples see the AWS Kinesis documentation which says that any application using Kinesis must be able to handle duplicate records [1].
> There are two primary reasons why records may be delivered more than one time to your Amazon Kinesis Data Streams application: producer retries and consumer retries. Your application must anticipate and appropriately handle processing individual records multiple times.
It is possible that you could get inconsistencies between systems, but you mitigate that with durable event streams (e.g. Kafka) and by ensuring that each data entity in your enterprise has a system of record that can be used to resolve conflicts.
This is really not that different than using ETL jobs to move data about, but taking a streaming approach vs a batch approach.
> It is possible that you could get inconsistencies between systems, but you mitigate that with durable event streams (e.g. Kafka) and by ensuring that each data entity in your enterprise has a system of record that can be used to resolve conflicts.
The event stream layer isn't where the sync problems have arisen in the systems I've worked on. It's the "commit transactionally both to your database and the event stream" part. Not a lot of systems are built to be ready to roll back the database change if the event publish fails, or to handle duplicate events if you error the other way.
That would work. Delivery will be at-least-once, and you then handle message reliability by de-duplicating through the database. In other words, you need to uniquely identify each message to ensure idempotency, since you will see duplicate writes of the same message to the database.
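A sketch of that consumer-side de-duplication (names are illustrative; in production the "seen" set would be a durable table, ideally updated in the same transaction as the apply):

```python
def make_idempotent_consumer(apply_fn):
    """Wrap a handler so re-delivered messages (same message_id) are applied once."""
    seen = set()  # in production: a durable table, written with the apply
    def handle(message: dict) -> bool:
        if message["message_id"] in seen:
            return False  # duplicate delivery, skip
        apply_fn(message["payload"])
        seen.add(message["message_id"])  # record only after a successful apply
        return True
    return handle

applied = []
handle = make_idempotent_consumer(applied.append)
msg = {"message_id": "m-1", "payload": {"order_id": "o-1"}}
first = handle(msg)
second = handle(msg)  # redelivery of the same message: ignored
```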
Just need to make sure to not mess up causal ordering between events because of out of order retries, if such things are important for your application.
Reading sequentially can be hard here, especially depending on throughput and how well you can or can't shard (e.g. how wide is the radius of possible side effects).
E.g. what if one event is something like "credit here, debit there" - you need to process it sequentially for both sides!
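A toy illustration of that sharding tension (the hashing scheme is made up): routing by entity key keeps each entity's events ordered, but an event that touches two entities may not land on any single shard.

```python
from collections import defaultdict

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    # stable hash so every event for one entity lands on the same shard
    return sum(key.encode()) % NUM_SHARDS

shards = defaultdict(list)

def route(event: dict) -> None:
    # 'keys' lists every entity the event touches
    targets = {shard_for(k) for k in event["keys"]}
    if len(targets) > 1:
        # a "credit here, debit there" event touches two entities: no single
        # shard preserves ordering for both, so such events need special
        # handling (e.g. a saga, or one serialized stream for the pair)
        raise ValueError("cross-shard event needs special handling")
    shards[targets.pop()].append(event)

route({"keys": ["acct-1"], "type": "credit"})
```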
I guess if the concern is around atomically committing to the DB and event stream at the same time, you could hang the event stream off the DB using change-data-capture to populate the events.
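A rough sketch of that forwarder side (the function names are hypothetical, not from any CDC tool): because the checkpoint only advances after a successful publish, a crash in between re-sends the row, so consumers still need to de-duplicate.

```python
def forward_changes(fetch_committed, publish, checkpoint):
    """Poll committed rows past the checkpoint and publish them as events.
    The checkpoint advances only after a successful publish, so a crash
    in between re-sends the row: at-least-once delivery downstream."""
    last = checkpoint["last_id"]
    for row_id, payload in fetch_committed(after=last):
        publish(payload)
        checkpoint["last_id"] = row_id

published = []
checkpoint = {"last_id": 0}
rows = [(1, {"op": "insert", "id": "o-1"}),
        (2, {"op": "update", "id": "o-1"})]
forward_changes(lambda after: [r for r in rows if r[0] > after],
                published.append, checkpoint)
```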
Ultimately, whenever you are pushing data between systems you can end up with inconsistencies, which is why it’s important to clearly define systems of record.
That's basically what the database would do anyway, so generally, yes (e.g. even with a RDBMS it's internally got some sort of write log, in most cases). That's gonna help a lot!
It doesn't give you any guarantees about "at a given time" consistency, though. Maybe your queue or your event processor got backed up. So how real-time do you need both sides?
These are, of course, problems that people have solved to varying level of satisfaction for most use cases! But I've seen systems built by people who saw the blog post about "have both!" and then didn't think about all these cases and then... it gets messy.
If something is relying on the data in the queue to trigger a request to a service using the DB, the requestee may not have the full data yet.
The pattern I usually see is wiring this the other way - use an initial event as the trigger to write to the DB, while also streaming the DB's binlog or equivalent back as events. Now the risk is that another service gets a "too fresh" view but this is usually less harmful than a stale view. Things listening to the binlog events need to process them idempotently, but this is usually not a major complication since most queue designs will require you to be prepared for that anyway.
Typically you see these suggested as part of a "change data capture" type process, where the event is only published after the action is committed to the data store. The downside (IMO) is that this requires integrating directly with the data store itself, which isn't always easy to do, or obvious from a git/CICD perspective.
Possibly, but it records the complete information about actions to be performed on some system, which presumably is what the "fat events" were supposed to be about (unless I misinterpreted it somehow).
To the extent a fat event is a "RESTful" event, it should definitely not be about the actions but about the current state of things. Now, maybe it only emits when that state has changed, but that's distinct from what's in the message.
What exactly a WAL contains depends on the specific storage, but it's commands at least as often as it's state - probably more often since you want your WALs to be tiny and they often also facilitate rollback.
In general you wouldn't want to emit a fat event until the data is firmly committed to its system of record which is usually not the event system itself; in that use case it will be a kind of 'write behind' log.
I tend to use whatever types of events the system provides, in realtime. In some cases, it's a real pain, as I can't really choose what I get.
If I am using an event as a "record" of an event, I tend to store a "before" state, and an "after" state. In many cases, I can actually store entire serialized objects. This allows, for example, the ability to recreate an object state/context, simply by re-instantiating the stored object, and using the reanimated instance to replace the current one.
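A minimal sketch of that before/after shape (class and field names invented for illustration): the event stores full snapshots, so a prior state can be re-instantiated straight from the record.

```python
import copy
from dataclasses import dataclass, asdict

@dataclass
class Account:
    owner: str
    balance: float = 0.0

def record_change(log: list, before: Account, after: Account, kind: str) -> None:
    """Store full before/after snapshots so state can be reconstituted later."""
    log.append({"kind": kind,
                "before": asdict(before),
                "after": asdict(after)})

log = []
acct = Account("alice", 10.0)
before = copy.deepcopy(acct)
acct.balance += 5.0
record_change(log, before, acct, "deposit")

# "reanimate": rebuild the pre-change object straight from the event
restored = Account(**log[-1]["before"])
```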
Sort of a RESTFul event.
In my work, whenever possible, I like to use a "RefCon" pattern. This is a classic pattern, that I learned from working with Apple code, but I know that it's a lot older.
We simply attach a "RefCon" (Reference Context) to an event. This is usually an untyped value that is supplied to the event source when it is set up, then passed to the event target when the event fires. These days, it's easier, as closures/lambdas often allow you to access the initiating context. In the old days, it might have simply been a single-byte value that could be used as an index into a lookup table, allowing the called context to re-establish itself.
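A tiny Python sketch of the pattern (names are mine, not Apple's): the refcon is handed to the source at setup time and returned, untouched, when the event fires.

```python
from typing import Any, Callable

class EventSource:
    """Toy event source that carries an opaque 'refcon' back to the handler,
    so the target can re-establish its context when the event fires."""
    def __init__(self):
        self._listeners = []

    def add_listener(self, handler: Callable, refcon: Any = None) -> None:
        self._listeners.append((handler, refcon))

    def fire(self, event: dict) -> None:
        for handler, refcon in self._listeners:
            handler(event, refcon)   # refcon is passed back untouched

received = []
source = EventSource()
source.add_listener(lambda ev, ctx: received.append((ev["name"], ctx)),
                    refcon={"widget_id": 42})
source.fire({"name": "click"})
```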
I recently ran into the need for one of these, in an SDK that I wrote, where I didn't. I'm still kicking myself. I think that I didn't supply it, because I thought it might have been too complicated for some consumers of the SDK.
In the six years or so, since I wrote that SDK, I have been the only consumer.
I found that GraphQL subscriptions give a great balance between trigger/signal and fat/REST events in complex environments.
The events that a subscriber is interested in are decorated with the exact data the subscriber specifies at subscription time, avoiding an API call while providing exactly the level of richness the client requires.
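For example, a hypothetical subscription (the schema and field names here are invented): the subscriber names exactly the fields it wants, and the fired event arrives pre-decorated with that data.

```graphql
subscription OnOrderChanged {
  orderChanged {
    # the subscriber lists exactly the fields it needs, so the event
    # carries this data and no follow-up API call is required
    id
    status
    total
    customer {
      email
    }
  }
}
```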
I think the thing about the “exact level of richness required” is hugely important. It’s easy to run into the problem of needing to pull in all the data that any consumer would need (set union), but graphql allows each consumer to only grab what it needs. I’m not sure what to call this decoupling but it’s a huge win.
It is, and because a properly built graphql backend will only fetch the data requested, the benefits of this approach extend far beyond the ergonomics of API access and can significantly improve the performance of the system.
I’m a little surprised that Graphql subs aren’t more widely talked about.
I think what the author named "Domain events" are equivalent to derivatives, "Fat events" are equivalent to step-wise integration, and "Trigger events" represent steps on a function of data on the time domain.
I like this ontology; we've divided our events into roughly similar categories but these are better names than we use.
I'd further advance that
- Trigger events and RESTful events are roughly the same thing, just a question of what you choose in latency vs. size vs. schema flexibility space. We even have events in our system whose schema inlines data below a certain size but links above that.
- There is a fourth type: windowed reductions over domain events, e.g. the "changelog" in Kafka Streams. This bears a similar relation to domain events as the RESTful event does to trigger events.
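A hand-rolled analogue of that changelog idea (this is a toy reduction over domain events, not Kafka Streams' actual API): fold events into per-key running state and emit one changelog record per input event.

```python
def build_changelog(events):
    """Fold domain events into per-key running state, emitting one
    (key, new_state) changelog record per input event -- a hand-rolled
    analogue of a Kafka Streams aggregation's changelog topic."""
    state = {}
    changelog = []
    for ev in events:
        key = ev["account"]
        state[key] = state.get(key, 0.0) + ev["amount"]  # the reduction
        changelog.append((key, state[key]))
    return changelog, state

events = [{"account": "a", "amount": 10.0},
          {"account": "a", "amount": -3.0},
          {"account": "b", "amount": 5.0}]
changelog, state = build_changelog(events)
```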
If I understand your fourth type correctly, it's like snapshotting in an event-sourced system. For me those are only internal and not meant to be communicated; rather, they are there to speed up reconstitution of an event-sourced aggregate. I might be misunderstanding your point, though.
With regard to your first point, in terms of communication, they serve very similar integration patterns. I imagine you've standardised the consumption of these events, making it transparent if the event is linked or attached. Is that correct? In such cases, there's one difference, which is the out of sync state; the situation where the state has been altered after the signal was dispatched but before it was consumed. Is that something you deal with?
> I imagine you've standardised the consumption of these events, making it transparent if the event is linked or attached. Is that correct?
Yes, a pretty thin layer that returns the data if it was attached to the message or makes an HTTP request if not.
> the situation where the state has been altered after the signal was dispatched but before it was consumed. Is that something you deal with?
Our most common use case is trying to keep local caches of e.g. control plane data up-to-date, so retrieving something more recent than existed when the event was produced is usually a bonus. In the rare cases it is a concern we make sure the external data link is unique (e.g. into an S3 bucket with an expiration policy a bit longer than the event retention time).
> For me those are only internal and not meant to be communicated, rather they are there to speed up reconstitution of an event-sourced aggregate. I might be misunderstanding your point tho.
I think you've got the point. As these are implemented in Kafka Streams they are also an event source themselves which can be consumed like any other topic's messages.
I've worked on a real-time dashboard for operational insights for an airport once, where they initially used EventStore to do effectively the same thing. We ended up replacing EventStore with a similar setup using TypeScript and Kafka; instead of storing the projections into a new topic, we used MySQL and Redis as the storage backend. It worked quite well, especially in the cases where more interaction with external systems was needed. I haven't worked with Kafka Streams yet, but it sure sounds like a worthwhile investment to spend some time with.
If you are using "trigger events" for security reasons, wouldn't it be more appropriate to send an ephemeral "eventID" rather than an actual "orderID"? Otherwise you may be leaking order IDs, which could potentially carry useful information for an attacker.
Any exposed information should be exposed deliberately. Although I didn't cover it in this post, in my event-driven setups events are always wrapped in a message envelope data structure, where a message contains 1) the event/payload and 2) a set of headers to carry more "system-y" data, like time of recording, system of origin, and other observability-related information. The event ID is one of those values.
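Roughly this shape, if it helps (the field and header names are illustrative): identity and observability data live in the headers, not in the domain payload.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Envelope:
    """Wraps an event payload with 'system-y' headers; header names are
    illustrative, not a standard."""
    payload: dict
    headers: dict = field(default_factory=dict)

def wrap(payload: dict, origin: str) -> Envelope:
    return Envelope(payload=payload, headers={
        "event_id": str(uuid.uuid4()),  # identity lives here, not in the payload
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "origin": origin,
    })

env = wrap({"type": "OrderPlaced", "order_id": "o-1"}, origin="orders-service")
```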
In general you need to send something which can be used to look up the data in a more appropriate (e.g. authenticated or deletable) system. A true event ID may not be available until after the message is successfully produced; anything generated before that is tantamount to the order ID anyway.
Note that "sensitive" in this case can mean a persistence/retention boundary and not a security boundary.
I'm not sure. We have distributed systems of this type at $JOB and a correlation ID that is generated on user action and attached to every event in the chain (regardless of service) is plenty. I don't think we've ever needed to look at anything so specifically that a unique event ID for each event would have been helpful.
When domain events are the "trigger", RESTful events can exist without an explicit "trigger" type of event. If you look at the payload, they are different, but sometimes one is used for the same purpose as the other, internally.
Lots of REST systems have ways to denote a deleted resource (e.g. `{ "data": null }`) and some event systems too (e.g. Kafka, empty body and length -1), so I don’t see why you need trigger events to handle deletion.
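E.g. a consumer applying such events to a local view only needs upsert-or-delete semantics; a sketch, with a None body standing in for the tombstone:

```python
def apply_restful_event(store: dict, key: str, data) -> None:
    """Upsert-or-delete: a null/empty body (a Kafka-style tombstone)
    deletes the resource, otherwise latest state wins."""
    if data is None:
        store.pop(key, None)
    else:
        store[key] = data

store = {}
apply_restful_event(store, "order:1", {"status": "shipped"})
apply_restful_event(store, "order:1", None)  # tombstone: resource deleted
```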