
If you're looking for a ready-to-use open-source implementation of CDC for Postgres (and other databases), take a look at Debezium [1].

On audit logs in particular, we have a post on our blog which discusses how to use CDC for that, focusing on enriching change events by means of stream processing with additional metadata, such as the business user performing a given change [2].

One advantage of log-based CDC over trigger-based approaches is that it doesn't impact the latency of transactions: it runs fully asynchronously from the writing transactions, reading changes from the database's WAL.
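For a sense of what the setup looks like, here's a rough sketch of registering a Postgres connector through the Kafka Connect REST API (Python just for illustration; hostnames, credentials and database names are placeholders, and exact property names vary a bit between Debezium versions):

    import json, urllib.request

    # Hypothetical connector registration against a local Kafka Connect instance;
    # adjust hosts, credentials and table filters for your environment.
    connector = {
        "name": "inventory-connector",
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",
            "database.hostname": "localhost",
            "database.port": "5432",
            "database.user": "postgres",
            "database.password": "postgres",
            "database.dbname": "inventory",
            "database.server.name": "inventory",
        },
    }
    req = urllib.request.Request(
        "http://localhost:8083/connectors",
        data=json.dumps(connector).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)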

Disclaimer: I work on Debezium

[1] debezium.io
[2] https://debezium.io/blog/2019/10/01/audit-logs-with-change-d...



Debezium is awesome, thanks for the great work! It's in my toolbox for when embulk [1] batch processing doesn't cut it (or even in combination with it).

[1] https://github.com/embulk/embulk


Thank you so much, it's always awesome to hear that kind of feedback!


Gunnar,

I've been thinking about your "transaction" pattern more, and I was wondering: why don't you stream the transaction metadata directly to a Kafka "transaction_context_data" topic? Would writing some of the data to the db while writing some of the data directly to Kafka make it less consistent during faults?

The reason I ask: I'm curious what it would look like to use this pattern for an entire application. I think it could be a very powerful way to do "event sourcing lite" while still working with traditional ORMs. Would writing an additional transaction_metadata row for most of the application inserts/updates slow things down? Too many writes to that table?


> Would writing some of the data to the db while writing some of the data directly to kafka make it less consistent during faults?

Yes, this kind of "dual write" is prone to inconsistencies: if either the DB transaction rolls back or the Kafka write fails, you'll end up with inconsistent data. We discuss this in a larger context in the outbox pattern post [1]. This sort of issue is avoided by writing the metadata to a separate table as part of the DB transaction, which will either be committed or rolled back as one atomic unit.
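In code, the single-transaction variant looks roughly like this (Python/psycopg2 sketch with made-up table and column names; the metadata table is keyed by the transaction id):

    import psycopg2

    # Sketch: the business change and the audit metadata commit or roll back
    # together, so there is no window in which they can diverge.
    conn = psycopg2.connect("dbname=app user=app")
    try:
        with conn:  # commits on success, rolls back on exception
            with conn.cursor() as cur:
                cur.execute(
                    "UPDATE customers SET email = %s WHERE id = %s",
                    ("alice@example.com", 42),
                )
                cur.execute(
                    "INSERT INTO transaction_metadata (tx_id, app_user, use_case) "
                    "VALUES (txid_current(), %s, %s)",
                    ("alice", "update-contact-details"),
                )
    finally:
        conn.close()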

> Would writing an additional transaction_metadata row for most of the application insert/updates slow things down?

You'd just do one insert into the metadata table per transaction. As the change events themselves contain the transaction id, and the metadata table is keyed by transaction id, you can correlate the events downstream. So the overhead depends on how many operations you have in your transactions already. Assuming you don't do just a single insert or update, but rather some select(s) and then some writes, the additional insert into the transaction metadata table typically shouldn't make a substantial difference.
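A downstream consumer can then do the correlation, e.g. by buffering metadata events by transaction id and enriching the data change events that carry the same id. A very rough sketch (topic names and field access are assumptions, e.g. the JSON converter without schemas; the audit log blog post linked above does this properly with Kafka Streams):

    import json
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "app.public.customers", "app.public.transaction_metadata",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v) if v else None,
    )

    metadata_by_tx = {}
    for msg in consumer:
        event = msg.value
        if event is None:  # tombstone
            continue
        if msg.topic.endswith("transaction_metadata"):
            # remember the audit metadata written in that transaction
            metadata_by_tx[event["after"]["tx_id"]] = event["after"]
        else:
            # enrich the data change event with the matching metadata
            audit = metadata_by_tx.get(event["source"]["txId"])
            # ... hand the change event plus its audit metadata to the sink ...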

Another option, exclusive to Postgres, would be to write an event for the transaction metadata solely to the WAL using pg_logical_emit_message(), i.e. it won't be materialized in any table. It can still be picked up via logical decoding; we still need to add support for that record type to Debezium though (contributions welcome :).
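Just to show what the emitting side of that variant would look like (again a Python/psycopg2 sketch; as said, Debezium doesn't surface these messages yet):

    import json
    import psycopg2

    # The metadata goes only to the WAL as a logical decoding message; it is
    # never materialized in a table, but it is tied to the transaction.
    conn = psycopg2.connect("dbname=app user=app")
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE customers SET email = %s WHERE id = %s",
                ("alice@example.com", 42),
            )
            cur.execute(
                "SELECT pg_logical_emit_message(true, 'audit', %s)",
                (json.dumps({"app_user": "alice", "use_case": "update-contact-details"}),),
            )
    conn.close()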

[1] https://debezium.io/blog/2019/02/19/reliable-microservices-d...


> kafka

But you already have Postgres; why add another thing to it?


A benefit of Kafka is that it is relatively trivial to spin up multiple consumers from a single CDC stream.

Say you have some user attribute table, and there are a couple of downstream consumers (teams!) that also want this data. You hook the table up to Kafka with Debezium and write all changes to a Kafka topic.

A team of data scientists can stream these straight into their model, a team of data engineers can dump the changes in real time into an analytics store for business intelligence, or another team that needs this data can build its own denormalized table of user attributes.

Once the producing team gets its data into Kafka, basically any team can consume it, which is a lot easier for the producing team to maintain and doesn't stress the database, so it's a very nice pattern for bigger orgs.
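Concretely, the fan-out is just every team reading the same topic with its own consumer group, so each keeps independent offsets and nobody touches the source database. A small sketch (topic and group names are made up):

    from kafka import KafkaConsumer

    # The analytics team reads the full change stream into their store...
    analytics = KafkaConsumer(
        "app.public.user_attributes",
        bootstrap_servers="localhost:9092",
        group_id="bi-analytics-loader",
        auto_offset_reset="earliest",
    )

    # ...while the ML team reads the very same topic independently.
    features = KafkaConsumer(
        "app.public.user_attributes",
        bootstrap_servers="localhost:9092",
        group_id="ml-feature-pipeline",
        auto_offset_reset="earliest",
    )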

For the most part, keeping the data in a single Postgres instance as long as possible would be my recommended solution. But there is a point where getting the data to its consumers becomes tedious enough that this additional infrastructure is worth it.


That is very cool.



