
If you're looking for a ready-to-use open-source implementation of CDC for Postgres (and other databases), take a look at Debezium [1].

On audit logs in particular, we have a post on our blog which discusses how to use CDC for that, focusing on enriching change events by means of stream processing with additional metadata, such as the business user performing a given change [2].

One advantage of log-based CDC over trigger-based approaches is that it doesn't impact the latency of transactions: it runs fully asynchronously from the writing transactions, reading changes from the database's WAL.
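For a sense of what the setup looks like, here's a rough sketch of registering a Postgres connector through the Kafka Connect REST API (Python just for illustration; hostnames, credentials and database names are placeholders, and exact property names vary a bit between Debezium versions):

    import json, urllib.request

    # Hypothetical connector registration against a local Kafka Connect instance;
    # adjust hosts, credentials and table filters for your environment.
    connector = {
        "name": "inventory-connector",
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",
            "database.hostname": "localhost",
            "database.port": "5432",
            "database.user": "postgres",
            "database.password": "postgres",
            "database.dbname": "inventory",
            "database.server.name": "inventory",
        },
    }
    req = urllib.request.Request(
        "http://localhost:8083/connectors",
        data=json.dumps(connector).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)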

Disclaimer: I work on Debezium

[1] debezium.io
[2] https://debezium.io/blog/2019/10/01/audit-logs-with-change-d...



Debezium is awesome, thanks for the great work! It's in my toolbox for when embulk [1] batch processing doesn't cut it (or even in combination with it).

[1] https://github.com/embulk/embulk


Thank you so much, it's always awesome to hear that kind of feedback!


Gunnar,

I've been thinking about your "transaction" pattern more, and I was wondering: why don't you stream the transaction metadata directly to a Kafka "transaction_context_data" topic? Would writing some of the data to the db while writing some of the data directly to Kafka make it less consistent during faults?

The reason I ask: I'm curious what it would look like to use this pattern for an entire application. I think it could be a very powerful way to do "event sourcing lite" while still working with traditional ORMs. Would writing an additional transaction_metadata row for most of the application inserts/updates slow things down? Too many writes to that table?


> Would writing some of the data to the db while writing some of the data directly to kafka make it less consistent during faults?

Yes, this kind of "dual write" is prone to inconsistencies: if either the DB transaction rolls back or the Kafka write fails, you'll end up with inconsistent data. We discuss this in a larger context in the outbox pattern post [1]. This sort of issue is avoided by writing the metadata to a separate table as part of the DB transaction, which will either be committed or rolled back as one atomic unit.
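In code, the single-transaction variant looks roughly like this (Python/psycopg2 sketch with made-up table and column names; the metadata table is keyed by the transaction id):

    import psycopg2

    # Sketch: the business change and the audit metadata commit or roll back
    # together, so there is no window in which they can diverge.
    conn = psycopg2.connect("dbname=app user=app")
    try:
        with conn:  # commits on success, rolls back on exception
            with conn.cursor() as cur:
                cur.execute(
                    "UPDATE customers SET email = %s WHERE id = %s",
                    ("alice@example.com", 42),
                )
                cur.execute(
                    "INSERT INTO transaction_metadata (tx_id, app_user, use_case) "
                    "VALUES (txid_current(), %s, %s)",
                    ("alice", "update-contact-details"),
                )
    finally:
        conn.close()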

> Would writing an additional transaction_metadata row for most of the application insert/updates slow things down?

You'd just do one insert into the metadata table per transaction. As the change events themselves contain the transaction id, and the metadata table is keyed by transaction id, you can correlate the events downstream. So the overhead depends on how many operations you have in your transactions already. Assuming you don't do just a single insert or update, but rather some select(s) and then some writes, the additional insert into the transaction metadata table typically shouldn't make a substantial difference.
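A downstream consumer can then do the correlation, e.g. by buffering metadata events by transaction id and enriching the data change events that carry the same id. A very rough sketch (topic names and field access are assumptions, e.g. the JSON converter without schemas; the audit log blog post linked above does this properly with Kafka Streams):

    import json
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "app.public.customers", "app.public.transaction_metadata",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v) if v else None,
    )

    metadata_by_tx = {}
    for msg in consumer:
        event = msg.value
        if event is None:  # tombstone
            continue
        if msg.topic.endswith("transaction_metadata"):
            # remember the audit metadata written in that transaction
            metadata_by_tx[event["after"]["tx_id"]] = event["after"]
        else:
            # enrich the data change event with the matching metadata
            audit = metadata_by_tx.get(event["source"]["txId"])
            # ... hand the change event plus its audit metadata to the sink ...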

Another option, exclusive to Postgres, would be to write an event for the transaction metadata solely to the WAL using pg_logical_emit_message(), i.e. it won't be materialized in any table. It can still be picked up via logical decoding; we still need to add support for that record type to Debezium though (contributions welcome :).
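Just to show what the emitting side of that variant would look like (again a Python/psycopg2 sketch; as said, Debezium doesn't surface these messages yet):

    import json
    import psycopg2

    # The metadata goes only to the WAL as a logical decoding message; it is
    # never materialized in a table, but it is tied to the transaction.
    conn = psycopg2.connect("dbname=app user=app")
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE customers SET email = %s WHERE id = %s",
                ("alice@example.com", 42),
            )
            cur.execute(
                "SELECT pg_logical_emit_message(true, 'audit', %s)",
                (json.dumps({"app_user": "alice", "use_case": "update-contact-details"}),),
            )
    conn.close()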

[1] https://debezium.io/blog/2019/02/19/reliable-microservices-d...


> kafka

But you already have Postgres; why add another thing to it?


A benefit of Kafka is that it is relatively trivial to spin up multiple consumers from a single CDC stream.

Say you have some user attribute table, and there are a couple of downstream consumers (teams!) that also want this data. You hook the table up to Kafka with Debezium and write all changes to a Kafka topic.

A team of data scientists can stream these straight into their model, a team of data engineers can dump the changes in real time into an analytics store for business intelligence, or another team that needs this data can build its own denormalized table of user attributes.

Once the producing team gets its data into Kafka, basically any team can consume it, which is a lot easier for the producing team to maintain and doesn't stress the database, so it's a very nice pattern for bigger orgs.
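Concretely, the fan-out is just every team reading the same topic with its own consumer group, so each keeps independent offsets and nobody touches the source database. A small sketch (topic and group names are made up):

    from kafka import KafkaConsumer

    # The analytics team reads the full change stream into their store...
    analytics = KafkaConsumer(
        "app.public.user_attributes",
        bootstrap_servers="localhost:9092",
        group_id="bi-analytics-loader",
        auto_offset_reset="earliest",
    )

    # ...while the ML team reads the very same topic independently.
    features = KafkaConsumer(
        "app.public.user_attributes",
        bootstrap_servers="localhost:9092",
        group_id="ml-feature-pipeline",
        auto_offset_reset="earliest",
    )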

For the most part, keeping the data in a single Postgres instance as long as possible would be my recommended solution. But there is a point where getting the data to its consumers becomes tedious enough that this additional infrastructure is worth it.


That is very cool.



