
Can I describe a data-queueing problem that I suspect has a specific data (or queue) structure for it, but whose name I don't know?

Let's say you are trying to "synchronize" a secondary data store with a primary data store. Changes in the primary data store are very "bursty": one row will not change for days, then it'll change 300 times in a minute. You are willing to trade a bit of latency (say 10 seconds) to reduce total message throughput. You don't care about capturing every change to the primary; you just want to keep the secondary within 10 seconds of it.

It feels like there should be a clever way to "debounce" an update when another update supersedes it 500ms later. I know debounce from the world of front-end UI, where you wait a moment before firing an autocomplete search on keyboard input so as not to overwhelm the search backend.
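One minimal sketch of that idea: keep only the latest value per row key and flush at most once per interval, so a 300-changes-in-a-minute burst collapses into a single message. The class name, the `flush` callback, and the injectable clock are all hypothetical, just for illustration.

```python
import time

class CoalescingBuffer:
    """Keeps only the latest update per key; flushes at most once per interval."""

    def __init__(self, flush, interval=10.0, clock=time.monotonic):
        self.flush = flush          # callable receiving {key: latest_value}
        self.interval = interval
        self.clock = clock          # injectable for testing
        self.pending = {}
        self.last_flush = clock()

    def update(self, key, value):
        # Later writes to the same key overwrite earlier ones,
        # so a burst of 300 changes becomes one pending entry.
        self.pending[key] = value
        self.maybe_flush()

    def maybe_flush(self):
        if self.pending and self.clock() - self.last_flush >= self.interval:
            batch, self.pending = self.pending, {}
            self.flush(batch)
            self.last_flush = self.clock()
```

In a real pipeline you would also call `maybe_flush` from a timer so a final update isn't stuck waiting for the next incoming event.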



There are a variety of solutions to this, but CRDTs are a very good one (https://en.wikipedia.org/wiki/Conflict-free_replicated_data_...). If the operations you're doing commute (that is, a ○ b = b ○ a, i.e. the order in which you apply the operations doesn't matter), then you can apply all the operations locally and send only the final result. Cassandra uses LWW-Element-Set CRDTs to solve this exact problem (https://cassandra.apache.org/doc/latest/cassandra/architectu...).
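The last-write-wins register mentioned above can be sketched in a few lines: model each update as a (timestamp, value) pair and merge with `max`, which is commutative and associative, so any number of bursty updates can be folded down to one result in any order. This is a toy illustration, not Cassandra's implementation.

```python
from functools import reduce

def lww_merge(a, b):
    """Last-writer-wins merge of (timestamp, value) pairs.

    max() compares the timestamp first, falling back to the value as a
    deterministic tiebreaker, so the merge is commutative and associative.
    """
    return max(a, b)

# A burst of updates folds down to one winner regardless of order.
updates = [(1, "x=1"), (3, "x=3"), (2, "x=2")]
winner = reduce(lww_merge, updates)
```

Because the merge commutes, the replica only needs to send `winner` downstream rather than every intermediate update.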


Ideally you have a reliable Change Data Capture (CDC) mechanism like a binlog reader. Debezium, for example, can write directly to a queue like Kafka, and a Kafka consumer picks up the events and writes to your secondary datastore. Something like that can probably handle all of your events without bucketing them, but if you want to cut down the number of messages written to the queue, you can add that logic to the binlog reader so it emits a burst every 5 seconds or so. During those 5 seconds it buffers the messages in the process's memory, or externally in something like Redis, keyed so that only the latest message is stored for a given record.
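The keyed buffering described above is essentially log compaction over a window. A minimal sketch, assuming events arrive as (primary_key, payload) pairs (a simplification of a real CDC event shape):

```python
def compact(events):
    """Collapse a burst of CDC events to the latest payload per primary key.

    Keys keep their first-seen order (Python dicts preserve insertion
    order), and later payloads for the same key overwrite earlier ones.
    """
    latest = {}
    for key, payload in events:
        latest[key] = payload
    return list(latest.items())
```

The same effect can be had for free downstream by keying Kafka messages on the record's primary key and using a compacted topic, at the cost of consumers still seeing the uncompacted burst until compaction runs.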


If you want everything to be within 10 seconds, then you build a state-change tracker (which tracks only the state changes since the last update) and then you send the updates every 10 seconds.

Don't worry about debouncing: the state tracker should handle representing the 300 updates as a single state change, and if there are more, they just get sent in the next update 10 seconds later.
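One way to read this suggestion is a "dirty set": track only *which* rows changed, and read the current row state from the primary at send time, so there is never more than one pending entry per row. A minimal sketch; `read_row` and `push` are hypothetical callables standing in for the primary read and the secondary write.

```python
class DirtySetTracker:
    """Tracks which rows changed since the last sync; on sync, reads the
    current state of each dirty row and pushes it to the secondary."""

    def __init__(self, read_row, push):
        self.read_row = read_row    # key -> current row state in the primary
        self.push = push            # (key, state) -> write to the secondary
        self.dirty = set()

    def mark(self, key):
        # 300 changes to one row = one entry in the set.
        self.dirty.add(key)

    def sync(self):
        # Call this from a 10-second timer.
        dirty, self.dirty = self.dirty, set()
        for key in sorted(dirty):
            self.push(key, self.read_row(key))
```

Compared with buffering the update payloads themselves, this keeps the tracker tiny (a set of keys) at the cost of an extra read against the primary at sync time.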


Sentry used Redis to buffer writes for a similar use case:

https://blog.sentry.io/2016/02/23/buffering-sql-writes-with-...


I've seen this done with Kinesis streams.

Basically, you just update a cache and forward the results every so often.

If you push updates onto a stack, the most recent one is always on top, so that's the one you forward. Compare timestamps before applying and you won't write a stale value.
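The timestamp-comparison guard can be sketched as a conditional write: only apply an incoming update if it is newer than what the cache already holds, so a late-arriving stale message never clobbers fresher data. Function and cache shape are hypothetical.

```python
def apply_if_newer(cache, key, ts, value):
    """Write (ts, value) into cache only if ts is newer than the stored
    timestamp for key. Returns True if the write happened."""
    current = cache.get(key)
    if current is None or ts > current[0]:
        cache[key] = (ts, value)
        return True
    return False  # stale update, dropped
```

This is the per-record half of the last-write-wins idea: even if the queue delivers updates out of order, the cache converges on the newest value.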





