Hacker News

(Author here)

Fair enough. I agree that 0.25^20 is basically infinitesimal, and even with a smaller exponent (like 0.25^3) the odds are not great, so I appreciate you calling this out.

Flipping this around, though: if you have 4 workers total and 3 are busy with jobs (1 idle), your next job has only a 25% chance of hitting the idle worker. This is what I see most in practice; there is a backlog, yet not all workers are busy.
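The 25% figure can be checked with a quick simulation, under the simplifying assumption that each job is routed to a uniformly random worker:

```python
import random

def chance_of_hitting_idle(num_workers=4, num_idle=1, trials=200_000):
    """Estimate the probability that a job routed to a uniformly
    random worker lands on one of the idle workers."""
    idle = set(range(num_idle))  # workers 0..num_idle-1 are idle
    hits = sum(1 for _ in range(trials)
               if random.randrange(num_workers) in idle)
    return hits / trials

print(round(chance_of_hitting_idle(), 2))  # ~0.25 with 4 workers, 1 idle
```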



With Kafka you normally don't pick a worker - Kafka does that. IIRC it uses some sort of consistent hashing, but for simplicity's sake let's say it's just modulo: messageID % numberOfShards.
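The simplified modulo assignment looks like this (a sketch only - the real Kafka client hashes the message key rather than using a raw ID, but the effect is the same: the same key always lands on the same shard):

```python
def shard_for(message_id: int, number_of_shards: int) -> int:
    """Simplified stand-in for Kafka's partitioner: in reality the
    client hashes the message key; modulo just illustrates the idea."""
    return message_id % number_of_shards

# The same message ID always maps to the same shard:
print(shard_for(42, 32))  # -> 10
```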

You control/configure numberOfShards, and it's usually set to something an order of magnitude bigger than your expected number of workers (to be precise, the number of Docker pods or hardware boxes/servers) - e.g. 32, 64 or 128.

So in practice Kafka assigns multiple shards to each of your "workers" (if you have more workers than shards, some workers don't do any work).
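The shard-to-worker spread can be sketched with a simple round-robin assignment (Kafka's actual assignor is configurable; this only illustrates the counts, including the "more workers than shards" case):

```python
def assign_shards(num_shards: int, workers: list[str]) -> dict[str, list[int]]:
    """Round-robin shards over workers; with more workers than
    shards, some workers end up with an empty list (no work)."""
    assignment = {w: [] for w in workers}
    for shard in range(num_shards):
        assignment[workers[shard % len(workers)]].append(shard)
    return assignment

# 32 shards over 3 workers: roughly 11/11/10 shards each
print({w: len(s) for w, s in assign_shards(32, ["w0", "w1", "w2"]).items()})
# 4 shards over 6 workers: two workers get nothing
print(assign_shards(4, [f"w{i}" for i in range(6)]))
```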

And while each of your workers is limited to one thread for consuming Kafka messages, each worker can still process multiple messages at the same time - in different threads/async tasks.


To me it seems like your underlying assumption is "1 worker can only work on one message/item at a time", right?

While you could also use Kafka like that - and it might even work for your use case, as long as you configure the option (sorry, I forgot the name) that makes Kafka redistribute shards when particular workers/consumers are too slow.
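I can't say for sure which option is meant above, but one real Kafka setting with this effect is max.poll.interval.ms: if a consumer doesn't call poll() within that interval, it's considered failed and its shards/partitions are reassigned to other consumers. In a confluent-kafka-style config dict (hostnames and group name here are hypothetical):

```python
# Hypothetical consumer config; max.poll.interval.ms is a real Kafka
# setting - exceed it between poll() calls and the group rebalances,
# moving this consumer's shards/partitions to other consumers.
consumer_config = {
    "bootstrap.servers": "localhost:9092",  # hypothetical broker
    "group.id": "my-workers",               # hypothetical group name
    "max.poll.interval.ms": 300_000,        # 5 minutes (the default)
}
print(consumer_config["max.poll.interval.ms"])
```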

AFAIK the usual way is for each worker to fetch more than one message/item at a time, and do the actual work in a separate thread/worker pool (or another async mechanism).

Kafka then keeps track of which messages were picked up by each worker/consumer, and how big the gap is between that and the committed offset (marked as done).
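The single-poll-thread-plus-worker-pool pattern can be sketched like this (a plain list of batches stands in for the Kafka consumer so it runs without a broker; with a real client you'd call poll() and commit() instead):

```python
from concurrent.futures import ThreadPoolExecutor

def run_consumer_loop(batches, handle, pool_size=4):
    """One 'poll' thread hands each batch to a thread pool, waits for
    the whole batch to finish, then 'commits' the last offset."""
    committed = -1
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        for batch in batches:  # each batch: list of (offset, payload)
            futures = [pool.submit(handle, payload) for _, payload in batch]
            for f in futures:
                f.result()  # surface exceptions before committing
            committed = batch[-1][0]  # commit only once the batch is done
    return committed

done = []
batches = [[(0, "a"), (1, "b")], [(2, "c")]]
print(run_consumer_loop(batches, done.append))  # -> 2, all payloads handled
```

Committing only after the whole batch finishes is what creates the "gap" Kafka tracks between picked-up and committed offsets.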

It gets a bit more tricky if you:

- can't afford to process some messages/work twice (at the extreme end this might actually be a show stopper for using Kafka)

- need automatic retry on error/failure - how quickly/slowly to retry, how many times to retry, etc.

- can't afford to temporarily "lose" some pending items (picked up from Kafka but offset not marked as done) to random events (worker OOMKilled, solar flare hitting a network cable, ...)

We've actually solved some of these by simply having another (set of) worker(s) that consumes the same topic with a delay (imagine a cron job that runs every 5 minutes), checking whether there's a record of the task being done and, if not, putting it into the same topic again for retry, etc.
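That delayed re-checker can be sketched as a function that scans the topic with a lag and re-enqueues anything not recorded as done (all names here are hypothetical; in production this would be a second consumer group plus some "done" store):

```python
def requeue_undone(topic, done_ids, now, delay=300):
    """Re-enqueue tasks older than `delay` seconds that were never
    marked done; returns the retry records that were appended."""
    retries = []
    for task in list(topic):  # snapshot, since we append while scanning
        if task["ts"] <= now - delay and task["id"] not in done_ids:
            retry = {**task, "ts": now}  # fresh timestamp for the retry
            topic.append(retry)
            retries.append(retry)
    return retries

topic = [{"id": 1, "ts": 0}, {"id": 2, "ts": 0}, {"id": 3, "ts": 250}]
print(requeue_undone(topic, done_ids={1}, now=300))
# task 2 is re-enqueued; task 1 is already done, task 3 is too recent
```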



