
I had a similar idea (except using Kafka): have all the nodes write to a Kafka cluster, used for buffering, and let some consumer write that data in batches into whatever database engine(s) you need for querying, with intermediate pre-processing steps wherever needed. This lets you trade latency for write buffering while not losing data, thanks to Kafka's durability guarantees.
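A minimal sketch of that consumer side, assuming kafka-python; the topic name, group id, and write_batch() sink are made up for illustration, not taken from the thread:

  from kafka import KafkaConsumer

  # Consume from the buffering topic; commit offsets only after a
  # successful batch write, so a crash replays the batch instead of
  # dropping it.
  consumer = KafkaConsumer(
      "ingest-events",                 # hypothetical topic name
      bootstrap_servers="kafka:9092",
      group_id="batch-writer",
      enable_auto_commit=False,
      auto_offset_reset="earliest",
  )

  def write_batch(rows):
      # placeholder: bulk insert / COPY into whatever query engine you use
      pass

  while True:
      # poll() returns {TopicPartition: [records]}
      records = consumer.poll(timeout_ms=1000, max_records=5000)
      batch = [r.value for partition in records.values() for r in partition]
      if batch:
          write_batch(batch)    # optional pre-processing goes here
          consumer.commit()     # mark the batch as durably handled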

What would you use for streaming directly to S3 in high volumes?



Yeah, Kafka would handle it, but in my experience I would like to avoid Kafka if possible, since it adds complexity. (Fair enough, it depends on how precious your data is and whether it is acceptable to lose some of it if a node crashes.)

But somehow they are ingesting the data over the network. Would writing files to S3 be slower than that? Otherwise you don't need much more than a RAM buffer?
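If a RAM buffer is acceptable (i.e. you can tolerate losing whatever is in memory when a node crashes), the write path can be as simple as accumulating records and flushing them to S3 as one object per batch once a size or age threshold is hit. A rough sketch with boto3; the bucket, key prefix, and thresholds are illustrative assumptions:

  import json
  import time
  import uuid
  import boto3

  s3 = boto3.client("s3")

  class S3Buffer:
      """Accumulate records in RAM, flush to S3 as one object per batch."""

      def __init__(self, bucket, prefix, max_records=10_000, max_age_s=60):
          self.bucket, self.prefix = bucket, prefix
          self.max_records, self.max_age_s = max_records, max_age_s
          self.records, self.started = [], time.monotonic()

      def add(self, record):
          self.records.append(record)
          if (len(self.records) >= self.max_records
                  or time.monotonic() - self.started >= self.max_age_s):
              self.flush()

      def flush(self):
          if not self.records:
              return
          body = "\n".join(json.dumps(r) for r in self.records)
          key = f"{self.prefix}/{int(time.time())}-{uuid.uuid4()}.jsonl"
          s3.put_object(Bucket=self.bucket, Key=key, Body=body.encode())
          self.records, self.started = [], time.monotonic()

  # usage: buf = S3Buffer("my-ingest-bucket", "raw/events")
  #        buf.add({"node": "a", "value": 42})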

Edit: to be clear, Kafka is probably the right choice here; it is just that Kafka and I are not a love story.

But it should be cheaper to store long-term data in S3 than to store it in Kafka, right?



