I read this quite often, but we run a relatively small kafka cluster on GCP and it's pretty much hassle-free. We also run some ad-hoc clusters in kubernetes from time to time which also works well.
What exactly have you found complex about running Kafka?
>What exactly have you found complex about running Kafka?
I run a small 2-node Kafka cluster that processes about 10 million messages/hr - not large at all - and it's been very stable for almost a year now. However, what was complex was:
* Setup. We wanted it managed by mesos/marathon, and figuring out BROKER_IDs took a couple of hours of trial and error.
* Operations. Adding queues (topics) and checking on consumers isn't amazing to do from the command line; see the sketch after this list.
* Monitoring. It took a while before I settled on a decent monitoring solution that could give insight into Kafka's own consumer paradigm. Even now there are more metrics I would like to have about our cluster that I haven't cared to put in the time to retrieve.
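For the operations side, newer versions of Sarama do expose an admin API, so topic creation and consumer-group checks can at least be scripted instead of done over the CLI. A rough sketch, assuming the ClusterAdmin API is available; the broker address, topic name, and group name are placeholders:

```go
package main

import (
	"fmt"
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V1_0_0_0 // admin requests need the broker version set

	admin, err := sarama.NewClusterAdmin([]string{"localhost:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer admin.Close()

	// "Adding a queue": create a topic with a chosen partition/replica layout.
	err = admin.CreateTopic("example-topic", &sarama.TopicDetail{
		NumPartitions:     6,
		ReplicationFactor: 2,
	}, false)
	if err != nil {
		log.Println("create topic:", err) // e.g. topic already exists
	}

	// "Checking on consumers": inspect a consumer group's state and members.
	groups, err := admin.DescribeConsumerGroups([]string{"example-group"})
	if err != nil {
		log.Fatal(err)
	}
	for _, g := range groups {
		fmt.Printf("group %s: state=%s, members=%d\n", g.GroupId, g.State, len(g.Members))
	}
}
```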
Another thing I found "complex" was the Java/Scala knowledge requirement. I wanted Kafka-like functionality for a Node.js project, but my limited Java and Scala knowledge made me concerned about my ability to deal with any problems I might run into.
In other words, I could probably get everything up and running (especially with the various Kafka-in-Docker projects I found), but what happens if (when) something goes wrong?
What do you mean by "Java/Scala knowledge requirement"? I don't know much C/C++ but I use postgres just fine. There is a bunch of stuff in the software ecosystem written in a bunch of languages, and if I had to know it all I wouldn't progress much.
Like I mentioned, our Kafka setup is relatively small - we moved from RabbitMQ to Kafka because of the sheer size (as in byte size) of the messages we needed to process (~10 million/hr): each message could be 512-1024KB, which caused RabbitMQ to blow up unpredictably.
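For messages in that size range, the producer's max message size has to be raised above Sarama's ~1MB default, and the broker has to allow the same size. A minimal sketch, assuming Sarama's SyncProducer; broker address and topic name are placeholders:

```go
package main

import (
	"bytes"
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	// Sarama's default producer cap is ~1 MB; raise it so 512-1024 KB payloads
	// (plus headers) fit. The broker's message.max.bytes and the topic's
	// max.message.bytes must allow the same size or the broker rejects the send.
	cfg.Producer.MaxMessageBytes = 2 * 1024 * 1024
	cfg.Producer.Return.Successes = true // required by SyncProducer

	producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	payload := bytes.Repeat([]byte("x"), 800*1024) // ~800 KB dummy message

	partition, offset, err := producer.SendMessage(&sarama.ProducerMessage{
		Topic: "large-payloads", // placeholder topic name
		Value: sarama.ByteEncoder(payload),
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("stored on partition %d at offset %d", partition, offset)
}
```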
Secondly, due to the difference in speed between the consumers and producers, we typically have an offset lag of around 10MM, and it's important for us to monitor this lag because if it gets too high it means we are falling behind (our consumers scale up and down through the day to mitigate this).
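Lag here is just the broker's newest offset minus the group's committed offset, summed over partitions. A rough Sarama sketch of that calculation, assuming offsets are committed to Kafka rather than ZooKeeper; broker, topic, and group names are placeholders:

```go
package main

import (
	"fmt"
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	client, err := sarama.NewClient([]string{"localhost:9092"}, sarama.NewConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	om, err := sarama.NewOffsetManagerFromClient("example-group", client)
	if err != nil {
		log.Fatal(err)
	}
	defer om.Close()

	topic := "example-topic"
	partitions, err := client.Partitions(topic)
	if err != nil {
		log.Fatal(err)
	}

	var total int64
	for _, p := range partitions {
		newest, err := client.GetOffset(topic, p, sarama.OffsetNewest) // log end offset
		if err != nil {
			log.Fatal(err)
		}
		pom, err := om.ManagePartition(topic, p)
		if err != nil {
			log.Fatal(err)
		}
		committed, _ := pom.NextOffset() // next offset the group would read
		pom.Close()
		if committed < 0 { // nothing committed yet for this partition
			fmt.Printf("partition %d: no committed offset\n", p)
			continue
		}
		lag := newest - committed
		total += lag
		fmt.Printf("partition %d lag %d\n", p, lag)
	}
	fmt.Println("total lag:", total)
}
```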
Next, we use Go, which is not officially supported by the project but has a library written by Shopify called Sarama. Sarama's consumer support had been in beta for a while, and in the past had caused some issues where not every partition of a topic was being consumed.
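One way around that at the time was to skip the group-balancing layer and drive the low-level partition-consumer API directly, which makes it obvious when a partition isn't being read. A minimal sketch of that approach (no consumer-group coordination; broker and topic names are placeholders):

```go
package main

import (
	"log"
	"sync"

	"github.com/Shopify/sarama"
)

func main() {
	consumer, err := sarama.NewConsumer([]string{"localhost:9092"}, sarama.NewConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	topic := "example-topic"
	partitions, err := consumer.Partitions(topic)
	if err != nil {
		log.Fatal(err)
	}

	var wg sync.WaitGroup
	for _, p := range partitions {
		// Explicitly start one PartitionConsumer per partition, so a
		// partition that can't be consumed fails loudly here.
		pc, err := consumer.ConsumePartition(topic, p, sarama.OffsetNewest)
		if err != nil {
			log.Fatalf("partition %d not consumable: %v", p, err)
		}
		wg.Add(1)
		go func(p int32, pc sarama.PartitionConsumer) {
			defer wg.Done()
			defer pc.Close()
			for msg := range pc.Messages() {
				log.Printf("partition %d offset %d: %d bytes", p, msg.Offset, len(msg.Value))
			}
		}(p, pc)
	}
	wg.Wait()
}
```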
Lastly, at the time we thought creating new topics would be a semi-regular event, and that we might have dozens of them (this didn't pan out), but having a simple overview of the health of all of our topics and consumers seemed worthwhile too.
We found Yahoo's Kafka Manager[1], which has ended up being really useful for us in standing up and managing the cluster without resorting to the command line. It's been great, but it wasn't exactly super obvious for me to find at the time.
Currently the only metrics I don't have are plottable things like processed/incoming msg/sec (per topic), lag over time, and disk usage. I'm sure these are now easily ingested into Grafana; I just haven't had the time to do it.
All of this information is great to have, but it requires some setup, tuning, and elbow grease that probably comes batteries-included with a managed service. At the same time, however, it's something you get almost out of the box with RabbitMQ's management plugin.
Yes, I do agree with these (except mesos is not a requirement for us). Is any of this significantly better with hosted Kafka or Kinesis, though? I have no experience with either.
Topic sharding. The messages were pretty large, and at the time we set this up we were on a DO-like platform where the only way to get more disk space was to buy a larger instance. We didn't need the extra CPU power, but we did need the extra disk space, and it was cheaper to opt for two nodes instead of upgrading to n+2.
Running Kafka is just fine; the issues arise when a node fails, or when you need to add capacity and re-partition a topic. It's not that hard once you know what to do, but Kinesis is simpler - it's just expensive as shit.
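One part of "re-partition a topic", growing the partition count, can at least be done programmatically. A rough sketch with Sarama's ClusterAdmin, assuming a newer Sarama version; topic name, counts, and broker address are made up:

```go
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V1_0_0_0 // admin requests need the broker version set

	admin, err := sarama.NewClusterAdmin([]string{"localhost:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer admin.Close()

	// Grow "example-topic" to 12 partitions. Kafka can only ever increase a
	// topic's partition count, and existing data is not re-shuffled; a nil
	// assignment lets the broker place the new partitions itself.
	if err := admin.CreatePartitions("example-topic", 12, nil, false); err != nil {
		log.Fatal(err)
	}
}
```

Moving existing replicas onto new brokers is a different operation again (partition reassignment), which is where most of the manual work tends to be.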
At small scale Kinesis is far less expensive. There is definitely a point where Kinesis becomes more expensive, especially if you consider the operational and human costs involved.