> As luck would have it, the Arrow spec was also finally making progress with adding the long anticipated German Style string types to the specification. Which, spoiler alert, is the type we implemented.
Congratulations to the team, Pydantic is an amazing library.
If you find JSON serialization/deserialization a bottleneck, another interesting library (with far fewer features) for Python is msgspec: https://github.com/jcrist/msgspec
Are there any features you've needed that turned out to be missing from msgspec?
One of the design goals for msgspec (besides much higher performance) was simpler usage. Fewer concepts to wrap your head around, fewer config options to learn about. I personally find pydantic's kitchen sink approach means sometimes I have a hard time understanding what a model will do with a given json structure. IMO the serialization/validation part of your code shouldn't be the most complicated part.
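For a sense of what that smaller surface area looks like, here's a minimal sketch (the `User` struct and its fields are made up for illustration):

```python
import msgspec

# Declare the expected shape; unknown fields are ignored by default.
class User(msgspec.Struct):
    name: str
    age: int
    email: str | None = None

# Decode and validate in one step; raises msgspec.ValidationError on bad input.
user = msgspec.json.decode(b'{"name": "alice", "age": 31}', type=User)

# Encode back to JSON bytes.
payload = msgspec.json.encode(user)
```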
The biggest feature missing from most conversion and validation libraries is generating models from JSON Schema. JSON Schema is ideal as a central, platform-agnostic single source of truth for data structures.
In my use case, I find the lack of features in msgspec more freeing in the long run. Pydantic is good for prototyping, but with msgspec I can build nimble domain-specific interfaces with fast serialisation/deserialisation without having to fight the library. YMMV!
I don't know what issues Yelp was facing; I have upgraded my past Kafka clusters several times and never really experienced any issues. Normally the upgrade instructions are documented (e.g. https://kafka.apache.org/31/documentation.html#upgrade) and a regular rolling upgrade comes with no downtime.
Besides this, operating Kafka never required much effort, apart from when we needed to rebalance partitions across brokers. Earlier versions of Kafka required handling that with external tools to avoid network congestion, but I think that is a thing of the past now.
On the other hand, Kafka still needs to be used carefully: in particular, you need to plan topics/partitions/replication/retention, but that really depends on the application's needs.
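To make the "plan it up front" point concrete, here is a rough sketch using confluent-kafka's admin client (the topic name, counts, and retention value are placeholders, not recommendations):

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "broker1:9092"})  # placeholder broker

# Partitions, replication factor and retention are the knobs worth deciding
# up front, based on expected throughput, durability and storage needs.
topic = NewTopic(
    "orders",                       # placeholder topic name
    num_partitions=12,
    replication_factor=3,
    config={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},  # 7 days
)

# create_topics() returns a dict of {topic: future}; result() raises on failure.
for name, future in admin.create_topics([topic]).items():
    future.result()
```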
I used to work on a project where every other rolling upgrade (of Amazon's managed Kafka offering) would crash all the Streams threads in our Tomcat-based production app, causing downtime because we had to restart.
The crash happens 10–15 minutes into the downtime window of the first broker. Absolutely no one has been able to figure out why, or even to reproduce the issue.
Running out of things to try, we resorted to randomly changing all sorts of combinations of consumer group timeouts, which are imho poorly documented, so no one really understands which means what anyway (a rough rundown of what each setting governs is sketched below). Of course, all that tweaking didn't help either (shotgun debugging never does).
This has been going on for the last two years. As far as I know, the issue still persists. Everyone in that project is dreading Amazon’s monthly patch event.
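For reference, the consumer-group timeouts in question are roughly these; the semantics follow the standard Kafka consumer settings, and the values below are placeholders, not tuning advice:

```python
# What the main consumer-group timing settings actually govern
# (values are placeholders).
consumer_conf = {
    "bootstrap.servers": "broker1:9092",
    "group.id": "my-app",

    # How long the group coordinator waits without receiving a heartbeat
    # before it declares this consumer dead and triggers a rebalance.
    "session.timeout.ms": 45_000,

    # How often the consumer sends heartbeats; keep it well below
    # (roughly a third of) session.timeout.ms.
    "heartbeat.interval.ms": 3_000,

    # Maximum allowed gap between poll() calls; exceed it and the consumer
    # is kicked out of the group even though heartbeats are still flowing.
    "max.poll.interval.ms": 300_000,
}
```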
Check the errors coming back on your poll/commit. The kafka stack should tell you when you can retry items. If it is in the middle of something it does not always fail nicely, but you can retry and it is usually fine.

Usually I see that sort of behavior if the whole cluster just 'goes away' (reboots, upgrades, etc). It will yeet out a network error and then just stop doing anything. You have to watch for it and recreate your kafka object (sometimes; sometimes retry is fine). If they are bouncing the whole cluster on you, each broker can take a decent amount of time before it is alive again. So if you have 3 and they restart all 3 in quick succession, you will see some nasty behavior out of the kafka stack.

You can fiddle with your retries and timeouts. However, if those are lower than the time it takes for the cluster to come back, you can end up with what looks like a busted kafka stream. I have seen it take anywhere from 3-10 mins for a single broker to restart (other times it is like 10 seconds). So depending on the upgrade/patch script, that can be a decent outage. It goes really sideways if the cluster has a lot of volume to replicate between topics on each broker (your replication factor).
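As a rough illustration of the "watch for it and recreate your kafka object" pattern, here's a sketch with confluent-kafka (the error-counting threshold, sleep, topic, and config values are arbitrary placeholders):

```python
import time
from confluent_kafka import Consumer

def make_consumer() -> Consumer:
    c = Consumer({
        "bootstrap.servers": "broker1:9092",   # placeholder
        "group.id": "my-app",
        "auto.offset.reset": "earliest",
    })
    c.subscribe(["events"])                    # placeholder topic
    return c

def handle(payload: bytes) -> None:
    """Placeholder for whatever your app does with a record."""
    print(payload)

consumer = make_consumer()
errors_in_a_row = 0

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue

    err = msg.error()
    if err is not None:
        errors_in_a_row += 1
        if err.retriable() and errors_in_a_row < 20:
            continue  # transient broker hiccup: just poll again
        # Persistent errors (e.g. every broker restarting at once): tear the
        # client down, give the cluster time to come back, and rebuild it.
        consumer.close()
        time.sleep(60)
        consumer = make_consumer()
        errors_in_a_row = 0
        continue

    errors_in_a_row = 0
    handle(msg.value())
    consumer.commit(msg)
```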
One thing that is underestimated is keeping tool versions in sync between your app's dev dependencies and pre-commit. This also includes plugins for specific tools (for instance flake8 plugins). A solution would be to define the hooks in pre-commit so they run the tools from inside your venv.
About typing: I agree the ecosystem is not mature enough, especially for some frameworks such as Django, but the effort is still valuable, and in many cases the static analysis provided by mypy is more useful than not using it at all. So I would suggest trying your best to make it work.
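As a tiny, invented example of the kind of bug static analysis catches even with partial type coverage (nothing Django-specific here):

```python
from typing import Optional

def find_username(user_id: int) -> Optional[str]:
    # Imagine a DB lookup that may find nothing.
    return None if user_id < 0 else f"user-{user_id}"

def greet(user_id: int) -> str:
    name = find_username(user_id)
    # mypy flags this before it blows up at runtime:
    #   Item "None" of "Optional[str]" has no attribute "upper"
    return "Hello " + name.upper()
```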
We are currently using Kafka Manager (https://github.com/yahoo/kafka-manager). It seems KafkaHQ has the same features (it is not clear from the docs whether you can actually manage the partition arrangement or only view it), plus the ability to view the actual messages streaming through Kafka, which might be very useful.
Viewing messages inside Kafka was my main goal when building KafkaHQ!
Unfortunately, you can't reassign partitions for now, but I may be able to add the feature when this one is ready: https://cwiki.apache.org/confluence/display/KAFKA/KIP-455%3A... (targeted for Kafka 2.4, but I don't think that has been released yet).
You can only set the number of partitions when you create a topic in KafkaHQ, not after the fact. You can view messages, even if they are AVRO encoded and backed by a schema registry.
Much of the effort for the next release of Druid (0.9) has been put into docs. Even with that, you still need to handle a complex cluster made of several services: ZooKeeper, the Druid overlord, middlemanager, broker, etc. Still, it sounds more complicated than it actually is. What was the alternative solution adopted by your company?
From the article:
> As luck would have it, the Arrow spec was also finally making progress with adding the long anticipated German Style string types to the specification. Which, spoiler alert, is the type we implemented.