> In the fall of 2016, we were dealing with hundreds of thousands of MySQL queries per second and thousands of sharded MySQL hosts in production.

> Today, we serve 2.3 million QPS at peak. 2M of those queries are reads and 300K are writes.

I think the "today" QPS numbers are still doable with a properly tuned single-writer Galera cluster running on machines with TBs of memory. Of course, with Slack's workload, there would be too much historical data to fit on a single host, so I can see the reasons to shard across multiple clusters/hosts.

Still, the numbers seem a little off. Let's say back in fall 2016 there were already 200K write QPS at peak, with 200 sharded hosts accepting writes. That's just 1K write QPS at peak per host on average, and let's say 20K write QPS at peak for a particularly hot shard. What could be the bottleneck? Replication lag? Data size? I don't think any of the articles from Slack have talked about this.
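To make the back-of-envelope math explicit, here is a quick sketch. All the inputs are my own guesses from above (200K peak write QPS, 200 writer hosts, a 20K QPS hot shard), not figures Slack has published, and the function name is just for illustration.

```python
def per_host_write_qps(total_write_qps: int, writer_hosts: int) -> float:
    """Average write QPS per host, assuming writes spread evenly across shards."""
    return total_write_qps / writer_hosts


# Assumed inputs (guesses, not published numbers).
peak_write_qps = 200_000   # hypothetical peak write QPS in fall 2016
writer_hosts = 200         # hypothetical number of hosts accepting writes
hot_shard_qps = 20_000     # hypothetical peak write QPS on one hot shard

avg = per_host_write_qps(peak_write_qps, writer_hosts)
hot_to_avg = hot_shard_qps / avg  # how much hotter the hot shard is than average

print(f"average per host: {avg:.0f} write QPS")
print(f"hot shard is {hot_to_avg:.0f}x the average")
```

Even with a hot shard at 20x the average, the per-host load under these assumptions is modest, which is why the bottleneck question matters.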

What Vitess provides is invaluable, especially the very solid implementation of secondary indexes. But sometimes I feel like it is being used/advocated as a sledgehammer ("just keep sharding"), at the cost of a much larger cloud bill, without first looking at what could be done better at the lower MySQL/InnoDB level.