> Even strategies like parallel rsyncs had their limits.
They don't really go into detail as to what limitations they hit by pushing code to servers instead of pulling. Does anyone have any ideas as to what those might be? I can't think of any bottlenecks that wouldn't apply in both directions, and pushing is much simpler in my experience, but I've also never been involved with deployments at this scale.
I can't speak for Slack, but it's not unreasonable to believe that a single machine's available egress bandwidth (~10-40 Gbps) gets saturated when pushing a ~GB artifact to hundreds of machines. Pushing the package to S3 and fetching it back down spreads the load over more machines and over different network paths (e.g., in other data centers).
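For a rough sense of the arithmetic (all numbers below are illustrative assumptions, not Slack's actual figures):

```python
# Back-of-envelope: one host pushing a 1 GB artifact vs. the fleet pulling from S3.
ARTIFACT_GB = 1.0      # assumed deploy artifact size
HOSTS = 300            # assumed number of machines receiving the deploy
EGRESS_GBPS = 10.0     # assumed NIC bandwidth of the pushing host

# A single pusher must send HOSTS full copies through one NIC.
push_seconds = (ARTIFACT_GB * 8 * HOSTS) / EGRESS_GBPS
print(f"push from one host: {push_seconds:.0f}s")   # ~240s

# Pulling from S3, each host's own download link is the bottleneck instead,
# so (ignoring S3-side limits) the fleet fetches roughly in parallel.
pull_seconds = (ARTIFACT_GB * 8) / EGRESS_GBPS
print(f"parallel pull:      {pull_seconds:.1f}s per host")   # ~0.8s
```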
We do it similarly, except we push an image to a Docker registry (backed by multi-region S3), then use e.g. Ansible to pull it to 5, 10, 25, then 100% of the machines. It "feels" like push, except that you're staging the artifact somewhere, and when a new host boots it fetches the image from the same place. A rough sketch of what that staged pull looks like is below.
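A minimal sketch of the percentage-staged rollout, assuming a `hosts` list and made-up stage percentages (in practice the inner loop would be an Ansible task running `docker pull` on each host):

```python
import math

def stages(hosts, percentages=(5, 10, 25, 100)):
    """Yield cumulative batches: each stage brings coverage up to the given % of the fleet."""
    done = 0
    for pct in percentages:
        target = math.ceil(len(hosts) * pct / 100)
        batch = hosts[done:target]
        done = target
        yield pct, batch

hosts = [f"web-{i:03d}" for i in range(200)]
for pct, batch in stages(hosts):
    for host in batch:
        print(f"[{pct:>3}%] pulling image on {host}")
    # ...verify health/error metrics here before continuing to the next stage.
```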
Considering that, in the example with the errors, they aren't taking machines out of rotation or draining connections, I assume deploying to more than 10 machines at a time either produces too many errors or leaves two versions of the code running for too long, and that wherever they pull from doesn't scale. All of those problems can be solved easily enough, though.
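Something like this sketch, say. Every name and threshold here is invented for illustration; nothing in the article specifies this design:

```python
import time

BATCH_SIZE = 10
ERROR_THRESHOLD = 0.01          # halt if >1% of requests error after a batch

# Stubs standing in for real load-balancer / deploy / metrics integrations.
def remove_from_rotation(host): print(f"LB: removing {host}")
def drain(host):                print(f"waiting for in-flight requests on {host}")
def deploy(host, version):      print(f"deploying {version} to {host}")
def add_to_rotation(host):      print(f"LB: re-adding {host}")
def error_rate():               return 0.0   # would query real metrics

def rolling_deploy(hosts, version):
    for i in range(0, len(hosts), BATCH_SIZE):
        batch = hosts[i:i + BATCH_SIZE]
        for host in batch:
            remove_from_rotation(host)      # stop routing new traffic to it
            drain(host)                     # let in-flight requests finish
            deploy(host, version)
            add_to_rotation(host)
        time.sleep(60)                      # let error metrics settle
        if error_rate() > ERROR_THRESHOLD:
            raise RuntimeError(f"error rate spiked after batch {batch}; halting")

rolling_deploy([f"web-{i:02d}" for i in range(30)], "v2.1.0")
```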