> Even strategies like parallel rsyncs had their limits.
They don't really go into detail as to what limitations they hit by pushing code to servers instead of pulling. Does anyone have any ideas as to what those might be? I can't think of any bottlenecks that wouldn't apply in both directions, and pushing is much simpler in my experience, but I've also never been involved with deployments at this scale.
I can't speak for Slack, but it's not unreasonable to believe that a single machine's available egress bandwidth (~10-40 Gbps) gets saturated when pushing a ~GB artifact to hundreds of machines. Pushing the package to S3 and fetching it back down spreads the load over more machines and over different network paths (e.g., in other data centers).
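For a rough sense of the arithmetic (all numbers below are illustrative assumptions, not Slack's actual figures):

```python
# Back-of-envelope: one host pushing a 1 GB artifact vs. the fleet pulling from S3.
ARTIFACT_GB = 1.0      # assumed deploy artifact size
HOSTS = 300            # assumed number of machines receiving the deploy
EGRESS_GBPS = 10.0     # assumed NIC bandwidth of the pushing host

# A single pusher must send HOSTS full copies through one NIC.
push_seconds = (ARTIFACT_GB * 8 * HOSTS) / EGRESS_GBPS
print(f"push from one host: {push_seconds:.0f}s")   # ~240s

# Pulling from S3, each host's own download link is the bottleneck instead,
# so (ignoring S3-side limits) the fleet fetches roughly in parallel.
pull_seconds = (ARTIFACT_GB * 8) / EGRESS_GBPS
print(f"parallel pull:      {pull_seconds:.1f}s per host")   # ~0.8s
```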
We do it similarly, except we push an image to a Docker registry (backed by multi-region S3), then use e.g. Ansible to pull it to 5, 10, 25, then 100% of the machines. It "feels" like push, except that you're staging the artifact somewhere, and when a new host boots it fetches the image from the same place. A rough sketch of what that staged pull looks like is below.
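A minimal sketch of the percentage-staged rollout, assuming a `hosts` list and made-up stage percentages (in practice the inner loop would be an Ansible task running `docker pull` on each host):

```python
import math

def stages(hosts, percentages=(5, 10, 25, 100)):
    """Yield cumulative batches: each stage brings coverage up to the given % of the fleet."""
    done = 0
    for pct in percentages:
        target = math.ceil(len(hosts) * pct / 100)
        batch = hosts[done:target]
        done = target
        yield pct, batch

hosts = [f"web-{i:03d}" for i in range(200)]
for pct, batch in stages(hosts):
    for host in batch:
        print(f"[{pct:>3}%] pulling image on {host}")
    # ...verify health/error metrics here before continuing to the next stage.
```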
Considering that, in the example with the errors, they aren't taking machines out of rotation or draining connections, I assume deploying to more than 10 machines at a time either produces too many errors or leaves two versions of the code running for too long, and that wherever they pull from doesn't scale. All of those problems can be solved easily enough, though.
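Something like this sketch, say. Every name and threshold here is invented for illustration; nothing in the article specifies this design:

```python
import time

BATCH_SIZE = 10
ERROR_THRESHOLD = 0.01          # halt if >1% of requests error after a batch

# Stubs standing in for real load-balancer / deploy / metrics integrations.
def remove_from_rotation(host): print(f"LB: removing {host}")
def drain(host):                print(f"waiting for in-flight requests on {host}")
def deploy(host, version):      print(f"deploying {version} to {host}")
def add_to_rotation(host):      print(f"LB: re-adding {host}")
def error_rate():               return 0.0   # would query real metrics

def rolling_deploy(hosts, version):
    for i in range(0, len(hosts), BATCH_SIZE):
        batch = hosts[i:i + BATCH_SIZE]
        for host in batch:
            remove_from_rotation(host)      # stop routing new traffic to it
            drain(host)                     # let in-flight requests finish
            deploy(host, version)
            add_to_rotation(host)
        time.sleep(60)                      # let error metrics settle
        if error_rate() > ERROR_THRESHOLD:
            raise RuntimeError(f"error rate spiked after batch {batch}; halting")

rolling_deploy([f"web-{i:02d}" for i in range(30)], "v2.1.0")
```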