This was actually pretty fascinating to me. On one hand, I am astonished at how long it takes to perform seemingly trivial git operations on repositories at this scale. On the other hand, I'm utterly mystified that a company like Facebook has such monolithic repositories. Even back when I was using SVN a lot, I relied on externals and such to break up large projects into their smaller service-level components.
I'd be very interested to see some benchmarks on their current VCS solution for repositories of this scale.
From a followup post: “We already have some of the easily separable projects in separate repositories, like HPHP. If we could split our largest repos into multiple ones, that would help the scaling issue. However, the code in those repos is rather interdependent and we believe it’d hurt more than help to split it up, at least for the medium-term future. We derive a fair amount of benefit from the code sharing and keeping things together in a single repo, so it’s not clear when it’d make sense to get more aggressive splitting things up.”
He notes that these repositories are somewhat broken up already, and wants to keep them together.
There are good reasons to keep code in one repository. In particular, git's submodule support has a number of nasty interface tradeoffs. I wouldn't say it breaks git, but you have to keep a clear understanding of all your submodules in your head when you have a lot of them.
OK, it pretty much breaks git to have submodules that are interdependent. I know this because I am currently moving one of my organizations off this exact plan -- it's the opposite of useful and speedy to have to worry about versions across a large number of backend / frontend repositories.
It is MUCH easier and therefore better for developers to put them together, and release together.
"We can build a binary that is more than 1GB (after stripping debug information) in about 15 min, with the help of distcc. Although faster compilation does not directly contribute to run-time efficiency, it helps make the deployment process better."
Not sure if it's still the case, but Google hosts all their internal source code on a modified version of Perforce, so they essentially have everything in one repo.
What do Facebook and the National Institutes of Health have in common? I'm pretty sure this will end with Facebook building their own versioning system from scratch and giving it some kitschy name like "Retro".