Common Crawl has a really neat mission: there isn't much free and open crawl data out in the world right now, and they're trying to change that. With this donation it looks like their commons will be augmented with some great material, and that can only be a good thing.
Yes, that was just for the crawl sample, which came to approximately 100M of data; you can specify as much as you'd like.
The cool thing about running this job inside Elastic MapReduce is that you can read the S3 data for free; accessing it from outside AWS incurs transfer costs, but both options are pretty reasonable. Right now you can analyze the entire dataset for around $150, and with a good enough algorithm you'll get a lot of useful information back.
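To make the "good enough algorithm" point concrete, here's a minimal local sketch of the kind of MapReduce job you might run over the crawl on Elastic MapReduce, counting pages per domain. The record format and sample data are hypothetical stand-ins for real crawl records pulled from S3, not the actual Common Crawl layout.

```python
# Toy map/reduce over hypothetical crawl records: count pages per domain.
from collections import defaultdict
from urllib.parse import urlparse

def map_record(record):
    # Map phase: emit (domain, 1) for each crawled URL.
    yield urlparse(record["url"]).netloc, 1

def reduce_counts(pairs):
    # Reduce phase: sum the counts for each domain.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Stand-in records; a real job would stream these from the S3 dataset.
sample_records = [
    {"url": "http://example.com/a"},
    {"url": "http://example.com/b"},
    {"url": "http://example.org/"},
]

pairs = [kv for rec in sample_records for kv in map_record(rec)]
print(reduce_counts(pairs))  # per-domain page counts
```

On EMR the map and reduce functions would run distributed across the cluster, which is what makes whole-dataset passes affordable.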
We're working to index this information so it can be processed even more inexpensively; stay tuned for more updates!