
I love Common Crawl, but as I commented before, I still want to see a subset available for download, something like the top million sites. Ideally a few tiers of data, say 50 GB, 100 GB, and 200 GB.

I really think a subset like this would increase the value, as it would allow people writing search engines (for fun or profit) to pull a copy down locally and work away. It's something I would like to do for sure.




There will be news about a subset sometime next month!


Ideally, beyond just the top sites, these subsets would also be available as verticals, so that people can focus on specialized search engines.

While it's nice to have generalist search engines, it would be even better to be able to unbundle the generalist search engines completely. Verticals such as the following would be nice:

1) Everything Linux, Unix, or both

2) Everything open-source

3) Only news & current events

4) Popular culture globally and by country

5) Politics globally and by country

6) Everything software engineering

7) Everything hardware engineering

8) Everything maker community

9) Everything financial markets

10) Everything medicine / health (sans obvious quackery)

11) etc.

Maybe build a tool that lets the community write subset-creation recipes that parse out data of a certain type, recipes the community can then fork and improve over time.

The ship for creating a generalist search engine has sailed, but specialist search engines are total greenfield.


You don't usually download this data - you process it on AWS to your requirements.

Seriously - they give you an easy way to create these subsets yourself[1]. That is a much better solution than them trying to anticipate the exact needs of every potential client.

[1] http://commoncrawl.org/get-started/
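
As a rough illustration of that, here is a minimal sketch of building your own subset from a single downloaded WARC file, assuming the Python warcio library; the file path and the keyword filter are just placeholders, not anything Common Crawl prescribes:

  from warcio.archiveiterator import ArchiveIterator  # assumed third-party library

  # Hypothetical local copy of one Common Crawl WARC file.
  warc_path = "CC-MAIN-20130516113952-00058-ip-10-60-113-184.ec2.internal.warc.gz"

  with open(warc_path, "rb") as stream:
      for record in ArchiveIterator(stream):
          if record.rec_type != "response":
              continue
          url = record.rec_headers.get_header("WARC-Target-URI")
          body = record.content_stream().read()
          # Toy filter: keep only pages that look relevant to your vertical.
          if b"linux" in body.lower():
              print(url)

Scale that same loop out over EMR or EC2 and you have your subset without waiting for anyone to publish it for you.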


I guess what I was suggesting is "given enough eyeballs, all spam and poor quality content is shallow"

There is definitely a benefit in using the community to identify valuable subsets and then individually putting your energy towards building discovery/search products around that subset.


Is the example code still right with the new file formats for this new crawl?


Would love to have even smaller subsets (like 5 GB) that students can casually play around with to practice and learn tools and algorithms :) (if it's not too much trouble!)


You can fetch a single WARC file directly, for example:

  s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368704392896/warc/CC-MAIN-20130516113952-00058-ip-10-60-113-184.ec2.internal.warc.gz

They are around 850 MB each.

The text extracts and metadata files are generated off individual WARC files, so it is pretty easy to get the corresponding sets of files. For the above it would be:

  s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368704392896/wat/CC-MAIN-20130516113952-00058-ip-10-60-113-184.ec2.internal.warc.wat.gz

  s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368704392896/wet/CC-MAIN-20130516113952-00058-ip-10-60-113-184.ec2.internal.warc.wet.gz
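
A rough sketch of grabbing such a corresponding set of files with Python and boto3, deriving the WAT/WET keys from the WARC key's naming pattern shown above (anonymous access is assumed here and may depend on the bucket's policy):

  import boto3
  from botocore import UNSIGNED
  from botocore.config import Config

  bucket = "aws-publicdatasets"
  warc_key = ("common-crawl/crawl-data/CC-MAIN-2013-20/segments/1368704392896/"
              "warc/CC-MAIN-20130516113952-00058-ip-10-60-113-184.ec2.internal.warc.gz")

  # The WAT/WET keys follow the WARC key's naming pattern, per the paths above.
  wat_key = warc_key.replace("/warc/", "/wat/").replace(".warc.gz", ".warc.wat.gz")
  wet_key = warc_key.replace("/warc/", "/wet/").replace(".warc.gz", ".warc.wet.gz")

  s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))  # anonymous access
  for key in (warc_key, wat_key, wet_key):
      s3.download_file(bucket, key, key.rsplit("/", 1)[-1])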


Is there any way to get incrementals? It would be extremely valuable to get the pages that were added/changed/deleted each day. Some kind of daily feed of a more limited size.


  s3cmd ls s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/segments/
That should get you about 90% of the way there.


Totally true. Smaller versions aren't helpful just for casual/student use; they also help in code development and debugging. Otherwise, algorithm development gets impeded by scaling issues.


Also interesting for machine learning, where you can use it as a background collection.


One subset for each TLD would be nice. Or, if you can afford more CPU power, one per language, using a good open-source language detector.
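
A toy sketch of what that split might look like, assuming the Python tldextract and langdetect packages as the TLD parser and language detector (both third-party choices, not anything Common Crawl ships):

  from collections import defaultdict

  import tldextract              # assumed third-party TLD parser
  from langdetect import detect  # assumed third-party language detector

  def bucket_pages(pages):
      """Group (url, text) pairs into per-TLD and per-language buckets."""
      by_tld = defaultdict(list)
      by_lang = defaultdict(list)
      for url, text in pages:
          by_tld[tldextract.extract(url).suffix].append(url)
          try:
              by_lang[detect(text)].append(url)
          except Exception:
              pass  # detection can fail on very short or non-text content
      return by_tld, by_lang

The TLD split is essentially free since it only needs the URL; the language split is the part that costs CPU, since it has to look at the page text.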


Fantastic news. Will be looking forward to seeing it.



