get.theinfo is the best place to find data sets. It's a group of data hoarders who can help you: http://groups.google.com/group/get-theinfo/?pli=1

I always ask there if I can't find what I'm looking for.

Here are more sources of data sets. These are general collections. Email me if you have a specific kind of data set in mind (e.g. web-as-corpus, spam, images, social, reviews). I have a big file of information.

    http://theinfo.org/
    http://infochimps.org/datasets
    http://ckan.org [Comprehensive Knowledge Archive Network]
    http://www.datawrangling.com/some-datasets-available-on-the-web.html
    http://del.icio.us/pskomoroch/dataset
    http://www.reddit.com/r/datasets/
    http://news.ycombinator.com/item?id=1242029
    http://www.reddit.com/r/opendata
    http://www.trustlet.org/wiki/Repositories_of_datasets
    http://www.daniel-lemire.com/blog/data-for-data-mining/
    http://www.quantlet.org/mdbase/
    http://datamob.org/
    http://freebase.com/
    http://infochimp.info/ics/data/ripd/www-personal.umich.edu/~mejn/netdata/
    http://www.archive-it.org/public/all_collections

Large:

    http://www.ckan.net/tag/read/size-large
    http://www.diggingintodata.org/Repositories/tabid/167/Default.aspx

Web as corpus:

    Good instructions:
        http://corpus.leeds.ac.uk/internet.html#description
    http://sslmit.unibo.it/~baroni/bootcat.html
    http://www.drni.de/wac-tk/index.php/Documentation
    http://cleaneval.sigwac.org.uk/
    http://liste.sslmit.unibo.it/pipermail/sigwac/2007-November/...
    http://wacky.sslmit.unibo.it/doku.php?id=
    http://clic.cimec.unitn.it/marco/research.html

Email me if you need more.

