Hey,
Does anyone have any advice on how to download Yahoo Answers without angering anyone? Is this dataset published anywhere?
I'm putzing around with some statistical methods to automatically predict which answers will be highly ranked on question/answer sites like Yahoo Answers, stackoverflow.com, etc.
So far I have trained my methods using the published data dumps of stackoverflow.com. The results are interesting/encouraging, and I'd like to work with a softer dataset like Yahoo Answers, where the questions are less technical.
(Incidentally, this method gives interesting results for predicting the points of comment threads on HN; however, I refuse to release it without first making a nice interface for people to browse.)
Here are a couple of ideas that might help (not that I condone violating a site's TOS).
Have a random time interval between each page download so someone looking at the logs doesn't see a regular pattern.
Pick 4 or 5 common user agent strings to alternate randomly between.
Go through free proxies, or package up your program and have 5 or more friends run it on different sections of the data you want to download.
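For what it's worth, the first two ideas could be sketched roughly like this in Python. Treat it as a rough sketch rather than a tested crawler: the names plan_requests and fetch_all, the 2 to 10 second delay range, and the user agent strings are all made-up placeholders, and it doesn't touch the proxy idea at all.

```python
import random
import time
import urllib.request

# Hypothetical placeholder strings; substitute real browser user agents.
USER_AGENTS = [
    "ExampleBrowser/1.0",
    "ExampleBrowser/2.0",
    "OtherBrowser/5.1",
    "OtherBrowser/6.0",
]


def plan_requests(urls, min_delay=2.0, max_delay=10.0, rng=None):
    """Pair each URL with a random pause and a randomly chosen user agent."""
    rng = rng or random.Random()
    plan = []
    for url in urls:
        # Uniform random pause so the gaps between requests are irregular.
        delay = rng.uniform(min_delay, max_delay)
        agent = rng.choice(USER_AGENTS)
        plan.append((url, delay, agent))
    return plan


def fetch_all(urls, **kwargs):
    """Download each URL in turn, sleeping for the planned pause first."""
    pages = {}
    for url, delay, agent in plan_requests(urls, **kwargs):
        time.sleep(delay)
        request = urllib.request.Request(url, headers={"User-Agent": agent})
        with urllib.request.urlopen(request) as response:
            pages[url] = response.read()
    return pages
```

Keeping the planning step separate from the downloading means you can inspect the delays and agents before anything hits the network. The delay bounds are arbitrary, so pick whatever seems reasonable for the site.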