Ask YC: Any ideas about intelligent crawlers :)
19 points by franklymydear on Jan 8, 2008 | 30 comments
Hi everyone. Not sure if I should be posting this to a forum or not. Curious to see what people's answers are here, as I read this site a lot.

I'm thinking of creating an intelligent crawler in Python. I have a project with a friend where we'd like to crawl a few specific car-related websites, grab some of the info and look for new entries. I am wondering if there is any existing technology out there where a crawler is sent to a site and either trained (visually?) or can understand repeating information like tables, which we could use to create a proof of concept. I'd appreciate any critique of my idea, which is:

1. create a visual tool - probably windows/mac based which uses the browser to navigate a site and to highlight elements that we would like to capture, such as car name, description, price. This would also have to be able to automatically/manually work out repeating elements

2. this tool would create some kind of file (xml?) which would then be used by the main crawler to understand how to navigate the site

3. The crawler, which we'd write in Python, would visit the site every week to look for new information
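
For illustration, here's roughly how I picture 2 and 3 fitting together, assuming lxml is available; the recipe layout, element names and URL are all just invented for the sketch:

    import urllib2
    from xml.etree import ElementTree
    from lxml import html

    def load_recipe(path):
        # Recipe written out by the visual tool (step 2). Everything in it --
        # the URL, the xpaths, the field names -- is made up for this sketch:
        #   <site url="http://example-cars.com/listings">
        #     <row xpath="//table[@id='listings']//tr[td]"/>
        #     <field name="name"  xpath="td[1]"/>
        #     <field name="price" xpath="td[3]"/>
        #   </site>
        root = ElementTree.parse(path).getroot()
        fields = [(f.get('name'), f.get('xpath')) for f in root.findall('field')]
        return root.get('url'), root.find('row').get('xpath'), fields

    def weekly_crawl(recipe_path, seen):
        # Step 3: fetch the site, apply the recipe, return only unseen entries.
        url, row_xpath, fields = load_recipe(recipe_path)
        doc = html.fromstring(urllib2.urlopen(url).read())
        new_entries = []
        for row in doc.xpath(row_xpath):
            entry = dict((name, row.xpath(xp)[0].text_content().strip())
                         for name, xp in fields)
            key = '%(name)s|%(price)s' % entry
            if key not in seen:
                seen.add(key)
                new_entries.append(entry)
        return new_entries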

Am I going about this the right way, or does anyone have any other ideas?

One point: we would seek permission from the sites before crawling - it would be to their benefit, as we're looking to push people their way.

Appreciate any thoughts anyone might have

All the best

John




I wanted to do this same thing a while ago and have done a lot of research and reading in this area. Here are some search terms that will likely help you:

* automatic wrapper generation

* information extraction

* removing noisy information from Web pages

* template detection

* wrapper induction

"Wrapper" is a fancy computer-science term for "scraper."

I wrote some Python code that does this -- given X sample documents, detect the differences between them and automatically create a scraper tailored to those documents. I released the first version open source -- it's called templatemaker: http://code.google.com/p/templatemaker/ .
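
The core idea, in a toy form you can play with using nothing but the standard library (this is difflib, not templatemaker's actual implementation): whatever matches across the sample pages is template, whatever differs is data.

    import difflib

    def learn_holes(sample_a, sample_b):
        # Matching runs across the two samples form the "template"; the
        # non-matching runs are the page-specific data we want to extract.
        matcher = difflib.SequenceMatcher(None, sample_a, sample_b)
        template, data_a, data_b = [], [], []
        for op, a1, a2, b1, b2 in matcher.get_opcodes():
            if op == 'equal':
                template.append(sample_a[a1:a2])
            else:
                template.append(None)        # a "hole" where the data goes
                data_a.append(sample_a[a1:a2])
                data_b.append(sample_b[b1:b2])
        return template, data_a, data_b

    template, a, b = learn_holes('<li>Toyota Corolla - $4500</li>',
                                 '<li>Honda Civic - $5200</li>')
    # The holes land roughly where the names and prices are, but character-level
    # diffing gets noisy fast -- hence the need for an HTML-aware version.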

But that version of templatemaker is quite brittle, because it was designed to work on plain text as much as on HTML. I've since written an HTML-aware version of templatemaker that is really frikkin' awesome (if I may say!) and beats the pants off the old one. I don't know if I'm going to open-source it, as it's quite valuable to my own startup.

Hope this helps!


Don't write your own crawler. Use nutch.

It is designed to scale and to do MapReduce-style parallel processing. I would strongly recommend you take a look before writing your own.

http://lucene.apache.org/nutch/


mapreduce? Just how many requests will you be making to third-party sites at once? Sounds like a good way to get blocked fast.


I think you are in over your head, but it's a great way to learn about the plumbing and underbelly of the Web.

This visual tool is basically what a company called onDisplay was doing back in 1999, before they were bought by consulting firm Vignette for an obscene amount of money. But scraping against the html structure is a losing battle.

A better approach is to use clues in the information itself to guess its content: something with a "$" is a price. Something containing "toyota" is probably a name, "blue" a color, more than 20 words containing "good", "v8" is a description, etc. That way your scraper is resistant to structure changes.
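
Something as dumb as this gets you surprisingly far (the keyword lists are obviously placeholders you'd grow over time):

    import re

    MAKES  = set(['toyota', 'honda', 'ford', 'bmw'])
    COLORS = set(['blue', 'red', 'black', 'silver'])

    def guess_field(text):
        # Guess what a scraped chunk of text is from its content,
        # not from where it happened to sit in the HTML.
        words = re.findall(r'\w+', text.lower())
        if re.search(r'\$\s*\d', text):
            return 'price'
        if any(w in MAKES for w in words):
            return 'name'
        if any(w in COLORS for w in words):
            return 'color'
        if len(words) > 20:
            return 'description'
        return 'unknown'

    # guess_field('$12,500 o.n.o.')             -> 'price'
    # guess_field('2004 Toyota Corolla VVT-i')  -> 'name'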

All that is separate from the problem of a crawler. It takes a long time and a lot of effort to convince content sites that what you are doing is a) helpful to them and b) something they should not be doing themselves.

It's like jumping on stage with the band and starting to play. You better be really good and friendly and prepared to get the crap beaten out of you.


Not sure if this is appropriate - as not responding to one particular person - but I'd like to send a BIG Thank You to everyone out there for the advice, encouragement and sometimes the reality "keep your feet on the ground" type stuff. This has inspired me to move forward with this. To those who have done stuff like this before, thanks for the links and I'm grateful for you sharing your experience.

If it's OK, I'd like to let people know about my experiences. Oh, and if anyone is interested in collaborating or just sharing ideas, I'd be happy to do likewise.

All the best


Good idea. Add an email or website at http://news.ycombinator.com/user?id=franklymydear and then you can exchange private messages by multicast rather than broadcast.

Feel free to ask me more questions by email. I spend a fair bit of time thinking about html parsers.


Thanks! I've just done this. Going to do some research into this area and then make a plan to start in the next week.


For 1, you mean you want to build a parser for arbitrary HTML that your crawler returns. Hard problem, as others have said. My advice:

1. Use an HTML parsing library. BeautifulSoup (Python) or Hpricot (Ruby) are good building blocks (see the sketch after this list).

2. Practice manually building parsers for a few sites, then see if it leads you to any insights about how to generalize the process.

3. Ignore everything else until you do 2. Just use wget as your crawler. Skip the visual interface for now; just parsing arbitrary pages is a hard enough problem to bite off.
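
To make 1 and 2 concrete, a hand-built parser for one imaginary listings page might be little more than this (class names and table layout are invented; the old BeautifulSoup module and bs4 both accept these calls):

    from BeautifulSoup import BeautifulSoup   # nowadays: from bs4 import BeautifulSoup

    def text_of(tag):
        # Flatten a tag to its text; works in old BeautifulSoup and bs4 alike.
        return ''.join(tag.findAll(text=True)).strip()

    def parse_listings(html):
        soup = BeautifulSoup(html)
        cars = []
        for row in soup.findAll('tr', {'class': 'listing'}):
            cells = row.findAll('td')
            if len(cells) >= 3:
                cars.append({'name':  text_of(cells[0]),
                             'price': text_of(cells[1]),
                             'desc':  text_of(cells[2])})
        return cars

Write a few of those against real sites, and whatever shape they have in common is what you'd try to generalize.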


Someone mentioned dapper.net and I upmodded it, but I think it will get lost in the noise.

As far as I understand, they are very close to what you are trying to do, so study them carefully as a competitor.


Yes, this is close to what I want to do in terms of functionality. My idea was to use a wizard-like approach to record the elements of a page that we need to capture and how to navigate through a specific site. They appear to be doing this, though the system has failed a few times on me - they're doing it in a browser, whereas I'd planned to create an app. Very interesting though. Anyone who knows of anything similar or who is interested in building something like this, get in touch.


Actually I might be interested - please leave an email.


I know that in PHP it's possible to load an HTML document and parse the DOM tree using XPath expressions; presumably that capability exists in Python.

So I guess in theory you could write a frontend (Firefox extension?) where you could highlight / select a screen area (webdeveloper already does this), then pass its DOM information (e.g. #body table tr td#username) to your backend, which would then scrape those fields from any applicable site pages.
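
That capability does exist in Python via lxml, for instance; the backend half could be as small as this (the URL is invented, and the CSS-ish selector above is rewritten as XPath):

    import urllib2
    from lxml import html   # assumes lxml is installed

    def scrape_field(url, xpath):
        # Fetch the page and pull out whatever the frontend's selector points at.
        doc = html.fromstring(urllib2.urlopen(url).read())
        return [el.text_content().strip() for el in doc.xpath(xpath)]

    # e.g. the "#body table tr td#username" idea from above, as XPath:
    # scrape_field('http://example.com/page', '//table//td[@id="username"]')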

This of course assumes that 1) the website(s) are well-formed enough for your parser and 2) well-programmed enough that the same info is in the same place in the DOM tree, and preferably ID'd, which are pretty HUGE assumptions, but could be worked around if you were determined enough.

Not sure if this is what you're looking for, and it seems a bit circuitous, but it's a plausible idea anyway.


I have built up something similar to what you are describing and it was a fun project. My first reaction to #1 is that if the information you want is reliably in the same place, you are probably better off just doing things manually and not going to the trouble of building a visual tool. In my experience the interesting pieces of data tend to move around, and something like regex is the best way to handle this. I used wget to grab data because it was quick and easy. I then did the post-processing in the background, separating the grabbing of data from the interpretation of data.
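
In rough outline it looked something like this (the class names and URL here are stand-ins, not the real sites I scraped):

    import os, re

    # Grab the page with wget -- quick and easy -- then post-process separately.
    os.system('wget -q -O listings.html http://example-cars.com/listings')
    page = open('listings.html').read()

    # Key the regex to nearby markup and keywords rather than to a fixed
    # position in a table, so small layout shuffles are less likely to break it.
    pattern = re.compile(r'class="car-name">([^<]+)<.*?class="price">([^<]+)<',
                         re.DOTALL)
    for name, price in pattern.findall(page):
        print name.strip(), price.strip()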


I remember seeing a screencast of some startup that did something like this. It was maybe a year ago. You click on elements, it shows you other hypotheses, you correct if necessary, and you get an RSS feed compiled from the page structure.

Anybody remember this?


dapper.net


Yes!


What you are talking about is the deep web, I think, and I don't think anyone has managed this yet! Essentially you want a system that can fill in forms and pull back results. I think it needs to be done on a per-site basis.
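
That said, the per-site version is quite doable from Python; mechanize will drive a search form much like a browser would (the site, form and field names below are all made up):

    import mechanize

    br = mechanize.Browser()
    br.open('http://example-cars.com/search')   # hypothetical search page

    # Fill in the site's search form and submit it, like a user would.
    br.select_form(nr=0)            # first form on the page
    br['make'] = ['toyota']         # a <select> control takes a list of values
    br['max_price'] = '5000'        # a text input takes a string
    response = br.submit()

    results_html = response.read()  # the "pulled back" results, per site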


John,

One thing that I would point out about the script you intend to write is that it requires an awful lot of maintenance (when sites change layout) and is frequently not very reusable. One solution that I tried is Mozenda (http://www.mozenda.com). They have all the stuff you're looking for (i.e. a visual, browser-based tool, writing to XML) but also have error handling and notifications, so that if an agent breaks you'll know and be able to fix things inside the visual tool.


It's called "scraping" and I've done it lots of times with Python; it's very easy. Don't bother with those other specialized, non-Python frameworks that people are suggesting.

http://wwwsearch.sourceforge.net/mechanize/

And if you need to do complicated html parsing in combination with that:

http://www.crummy.com/software/BeautifulSoup/

From there, it's cake.
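
A minimal taste of the combination (the URL and link pattern are invented):

    import re
    import mechanize
    from BeautifulSoup import BeautifulSoup   # or: from bs4 import BeautifulSoup

    br = mechanize.Browser()
    br.open('http://example-cars.com/listings')              # hypothetical site

    # Collect the details links mechanize finds on the listings page, then
    # hand each page's HTML to BeautifulSoup for the fiddly parsing.
    links = list(br.links(url_regex=re.compile(r'/cars/\d+')))
    for link in links:
        page = br.follow_link(link).read()
        soup = BeautifulSoup(page)
        heading = soup.find('h1')
        if heading:
            print ''.join(heading.findAll(text=True)).strip()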


I know this is Ruby, but it might be worth a look. It will help you no end, and there's no need for a web browser.

http://mechanize.rubyforge.org/mechanize/

BeautifulSoup might also be worth a look: http://www.crummy.com/software/BeautifulSoup/


Another talk I saw at SHDH in October (http://superhappydevhouse.org/SuperHappyDevHouse20):

http://tagtheplanet.net

They seem to be attempting an intelligent crawler as well.


You ought to check out screen-scraper (http://www.screen-scraper.com/). It's a commercial app, but the best I've used for this kind of thing. They also offer a freeware version.


What do you want to do?

At least in Germany, there exist a few solutions which do exactly that. If a person puts his car up for sale (a bargain) on one of the car-related websites, he gets the first call within about 20 seconds from someone using these programs.


One more thing: if you want to ask the site owners for permission, why not ask them to produce some specific XML file for you?


Because granting permission is easy. Why would they go to more effort than that for random people?


Why do thousands of job sites produce custom XML output for SimplyHired or Indeed?


They like buzzwords?


I think this is the best way to go. There's no reason that you should be scraping HTML from sites when there might be a nice XML feed available. For instance, PriceGrabber will only index your site if they have prices in XML.
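
Consuming such a feed is also trivial compared with scraping; something like this, if the feed looked as sketched below (layout and URL invented):

    import urllib2
    from xml.etree import ElementTree

    # Hypothetical feed the site owner agrees to publish:
    #   <cars>
    #     <car><name>Toyota Corolla</name><price>4500</price></car>
    #   </cars>
    feed = urllib2.urlopen('http://example-cars.com/cars.xml').read()
    for car in ElementTree.fromstring(feed).findall('car'):
        print car.findtext('name'), car.findtext('price')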


webharvest ?


Does anyone want to write a search engine? Python and C++. I thought we could analyze the links between pages and come up with some kind of ranking algorithm. We'd have to make the system massively parallel, but I think we could be breaking some new ground here.

Any takers?



