Hi everyone. Not sure if I should be posting this to a forum or not. Curious to see what people's answers are here, as I read this site a lot.
I'm thinking of creating an intelligent crawler in Python. I have a project with a friend where we'd like to crawl a few specific car-related websites, grab some of the info and look for new entries. I am wondering if there is any existing technology out there where a crawler is sent to a site and is either trained (visually?) or can work out repeating information, such as tables, by itself, and that we could use to build a proof of concept. I'd appreciate any critique of my plan, which is:
1. Create a visual tool, probably Windows/Mac based, which uses the browser to navigate a site and highlight the elements that we would like to capture, such as car name, description and price. It would also have to work out repeating elements, either automatically or with manual help.
2. This tool would create some kind of file (XML?) which would then be used by the main crawler to understand how to navigate the site.
3. The crawler, which we'd write in Python, would visit the site every week to look for new information.
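To make steps 2 and 3 a bit more concrete, here is a rough sketch of a crawler driven by such a file. Everything in it is an assumption made up for illustration: the XML layout (`site`, `item`, `field`), the sample page, and the function names. It also assumes the page is well-formed markup so that the standard library's ElementTree can parse it; real pages would most likely need a proper HTML parser, and the page would be fetched over HTTP rather than hard-coded.

```python
import xml.etree.ElementTree as ET

# Hypothetical config file that the visual tool might emit (step 2).
CONFIG_XML = """
<site name="example-cars">
  <item path=".//div[@class='car']">
    <field name="name" path="h2"/>
    <field name="price" path="span[@class='price']"/>
  </item>
</site>
"""

# Stand-in for a fetched page; well-formed so ElementTree can parse it.
PAGE = """
<html><body>
  <div class="car"><h2>Ford Focus</h2><span class="price">4500</span></div>
  <div class="car"><h2>Mini Cooper</h2><span class="price">6200</span></div>
</body></html>
"""

def load_config(config_xml):
    """Return the path of the repeating element and a {field name: path} map."""
    root = ET.fromstring(config_xml.strip())
    item = root.find("item")
    fields = {f.get("name"): f.get("path") for f in item.findall("field")}
    return item.get("path"), fields

def extract_items(page_markup, item_path, fields):
    """Pull one record per repeating element, using the configured paths."""
    root = ET.fromstring(page_markup.strip())
    records = []
    for node in root.findall(item_path):
        record = {}
        for name, path in fields.items():
            found = node.find(path)
            has_text = found is not None and found.text
            record[name] = found.text.strip() if has_text else None
        records.append(record)
    return records

def find_new(records, seen_keys, key_field="name"):
    """Keep only records whose key was not seen on an earlier visit (step 3)."""
    return [r for r in records if r[key_field] not in seen_keys]

item_path, fields = load_config(CONFIG_XML)
records = extract_items(PAGE, item_path, fields)
new_records = find_new(records, seen_keys={"Ford Focus"})
```

In a weekly run, `seen_keys` would be loaded from wherever the previous results were stored, and the new records would be saved back afterwards.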
Am I going about this the right way, or does anyone have any other ideas?
One point: we would seek permission from the sites before crawling. It would be to their benefit, as we're looking to push people their way.
Appreciate any thoughts anyone might have.
All the best
John
* automatic wrapper generation
* information extraction
* removing noisy information from Web pages
* template detection
* wrapper induction
"Wrapper" is a fancy computer-science term for "scraper."
I wrote some Python code that does this: given X sample documents, it detects the differences between them and automatically creates a scraper tailored to those documents. I released the first version as open source. It's called templatemaker: http://code.google.com/p/templatemaker/ .
But that version of templatemaker is quite brittle, because it was designed to work on plain text as much as on HTML. I've since written an HTML-aware version of templatemaker that is really frikkin' awesome (if I may say!) and beats the pants off the old one. I don't know if I'm going to open-source it, as it's quite valuable to my own startup.
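To give a feel for the general idea (detect what differs between sample documents and turn that into a scraper), here is a minimal sketch. It is not templatemaker's code or API. It is a stand-in built on the standard library's `difflib`, and the function names `tokenize`, `learn_template` and `extract` are made up for this example. It learns from only two samples and treats each tag and each run of text between tags as one token, which is a simplifying assumption.

```python
import difflib
import re

def tokenize(document):
    """Split markup into tags and the runs of text between them."""
    return re.findall(r"<[^>]+>|[^<]+", document)

def learn_template(sample_a, sample_b):
    """Build a regex: shared tokens become literals, differing spans become holes."""
    tokens_a, tokens_b = tokenize(sample_a), tokenize(sample_b)
    matcher = difflib.SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    parts = []
    prev_a = prev_b = 0
    for block in matcher.get_matching_blocks():
        # Tokens skipped in either sample since the last shared block
        # are the part that varies, so capture whatever appears there.
        if block.a > prev_a or block.b > prev_b:
            parts.append("(.*?)")
        shared = "".join(tokens_a[block.a:block.a + block.size])
        parts.append(re.escape(shared))
        prev_a = block.a + block.size
        prev_b = block.b + block.size
    return re.compile("^" + "".join(parts) + "$", re.DOTALL)

def extract(template, document):
    """Return the values found in the holes, or None if the page does not fit."""
    match = template.match(document)
    return match.groups() if match else None

sample_a = "<li><b>Ford Focus</b> <i>4500</i></li>"
sample_b = "<li><b>Mini Cooper</b> <i>6200</i></li>"
template = learn_template(sample_a, sample_b)

print(extract(template, "<li><b>Fiat Panda</b> <i>3100</i></li>"))
# prints ('Fiat Panda', '3100')
```

A page that does not share the learned structure simply fails to match and gives `None`, which is a cheap signal that the site's layout has changed and the template needs relearning.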
Hope this helps!