
I think you are in over your head, but it's a great way to learn about the plumbing and underbelly of the Web.

This visual tool is basically what a company called onDisplay was doing back in 1999, before they were bought by the software company Vignette for an obscene amount of money. But scraping against the HTML structure is a losing battle.

A better approach is to use clues in the information itself to guess what each piece of content is: something with a "$" is a price. Something containing "toyota" is probably a name, "blue" is a color, a passage of more than 20 words that contains "good" or "v8" is a description, etc. That way your scraper is resistant to structure changes.
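
As a rough sketch of that idea in Python: the word lists, the regex, and the `guess_field` name are all made up for illustration, and a real scraper would need far bigger vocabularies and better tie-breaking. The point is only that it looks at the text, never at the tags around it.

```python
import re

# Assumed, deliberately tiny vocabularies; real ones would be much larger.
MAKES = {"toyota", "honda", "ford"}
COLORS = {"blue", "red", "black", "white", "silver", "green"}
DESCRIPTION_HINTS = {"good", "v8"}

# A dollar sign followed (optionally after spaces) by a digit.
PRICE_RE = re.compile(r"\$\s*\d")


def guess_field(text: str) -> str:
    """Guess what a text fragment is from its content alone."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    word_set = set(words)

    # Check long free text first: a description may also mention
    # a price, a make, or a color.
    if len(words) > 20 and word_set & DESCRIPTION_HINTS:
        return "description"
    if PRICE_RE.search(text):
        return "price"
    if word_set & MAKES:
        return "name"
    if word_set & COLORS:
        return "color"
    return "unknown"
```

You would run this over every text fragment pulled off the page, e.g. `guess_field("$12,500")` or `guess_field("Toyota Camry LE")`, and keep whatever labels come back. If the site reshuffles its markup, the fragments still classify the same way.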

All that is separate from the problem of a crawler. It takes a long time and a lot of effort to convince content sites that what you are doing is a) helpful to them and b) something they should not be doing themselves.

It's like jumping on stage with the band and starting to play. You'd better be really good and friendly, and prepared to get the crap beaten out of you.



