
Building scrapers sucks.

It's generally not hard because the problem is conceptually difficult, or because it requires extremely high-level reasoning.

It sucks because when someone changes "<section class='bio'>" to "<div class='section bio'>" your scraper breaks. I just want the bio, and it's obvious what to grab, but machines have no nuance.

LLMs have enough common sense to deal with these things, and they take almost no time to work with. I can throw HTML at a model with a vague description and pull out structured data, no engineer required, and it'll probably keep working when the page changes.
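A minimal sketch of what that pipeline looks like. Here `call_llm` is a stand-in for whichever model API you use (it's stubbed out so the example runs); the prompt shape and field description are my own illustration, not a specific product's API:

```python
import json

def call_llm(prompt: str) -> str:
    # Stub standing in for a real model call (OpenAI, Anthropic, a local
    # model, etc.) -- a real implementation would hit an API here.
    return '{"bio": "Jane Doe is a systems programmer."}'

def extract(html: str, description: str) -> dict:
    # Ask for JSON only, so the response parses into structured data.
    prompt = (
        "Extract the following from this HTML and reply with JSON only.\n"
        f"Fields: {description}\n\nHTML:\n{html}"
    )
    return json.loads(call_llm(prompt))

# The same call works whether the bio lives in <section class='bio'> or
# <div class='section bio'> -- the model reads content, not markup.
data = extract(
    "<div class='section bio'>Jane Doe is a systems programmer.</div>",
    "bio: the person's biography",
)
```

The point is that nothing here encodes the page structure; swapping the markup underneath doesn't touch the code.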

There's a huge number of one-off jobs people will do where perfect isn't the goal, and a fast solution + a bit of cleanup is hugely beneficial.



Another approach is to use a regexp scraper. These are very "loose" and tolerant of changes. For example, RNSAFFN.com uses regular expressions to scrape the Commitments of Traders report from the Commodity Futures Trading Commission every week.
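To illustrate what "loose" means here, a pattern can key on content landmarks (a class token, surrounding text) rather than exact markup, so it survives tag swaps. This is an illustrative pattern of mine, not the one RNSAFFN.com actually uses:

```python
import re

# Match any tag whose class attribute contains the token "bio",
# regardless of tag name or extra classes.
BIO_RE = re.compile(
    r"<\w+[^>]*class=['\"][^'\"]*\bbio\b[^'\"]*['\"][^>]*>(.*?)</\w+>",
    re.DOTALL,
)

old = "<section class='bio'>Jane codes.</section>"
new = "<div class='section bio'>Jane codes.</div>"

# The same pattern extracts the bio from both versions of the markup.
assert BIO_RE.search(old).group(1) == "Jane codes."
assert BIO_RE.search(new).group(1) == "Jane codes."
```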


My experience has been the opposite: regex scrapers are usually incredibly brittle, and also harder to debug when something DOES change.

My preferred approach for scraping these days is Playwright for Python, using CSS selectors to pull things out of the DOM. Still prone to breakage, but reasonably pleasant to debug with the browser DevTools.
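A sketch of that approach, assuming Playwright is installed and browsers have been fetched with `playwright install`. The URL and the `.bio` selector are illustrative:

```python
def scrape_bio(url: str) -> str:
    # Deferred import so the sketch can be read without Playwright present.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # A class selector like ".bio" matches both <section class="bio">
        # and <div class="section bio">, so it tolerates some tag churn.
        text = page.locator(".bio").first.inner_text()
        browser.close()
        return text

if __name__ == "__main__":
    print(scrape_bio("https://example.com/profile"))
```

Selectors you can test live in the DevTools console (`document.querySelector(".bio")`) are exactly the ones you hand to `page.locator`, which is what makes the debugging loop pleasant.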



