Hacker News

We maintain a list of around 1000 major publishers across the world and we crawl it every 15 minutes. Every other publisher (smaller blogs, etc.) comes through our global crawl.
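The two-tier setup described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual system; the function and variable names are assumptions.

```python
import time

# Hypothetical tiered crawl: the curated list of ~1000 major
# publishers is re-crawled on a fixed 15-minute cadence, while
# everything else is left to a separate global crawl (not shown).
CRAWL_INTERVAL = 15 * 60  # seconds between passes over the list

def crawl(url):
    # Placeholder for the real fetch-and-parse pipeline.
    return {"url": url, "fetched_at": time.time()}

def crawl_major_publishers(publishers):
    """One pass over the curated publisher list."""
    return [crawl(url) for url in publishers]

def run_forever(publishers):
    # The long-running loop: one full pass, then sleep out the
    # remainder of the 15-minute window.
    while True:
        start = time.time()
        crawl_major_publishers(publishers)
        time.sleep(max(0, CRAWL_INTERVAL - (time.time() - start)))
```

In practice a real crawler would fetch concurrently and respect per-site politeness delays; the point here is just the fixed cadence over a small curated list versus a broad background crawl.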

The list itself isn’t particularly hard to maintain. What’s hard is the myriad of rules and configurations required to crawl and scrape each publisher. We built a model that extracts article data, and it does a good job figuring out headlines, images, authors, and text.

Scraping rules are manageable if you're only crawling a few publishers, but it gets exponentially harder once you're crawling hundreds.
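One common way this plays out is a per-site rules table with a learned extractor as the fallback, which is roughly what the comment describes. The sketch below is an assumption about the shape of such a system; the domains, selectors, and function names are all hypothetical.

```python
# Hand-maintained, per-publisher scraping rules. Every entry is a
# site-specific quirk someone has to keep working as markup changes,
# which is why the table gets expensive at hundreds of publishers.
SCRAPE_RULES = {
    "example-news.test": {"headline": "h1.title", "body": "div.article-body"},
    "another-paper.test": {"headline": "h1#headline", "body": "section.story"},
}

def extract_with_model(html):
    # Stand-in for the ML model that infers headline, author,
    # images, and body text from arbitrary article HTML.
    return {"headline": None, "body": html[:200]}

def extract(domain, html):
    """Use hand-written rules when we have them, else the model."""
    rules = SCRAPE_RULES.get(domain)
    if rules is None:
        return extract_with_model(html)
    # A real implementation would apply these CSS selectors to the
    # HTML; here we just report which rules were chosen.
    return {"rules_used": rules}
```

The maintenance cost lives almost entirely in `SCRAPE_RULES`: each new publisher adds selectors that can silently break on a redesign, while the model-based fallback degrades more gracefully but less precisely.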
