We maintain a list of around 1,000 major publishers worldwide and crawl each of them every 15 minutes. Every other publisher (smaller blogs, etc.) comes through our global crawl.
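A minimal sketch of what that fixed-cadence loop might look like. Everything here is assumed for illustration: `fetch_frontpage` is a hypothetical stand-in for the real fetch logic, and the concurrency and interval handling are one obvious way to keep a cycle inside its 15-minute budget, not a description of their actual system.

```python
import time
import concurrent.futures

CRAWL_INTERVAL_SECONDS = 15 * 60  # the 15-minute cadence described above

def fetch_frontpage(publisher_url: str) -> None:
    # Hypothetical: the real version would handle HTTP, retries,
    # politeness delays, robots.txt, and so on.
    print(f"fetching {publisher_url}")

def crawl_cycle(publishers: list[str]) -> None:
    # Fetch front pages concurrently so one slow publisher
    # doesn't push the whole cycle past the interval.
    with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
        list(pool.map(fetch_frontpage, publishers))

def run(publishers: list[str]) -> None:
    while True:
        start = time.monotonic()
        crawl_cycle(publishers)
        elapsed = time.monotonic() - start
        # Sleep out whatever is left of the 15-minute window.
        time.sleep(max(0.0, CRAWL_INTERVAL_SECONDS - elapsed))

if __name__ == "__main__":
    run(["https://example-news.com", "https://example-daily.com"])
```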
The list itself isn’t particularly hard to maintain. What’s hard is the myriad of rules and configurations required to crawl and scrape each publisher. We built a model that extracts article data, and it does a good job of figuring out headlines, images, authors, and body text.
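The comment doesn't say how their model works, so as a rough illustration of the same extraction task, here is the off-the-shelf newspaper3k library pulling those same fields from a single page. The URL is hypothetical; their in-house model presumably does this with higher accuracy.

```python
# pip install newspaper3k
from newspaper import Article

url = "https://example-news.com/some-story"  # hypothetical URL

article = Article(url)
article.download()   # fetch the raw HTML
article.parse()      # run the generic extractors

print(article.title)      # headline
print(article.authors)    # list of detected author names
print(article.top_image)  # URL of the lead image
print(article.text)       # cleaned body text
```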
Scraping rules are quite manageable if you're only planning to crawl a few publishers, but it gets exponentially more difficult once you're crawling hundreds.
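To see why it scales so badly, here is a sketch of what hand-maintained per-publisher rules tend to look like. The domains, selectors, and `scrape` helper are all hypothetical; the point is that every site needs its own entry, and every redesign silently breaks one.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Hypothetical per-publisher rules: one hand-curated block per site.
SCRAPE_RULES: dict[str, dict[str, str]] = {
    "example-news.com": {
        "headline": "h1.article-title",
        "author":   "span.byline a",
        "body":     "div.article-body p",
    },
    "example-daily.com": {
        "headline": "header h1",
        "author":   "p.author-name",
        "body":     "section.story-content p",
    },
    # ... hundreds more, each discovered and maintained by hand
}

def scrape(domain: str, html: str) -> dict[str, str]:
    # Apply a publisher's CSS selectors to a fetched page.
    rules = SCRAPE_RULES[domain]
    soup = BeautifulSoup(html, "html.parser")
    return {
        field: " ".join(el.get_text(strip=True) for el in soup.select(selector))
        for field, selector in rules.items()
    }
```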