I strongly second trafilatura. This will save a huge amount of money to just send...

abhgh · 2024-09-03T13:11:28 1725369088

Good to know! Yes, trafilatura is great. Sure, it breaks sometimes, but everything breaks on some websites; the real questions are how often and how badly. For general info, the library is described in the paper at [1], where Table 1 provides some benchmarks.

I also forgot to mention another interesting scraper, an LLM-based service. A quick search here tells me it was mentioned once by simonw, but I think it should be better known just for the convenience! Prepend "r.jina.ai/" to any URL to extract its text. For example, check out [2] or [3].

[1] https://aclanthology.org/2021.acl-demo.15.pdf

[2] https://r.jina.ai/news.ycombinator.com/

[3] (this discussion) https://r.jina.ai/news.ycombinator.com/item?id=41428274
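
If you want to script the prefix trick, here is a minimal Python sketch using only the standard library. The names `READER_PREFIX`, `reader_url` and `fetch_text` are my own, not part of any Jina client. Dropping the scheme is an assumption made to mirror the scheme-less form in [2] and [3], and the plain GET with a made-up User-Agent is likewise just a guess at a reasonable default.

```python
from urllib.request import Request, urlopen

# Prefix taken from the examples in [2] and [3].
READER_PREFIX = "https://r.jina.ai/"


def reader_url(url: str) -> str:
    """Return the r.jina.ai form of `url` (hypothetical helper)."""
    # Assumption: strip the scheme so the result matches the
    # scheme-less style of [2] and [3].
    for scheme in ("https://", "http://"):
        if url.startswith(scheme):
            url = url[len(scheme):]
            break
    return READER_PREFIX + url


def fetch_text(url: str, timeout: float = 30.0) -> str:
    """Fetch the extracted text for `url` through the reader service."""
    # The User-Agent value is an arbitrary placeholder.
    request = Request(reader_url(url), headers={"User-Agent": "reader-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_text("https://news.ycombinator.com/")` should request the same address as [2]. Only `reader_url` is pure string handling; `fetch_text` needs network access and depends on the service being up.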