> So the question is: can you write a web scraper in your browser? The answer is: YES, you can! So why is nobody doing it?
> One of the issues is what is called CORS (Cross-Origin Resource Sharing) which is a set of protocols which may forbid or allow access to a web resource by Javascript. There are two possible workarounds: a browser extension or a proxy server. The first choice is fairly limited since some security restrictions still apply.
I'm doing this for a browser extension that crawls a website from page to page checking for SEO/speed/security problems (https://www.checkbot.io/). It's been flexible enough, and it's nice not to have to maintain and scale servers for the web crawling. https://browserflow.app/ is another extension I know of that, I think, does scraping and other automation within the browser.
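To make the CORS part of the quote concrete, here's a rough sketch of the check a browser applies before it lets ordinary page JavaScript read a cross-origin response. The function name and parameters are made up for illustration, and it's simplified: it ignores preflight requests and only looks at the two response headers involved.

```typescript
// Hypothetical helper, not a real browser API: a simplified model of the
// CORS check on a cross-origin response.
function corsAllowsRead(
  pageOrigin: string,               // e.g. "https://my-scraper.example"
  allowOrigin: string | null,       // Access-Control-Allow-Origin value
  allowCredentials: string | null = null, // Access-Control-Allow-Credentials value
  withCredentials = false,          // whether cookies/auth were sent
): boolean {
  // No header at all: the response is hidden from page scripts.
  if (allowOrigin === null) return false;

  if (!withCredentials) {
    // Without credentials, a wildcard or an exact origin match is enough.
    return allowOrigin === "*" || allowOrigin === pageOrigin;
  }

  // With credentials, the wildcard is not accepted: the origin must match
  // exactly and the server must opt in explicitly.
  return allowOrigin === pageOrigin && allowCredentials === "true";
}
```

Most sites don't send these headers for their HTML pages, so a scraper running as a normal web page fails this check for nearly every target, which is why the quote ends up at an extension or a proxy.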
Well, there's not a lot any crawler can do if a website is gated with aggressive bot detection, e.g. Puppeteer via a proxy will have similar problems. Even if a bypass is found, it could break tomorrow. I've rarely had support messages about this, and most of them were resolved by adding IP addresses or user-agent/header strings to an allow list, or by turning down how aggressive the bot detection is. Checkbot is more for crawling sites you have control over, so there are more options here.
It is worrying what this means for the future of web crawlers in general, though, if most sites end up being gated against all bots that aren't from major search engines.