It is messy and overly ambitious, but promises something like a return to the "view source" mindset of the old web - where data was in plain sight and anyone curious and a little tenacious could reshape the web for their own needs.
I have gone partway down this path for a related concept, and browser extensions are really the only way to go. The biggest risk and hassle is a reliance on brittle, site-specific logic to make things work well. I haven't dug into this project yet to see how automated any of this is or might become, but if there is an element of community sourcing (like a ruleset for scraping AirBnB effectively) it opens up a potential attack vector like any GreaseMonkey-type script, especially if passed routinely to less technical users. Not a huge issue on day 1, but not an easy one to solve either.
Thanks! "View source mindset" is a nice name for what we're trying to promote with this project.
Brittle site-specific logic is indeed a challenge. So far we've started with the simplest possible approach: programmers manually writing scraping code, so we can focus on how the system works once the data is available (a rough sketch of what that looks like is below). That has been enough to test the system out and build lots of useful modifications ourselves.
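For a sense of what a hand-written adapter involves, here's a minimal sketch; the `ScrapedRow` shape, the function names, and the selectors are hypothetical illustrations, not the project's actual API.

```typescript
// Minimal sketch of a hand-written, site-specific scraper.
// Row shape, names, and selectors are illustrative, not the project's real API.

interface ScrapedRow {
  id: string;
  title: string;
  price: number | null;
}

function scrapeListings(doc: Document): ScrapedRow[] {
  // Site-specific selectors: this is the brittle part that breaks
  // whenever the site changes its markup.
  return Array.from(doc.querySelectorAll("[data-testid='listing-card']")).map(
    (card, i) => ({
      id: card.getAttribute("data-listing-id") ?? String(i),
      title: card.querySelector("h3")?.textContent?.trim() ?? "",
      price: parsePrice(card.querySelector(".price")?.textContent ?? ""),
    })
  );
}

function parsePrice(text: string): number | null {
  const digits = text.replace(/[^0-9.]/g, "");
  return digits ? Number(digits) : null;
}
```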
I think eventually some degree of automation will be important for helping end users apply this tool to any website. The "wrapper induction" problem has been well studied and there are lots of working solutions for end-user web scraping, so I expect to be able to integrate some of that work.
We're also interested in a community of shared scrapers, but as you point out there are security considerations. I'm considering central code review by the project to approve new site adapters and make sure they aren't doing anything obviously malicious. Another solution could be carefully restricting the expressivity of our scraping system (e.g. only specify CSS selectors, no arbitrary code), but I doubt that would be sufficient for all cases. Would appreciate any suggestions here.
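To make the restricted-expressivity idea concrete, here is one possible shape for a selectors-only adapter plus a tiny interpreter that runs it; the format and names are hypothetical, just a sketch of the general approach rather than the project's actual design.

```typescript
// Sketch of a selectors-only adapter format: community contributors submit
// plain data (no executable code), and a fixed interpreter applies it.
// The format itself is hypothetical, not the project's actual design.

interface DeclarativeAdapter {
  rowSelector: string;              // CSS selector matching each row element
  columns: Record<string, string>;  // column name -> CSS selector within a row
}

function runAdapter(
  adapter: DeclarativeAdapter,
  doc: Document
): Record<string, string>[] {
  return Array.from(doc.querySelectorAll(adapter.rowSelector)).map((row) => {
    const record: Record<string, string> = {};
    for (const [column, selector] of Object.entries(adapter.columns)) {
      record[column] = row.querySelector(selector)?.textContent?.trim() ?? "";
    }
    return record;
  });
}

// An adapter like this can be reviewed as data rather than audited as code.
const exampleAdapter: DeclarativeAdapter = {
  rowSelector: "[data-testid='listing-card']",
  columns: { title: "h3", price: ".price" },
};
```

The appeal is that reviewing a blob of selectors is much cheaper than auditing arbitrary scripts, though as noted above it probably wouldn't cover every site.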
I really like the term "view source mindset." I think it aptly applies to certain systems where you can intuit what the underlying programming is doing just by interacting with them. Definitely stealing that phrase.