It is messy and overly ambitious, but promises something like a return to the "view source" mindset of the old web - where data was in plain sight and anyone curious and a little tenacious could reshape the web for their own needs.
I have gone partway down this path for a related concept, and browser extensions are really the only way to go. The biggest risk and hassle is a reliance on brittle, site-specific logic to make things work well. I haven't dug into this project yet to see how automated any of this is or might become, but if there is an element of community sourcing (like a ruleset for scraping AirBnB effectively) it opens up a potential attack vector like any GreaseMonkey-type script, especially if passed routinely to less technical users. Not a huge issue on day 1, but not an easy one to solve either.
Thanks! "View source mindset" is a nice name for what we're trying to promote with this project.
Brittle site-specific logic is indeed a challenge. So far we've started with the simplest possible approach: programmers manually writing scraping code, so we can focus on how the system works once the data is available (a rough sketch of what that looks like is below). That has been enough to test the system out and build lots of useful modifications ourselves.
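For a sense of what a hand-written adapter involves, here's a minimal sketch; the `ScrapedRow` shape, the function names, and the selectors are hypothetical illustrations, not the project's actual API.

```typescript
// Minimal sketch of a hand-written, site-specific scraper.
// Row shape, names, and selectors are illustrative, not the project's real API.

interface ScrapedRow {
  id: string;
  title: string;
  price: number | null;
}

function scrapeListings(doc: Document): ScrapedRow[] {
  // Site-specific selectors: this is the brittle part that breaks
  // whenever the site changes its markup.
  return Array.from(doc.querySelectorAll("[data-testid='listing-card']")).map(
    (card, i) => ({
      id: card.getAttribute("data-listing-id") ?? String(i),
      title: card.querySelector("h3")?.textContent?.trim() ?? "",
      price: parsePrice(card.querySelector(".price")?.textContent ?? ""),
    })
  );
}

function parsePrice(text: string): number | null {
  const digits = text.replace(/[^0-9.]/g, "");
  return digits ? Number(digits) : null;
}
```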
I think eventually some degree of automation will be important for helping end users apply this tool to any website. The "wrapper induction" problem has been well studied and there are lots of working solutions for end-user web scraping, so I expect to be able to integrate some of that work.
We're also interested in a community of shared scrapers, but as you point out there are security considerations. I'm considering central code review by the project to approve new site adapters and make sure they aren't doing anything obviously malicious. Another solution could be carefully restricting the expressivity of our scraping system (e.g. only specify CSS selectors, no arbitrary code), but I doubt that would be sufficient for all cases. Would appreciate any suggestions here.
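To make the restricted-expressivity idea concrete, here is one possible shape for a selectors-only adapter plus a tiny interpreter that runs it; the format and names are hypothetical, just a sketch of the general approach rather than the project's actual design.

```typescript
// Sketch of a selectors-only adapter format: community contributors submit
// plain data (no executable code), and a fixed interpreter applies it.
// The format itself is hypothetical, not the project's actual design.

interface DeclarativeAdapter {
  rowSelector: string;              // CSS selector matching each row element
  columns: Record<string, string>;  // column name -> CSS selector within a row
}

function runAdapter(
  adapter: DeclarativeAdapter,
  doc: Document
): Record<string, string>[] {
  return Array.from(doc.querySelectorAll(adapter.rowSelector)).map((row) => {
    const record: Record<string, string> = {};
    for (const [column, selector] of Object.entries(adapter.columns)) {
      record[column] = row.querySelector(selector)?.textContent?.trim() ?? "";
    }
    return record;
  });
}

// An adapter like this can be reviewed as data rather than audited as code.
const exampleAdapter: DeclarativeAdapter = {
  rowSelector: "[data-testid='listing-card']",
  columns: { title: "h3", price: ".price" },
};
```

The appeal is that reviewing a blob of selectors is much cheaper than auditing arbitrary scripts, though as noted above it probably wouldn't cover every site.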
I really like the term "view source mindset." I think it aptly applies to certain systems where you can intuit what the underlying programming is doing just by interacting with them. Definitely stealing that phrase.