The toxicity towards web-scraping is really what makes me lose hope in the current web.
People want their data to be public and all of the benefits that come with public data, but then they want to choose who gets to see it - it's a complete and utter paradox.
This precedent doesn't really mean much but is definitely a step in the right direction.
The issue here for some, if not many, is a matter of scale. It is one thing if an end user, whom I am trying to serve, comes to my site and gets my publicly available data. Maybe I monetize with ads, maybe not. It doesn't matter; that is the audience I am trying to serve, regardless of size.
But when you scrape it, my load goes up dramatically. A load I have to pay for.
It is analogous to the privacy debates going on, with one side saying "hey, don't track everywhere I go and tag me with facial recognition" and the other side saying "hey, you are in public and people can see you." The issue is not complete privacy, but one of scale. And of intent.
I believe society is soon going to have to come to grips with the scale of things and legislate what are acceptable scales of action, as it seems to be becoming a large issue in a growing number of areas.
So you throttle your users. We have HTTP status codes for "too many requests" and all scraper software comes with a delay setting by default. Everybody who does scraping is supposed to know that it's rude to blast a thousand requests per second.
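On the server side, that throttling can be as simple as a token bucket per client. A minimal sketch, assuming per-IP keying (the rate, burst size, and `handle` function are illustrative, not anyone's actual setup):

```python
import time

class TokenBucket:
    """Per-client throttle: allow `rate` requests/second with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.now = now            # injectable clock, useful for testing
        self.last = now()

    def allow(self):
        """True if the request may proceed, False if it should get a 429."""
        t = self.now()
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per client IP; a scraper blasting requests drains its bucket fast,
# while a polite scraper with a delay setting never runs dry.
buckets = {}

def handle(ip):
    bucket = buckets.setdefault(ip, TokenBucket(rate=2, capacity=5))
    return 200 if bucket.allow() else 429  # 429 Too Many Requests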
This ruling has left open a big question of how much you need to spend to support scrapers and where the line between scraping and a DoS attack lies - and that's going to be a weird line. If my site is producing a big report off of data that changes quarterly then re-downloading that report every 20 minutes is possibly excessive and might wander into the realm of an attack - while if we looked at the same frequency with twitter it seems a lot more reasonable - maybe even a bit on the slow side.
Entirely feasible. Also reasonable for you to pay me for the service as it is taking my development efforts to meet your business model. The advantage to you is you have a defined interface that I won't prevent.
I guess you missed the comment I was replying to: it may cost you more money, in bandwidth and per page resources, to not provide an API than it does for you to provide one.
So no, I won’t pay you for the privilege of you saving money.
Who pays for that API and the bandwidth? What’s in it for the data provider? On LinkedIn, viewing the data now shows ads or at least prompts the viewer to join the network. With scraping and free API access, how exactly does LinkedIn benefit from their work of hosting the data?
Unless the costs exceed their current operational costs. Don't forget the time spent redeveloping on the new API, which includes validating everything is there, testing and cleaning up and removing the old (working) code.
This isn't a great analogy here - getting the data delivered via API is simply more useful than having to re-assemble that data out of fragments parsed off of different web calls.
Could I suggest:
"Why buy the cheese when you get the milk for free?"
Look, if you build a product that relies on providing free information to the public, then you don't get to select a segment of that public and charge them for it. You can't hang a billboard on a highway and then get upset when some people look at it the wrong way.
Now, if you want to have a walled garden and charge for entry to some and let others in free then that is fine.
> People want their data to be public and all of the benefits that come with public data, but then they want to choose who gets to see it - it's a complete and utter paradox.
That's a complete misconception. Of course you can manufacture inconsistent ideologies if you combine ideas from different people, but I think you'd have a difficult time finding one person who believes what you just described.
What I want is, put simply, organizational transparency, personal privacy. I believe humans have a right to privacy, but I don't believe organizations have rights, period, and I believe radical transparency within an organization prevents organizations from trampling the rights of individuals.
Organizations in this case include corporations, governments, and nonprofits.
I'm interested in hearing your take on "organizational transparency". Like please push the concept / idea to its 'full' realization and tell me that picture, even if it implies a little bit of "sci-fi"¹.
Digging this because I think that domain / paradigm will see unparalleled evolution in the next few decades.
[1]: I mean, don't stop at current law / values / behaviors; like people from the 1940s wouldn't have dared speak about their idea of the 1970s because they'd think their belief "impossible". No flying cars though (Clarke-tech), because that's not a decision of the individual.
I don't think that looking too far ahead is useful: this is just a matter of pragmatics. Revolutionary change in a peaceful society happens via a long sequence of small, incremental changes, and that's a good thing, because you get to see how each of the changes plays out. I think the best sci-fi persuades you that it's looking at the distant future when in fact it's only using the future as a foil to provide deep insight into the present.
The short-term, the small, incremental changes I'd like to see are:
1. Reversal of the default privacy setting of government docs. Instead of documents being default-private and citizens having to make FOIA requests to make those documents public, documents should be default-public, and government workers should have to apply through an adversarial system (similar to courts) to classify documents, proving to a court why the document needs to be classified.
2. Classified documents should have a short (1 year max) timeframe after which they are declassified, or government workers should have to reapply to justify why the documents need to remain classified.
3. Political party documents should be public, without any provision for classifying them.
4. Tax-exempt organization documents should be public, without any provision for classifying them.
5. IPO'ed organization documents should be public, without any provision for classifying them.
6. Body cams on all police and military while on duty (when they are acting on behalf of an organization). 1 and 2 would apply to the footage from these cams as well.
7. Exceptions to 1-6 should be made for the personally-identifiable information of people who are not in the organization.
8. Organizations should be required to maintain a list of all the personally-identifiable information they have on a person (including employees), and provide that data to that person on demand by that person or their legal guardian, as well as a list of all people with whom that data has been shared, and be required to delete that information upon request by that person or their legal guardian.
9. Research which receives public funding should be forced to publish its results publicly.
10. All software which receives public funding should be forced to publish its source publicly.
11. Government documents should be published in open, machine-readable formats suitable for computer analysis (e.g. CSV, text, or some XML format--no PDFs).
1, 2, 6, 11 should be no-brainers if people were educated IMHO — but this is 1920 relative to electricity or cars; still a long way to go before the mainstream masses get it (which very much includes political figures). I would think 2030-2040 for the emergence of ethical consensus and concern (the kind that pervades political parties and social classes).
That is assuming the needle doesn't move too much farther in the authoritarian direction until then (the 20-year trend is really not looking that way currently).
3, 4, 5, 9, 10 are/would be met by strong opposition from interest groups, I'm sure you see that too. Everything I know about 3 tells me it's never going to happen with current parties / politicians. It's at least 1 generation away and I'm not sure the concept itself isn't utopia. 9 and 10 as well, I think it largely depends on the cultural paradigm (and this world's in 2020 is really not aligned with that, nor does it trend or even look that way). 4, 5 likewise, complex topics, lots and lots and lots of gatekeepers and lobbyists.
My take on these is they're very costly in terms of political capital; and they are largely debatable (politically, legally, philosophically, etc., you'll find passionate captains on both sides); thus there are 'better' (more consensual, with direct net positive effect) lower hanging fruits imho.
7 and 8 are hard problems, notably because of scale and the need for automation — it's part of a much bigger domain, automation of compliance and building "trustable" systems etc.; the kind that bridge or plane engineers must build, and probably software engineers too, but you know we're far from that if you read this forum.
I'd say 1 2 6 11 and 7 8 on the way to scale/automation already paint a whole different regime and degree of maturity for a 21st century State. I'd like to think we're now ~1 generation away from enactment of such norms.
What if the organization is one person in an LLC? Do they get rights? If so then a big company can hire a bunch of little LLCs to act as rights-having proxies for any task that requires them.
I'm going to assume you're asking in good faith and try to address the confusion here.
The human does get rights, the organization doesn't.
In some cases, believing that humans have rights and believing that organizations have rights might lead one to the same action. In those cases, I'd take the action. I wouldn't want to violate a human's rights out of some vindictive dislike of organizations: that's not the point. The point is that I'd take that action because I believe in human rights, not because I believe that the organization has rights.
And with organizational transparency: the entire point of organizational transparency is to protect human rights. In cases where organizational transparency would trample human rights to privacy, I would go with human rights every time. Violating the privacy of humans to achieve organizational transparency would defeat the entire purpose.
Let's say that individual humans have the right to keep secrets. Let's also say that they have the right to keep secrets with their associates, and to tell them to who they please. Now, doesn't that make it legal for a group of people to keep secrets about you? What about selling them? I just don't see what doing away with the legal fiction of corporate personhood would do about Facebook.
It may not be your intent, but you're using some very vague, inapplicable terminology to make some screwed up behavior sound normal.
If you can tell secrets to who you please and sell them on the internet, they aren't secrets. Somewhere in the middle of what you're saying, the secrets stopped being secrets, but you kept using the word as if it still applied.
Facebook isn't a group of associates trading anecdotes about their friends: the server guy has never met Mark Zuckerberg, and they are not "associates" in any meaningful way. They're not friends, or even really allies: Facebook certainly has shown inconsistent concern for the well-being of its workers. So let's also drop the "associates" terminology: these aren't "associates", they're employers and employees. Employees aren't acting as individual humans on their own behalf, they're acting on behalf of an organization.
Putting aside the rights conversation for a second, let me ask you a question: if you tell your friend a secret in confidence, and they turn around and sell it to anyone on the internet who will pay a low fee, that would be pretty screwed up, no? We don't even have to talk about rights here: this is just screwed up behavior, regardless of the rights conversation.
Are contracts allowed in your worldview? Contracts must be signed by individuals, but when they act as representatives of a company, they are legally binding for that company. If they only have individual rights, then all individuals who didn't physically sign a contract cannot be held to it.
At a small scale, it makes sense for people to appoint another person whom they trust to represent them in negotiation. But in a lot of cases, that's not how representatives are chosen. Particularly in corporations, the leadership of a corporation was not chosen by the employees to represent them, and in fact often doesn't even have the best interest of the employees in mind. A lot of the largest problems in our society arise from this fact.
Consider the case of a company that agrees to sell to a larger corporation, under the condition that they lay off half their workforce in advance of the sale. Surely we can agree that the laid-off workers were not fairly represented by the person signing the contract.
One could argue that the workers agreed to give up some of their rights in their employment contract, but I'd argue that they did so under duress: their option is sign the contract and work for the company, or starve and let their families starve. Sure, they can go work for another company, but other companies will require them to similarly sign away their rights.
This shouldn't be taken as a recommendation to blithely break contract law. Corporations don't have rights, but they do have power, and it would be unwise to behave as if they can't make your life miserable if you decide to cross them.
It's a bad deal for corporations, but I do not care. Lack of liability is the cause of a ton of problems in our society.
Just to pick two stories of corporate sociopathy: Probably the reason people at State Farm are unconcerned about forging signatures[1] is that they know that the worst case scenario is that State Farm loses some business and maybe gets a fine: they are unlikely to go to jail for forgery or to have fines exacted from their personal bank accounts. Similarly, when Practice Fusion literally killed people[2], their execs had little to fear: nobody went to jail, nobody was fined: shareholders who had no visibility into the decision paid the fines.
When banks tanked the economy with irresponsible lending most were bailed out and gave their workers bonuses, while the people who were unable to pay their mortgages were ignored.
A little more liability for destructive behavior would be great for most people.
Forgery is criminal regardless of whether a private or official document is concerned. Even in a military setting, forgery of business-related documents is illegal.
> A little more liability for destructive behavior would be great for most people.
Why not full rights, full liability? Replace imprisonment and death by temporary and permanent suspension of company (including re-establishment of a sequel organisation out of a subset of stakeholders) respectively; and voila.
Making organization membership public would trample on personal privacy quite effectively in some respects, such as with disease support groups or PACs; medical privacy is taken seriously, but is there such a thing as political affiliation being private? Is it a violation of someone's privacy to reveal they give to the ACLU?
This is where pseudonymous identities make a lot of sense.
In a world where organizations are radically transparent and individuals have radical privacy, before you join the disease support group or donate to the ACLU, you already know the organization's records about that transaction will be public information.
You can restrict your activities to organizations that do not keep personally identifiable information. Or you could join the support group with a pseudonymous identity or donate cryptocurrency to the ACLU.
The entire point of organizational transparency is to prevent organizations from trampling the rights of individuals, so in cases where organizational transparency would trample the rights of individuals, the rights of individuals supersede the need for organizational transparency.
I'm gonna need some context there. To begin with you can't just donate to support a bill. What you probably mean is that he donated to an organization or politician who supported Prop 8, so say that.
The important detail is whether the funds were his personal funds, or whether they belonged to his company: i.e. was he acting on his own behalf, or on behalf of an organization?
It's an inane question, though, because the inane point you're trying to make is that individual privacy might protect homophobes. That's true, but it would also protect gays and allies who donated to oppose proposition 8. The fact that privacy allows people to secretly donate to political campaigns is the point. Part of the reason gay marriage is legal today is that people donated to people like Harvey Milk, at a time when donating to the campaign of a gay governor was risking your job and social standing.
Human rights still apply to humans who do bad things. If you are willing to give up human rights to fight bad people trying to do bad things, then those rights won't be there to protect good people trying to do good things, either.
So a person spends money to support an odious cause, and that person gets money by being associated with an organization. You hate the odious cause, so you want to avoid giving your money to support people who will then give money to promoting that odious cause. How does that work? Do you no longer have the right to not indirectly support things in your ideal system?
Since you're just openly ignoring the post you're "responding to", I'll just copy-paste my response to what you have just said, with some minor changes:
> So a person spends money to support an odious cause, and that person gets money by being associated with an organization. You hate the odious cause, so you want to avoid giving your money to support people who will then give money to promoting that odious cause. How does that work? Do you no longer have the right to not indirectly support things in your ideal system?
Yes, but it would also protect people who donated to a virtuous cause. The fact that privacy allows people to secretly donate to political campaigns is the point. Part of the reason virtuous causes have had any success at all is that people donated to support them, at a time when donating to those virtuous causes was risking your job and social standing.
Human rights still apply to humans who support odious causes. If you are willing to give up human rights to fight bad people trying to support odious causes, then those rights won't be there to protect good people trying to support virtuous causes, either.
> Since you're just openly ignoring the post you're "responding to", I'll just copy-paste my response to what you have just said, with some minor changes:
I'm attempting to clarify my question by removing irrelevant details, which it looked like you got hung up on last time.
> Human rights still apply to humans who support odious causes. If you are willing to give up human rights to fight bad people trying to support odious causes, then those rights won't be there to protect good people trying to support virtuous causes, either.
So you are willing to make it impossible for people to boycott organizations as a means of social change, as long as those organizations had an arms-length relationship with anything "political".
> So you are willing to make it impossible for people to boycott organizations as a means of social change, as long as those organizations had an arms-length relationship with anything "political".
No. Please try to respond to what I actually say instead of making stuff up; this is a straw man argument.
There are plenty of other ways we could find out about organizations supporting odious causes and boycott those organizations, without violating the privacy of their members. In fact the point of "organizational transparency" is to make it hard to hide when organizations do bad things.
In addition to accusing me of saying things I didn't say, you're ignoring what I actually did say. Are you willing to make it impossible for people to privately donate to virtuous causes as a means of social change, when donating to those causes publicly would be a risk to their careers and reputations? I'm not going to continue this conversation further if you won't respond to this point.
> I don't believe organizations have rights, period
So a group of people, joining together in a common cause, don’t have rights as members of that group?
You are contradicting yourself. Organizations are simply groups of people with a shared cause. To deny rights to the organization, you necessarily have to deny personal rights. Example: John and Sam form an organization for comic book collecting. They have a secret meeting to agree on a price to offer a potential seller of a rare comic book. Since John and Sam, Inc. “have no rights” the seller is allowed to sit in the room during their meeting. However this violates the right of Sam to freely associate with John and his right to not have to associate with the interloper. It also violates the privacy of Sam and John as individuals since their private conversation — even as members of their two-man organization — is now being shared with anyone who wants to sit in since, in this scenario, their organization doesn’t have rights.
> So a group of people, joining together in a common cause, don’t have rights as members of that group?
Emphasis added. No, they don't. They have rights as individual people.
> [...] this violates the right of Sam to freely associate with John and his right to not have to associate with the interloper. It also violates the privacy of Sam and John as individuals [...]
Yes, that. "Sam" and "John" would have cause for civil (and probably criminal) action against the interloper. "John and Sam, Inc" has no say in the matter.
It might be useful to grant John and Sam, Inc the privilege to own property, but even that isn't actually a right except insofar as Sam and John have a right not to have the value of their assets (ie 50% ownership of John and Sam, Inc - functioning as a proxy for ownership of various comics) actively sabotaged/vandalized.
> Organizations are simply groups of people with a shared cause.
No, they're not. As soon as you have two people in an organization, you've got two different causes. The shared elements of those causes allow them to collaborate, but each of them has slightly different views of what they're working toward. And even when you have a very small, well-specified goal, every individual in the organization has different levels of investment, and other values and boundaries they're not willing to cross to achieve that goal. And the larger organizations get, the wider the variety of disparate goals that can occur within the organization, because individuals in the organization may not even interact directly with one another.
Example: John and Sam work for Facebook. John and Sam want to make money to feed themselves and their families, and don't want to have PTSD. But Mark Zuckerberg wants John and Sam to look at an endless stream of horrific PTSD-inducing images so that he can maintain the reputation of the Facebook platform and get incredibly rich. Where is the shared cause here, exactly?
> To deny rights to the organization, you necessarily have to deny personal rights.
Just because an organization doesn't have rights doesn't mean we have to go out of our way to take away their rights.
> Example: John and Sam form an organization for comic book collecting. They have a secret meeting to agree on a price to offer a potential seller of a rare comic book. Since John and Sam, Inc. “have no rights” the seller is allowed to sit in the room during their meeting. However this violates the right of Sam to freely associate with John and his right to not have to associate with the interloper. It also violates the privacy of Sam and John as individuals since their private conversation — even as members of their two-man organization — is now being shared with anyone who wants to sit in since, in this scenario, their organization doesn’t have rights.
It sounds like you've figured out that John and Sam, as individuals, each have a right to have a private conversation. John and Sam, Inc. doesn't have rights, but that doesn't suddenly remove John and Sam's individual rights.
The entire point of organizational transparency is to protect the rights of individuals, so obviously if organizational transparency would violate the rights of individuals, the individual rights to privacy supersede the need for organizational transparency.
It's telling that your example of an organization has two people in it. At a small scale, organizations tend to protect the rights of the individuals in the organization fairly well. It's at larger scales that the non-rights of an organization come in conflict more often with the rights of individuals.
You see this inconsistency in individual people all the time. Like how almost everyone wants to be found easily on LinkedIn, Twitter, etc, so they add a public profile photo. Then they freak out because someone starts scraping these photos to build a facial recognition model.
One of my clients is involved in property tax collection and reporting. Property tax records are public info, and their website allows looking up the records for any property without a login. However, the data behind this website is the _source_ of the public records, not the public records themselves (which would be local government databases).
For years now we've been in an arms race with someone using a botnet to scrape all of the account information for a particular county. My client doesn't care so much about the data; it's the server load that's a problem. Normal activity for this site is a few dozen account searches per minute, but when the botnet gets through our blockade it sends hundreds of search requests per second, overwhelming the site. The operator of the botnet has NEVER tried to contact my client to ask for an efficient API to access the data, which they'd probably provide for a minimal fee.
I work on a fully open data repository. The website has the API linked in 3 places, so when I find inappropriate scraping I block it with "HTTP 420 ... see <API link> or contact <email>".
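A sketch of what that kind of block can look like, assuming a simple per-client request counter (the threshold, API URL, and contact address below are placeholders, not the real repository's values; note that 420 is a nonstandard status, the registered one being 429 Too Many Requests):

```python
from collections import Counter

LIMIT_PER_WINDOW = 60                  # illustrative threshold per time window
API_URL = "https://example.org/api"    # placeholder for the real API docs link
CONTACT = "data@example.org"           # placeholder contact address

counts = Counter()  # requests seen per client in the current window

def respond(client_ip):
    """Return (status, body); over-threshold clients get pointed at the API."""
    counts[client_ip] += 1
    if counts[client_ip] > LIMIT_PER_WINDOW:
        # The useful part is the body: tell the scraper where the supported
        # API lives and whom to contact, instead of silently dropping them.
        return 420, f"Rate limited: see the API at {API_URL} or contact {CONTACT}"
    return 200, "page content"
```

A production version would reset the counters each window and log repeat offenders, but the point is that the block itself carries the pointer to the better interface.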
Some people probably switch to using the API, but no-one has ever contacted us. They either give up, or run their scraper on a different computer -- I've seen the same scraper move between university computers, departments, then (in the evening) to a consumer broadband IP.
I really don't understand why anyone would bother writing and using a web scraper when an API exists. Does the API not provide all the same data/functions as the website? Scrapers are a big PITA compared to just using an API: they're much harder to write to be reliable, and they can break at any time, whenever the site makes even the smallest change. APIs avoid all that mess, and make performance far better too (on both sides), since you're only downloading the data you want, not a ton of Javascript and HTML that you don't.
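The fragility gap is easy to demonstrate with a toy comparison (the record, markup, and field names here are invented for illustration): the scraper pattern-matches presentation, so a cosmetic redesign silently breaks it, while the API client reads a stable field.

```python
import json
import re

# The same hypothetical record, served as HTML (two site revisions) and as JSON.
html_v1 = '<div class="profile"><span class="name">Ada Lovelace</span></div>'
html_v2 = '<div class="profile"><span class="display-name">Ada Lovelace</span></div>'
api_response = '{"name": "Ada Lovelace"}'

def scrape_name(html):
    """Scraper: matches the markup, so any redesign can break it."""
    m = re.search(r'<span class="name">([^<]+)</span>', html)
    return m.group(1) if m else None

def api_name(body):
    """API client: reads a documented field, unaffected by presentation changes."""
    return json.loads(body)["name"]

# scrape_name(html_v1) finds the name; scrape_name(html_v2) returns None
# after a mere class rename. api_name(api_response) keeps working.
```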
APIs are often not as complete as the web interface, since the customer sees the web interface and normally the customer is what drives the revenue model of the company.
If pages are driven via an API, then the API is preferable, but publicly facing websites are often a mix of server-side HTML generation and API enrichment, for caching if nothing else.
In that case it seems that the webmasters complaining about scraping need to make sure their APIs actually provide access to all the same data, if they want people to use the APIs instead of scraping.
If the scraper contacted the client, said what they need the data for, and (probably) paid for api access, then my client would probably go for it.
My client is under no obligation to make access to this data easier. It's not really their data either; the information is property addresses, owner names and addresses, and tax assessments and payments. My client wouldn't want to make it easier for scammers to get that data. So they're not going to do anything unless they know the scraper is legit. If that's the case, the API would require authentication, and any fees would be for the server load, not the data.
For what purpose? That’s like suggesting that if people keep jumping your fence and trampling your roses because it’s a shortcut to a public park (in this case, the county records office) that already has public access roads, that you should be obliged to build a sidewalk through your garden, at your own expense, when the real answer should be that the public road should be improved.
If your goal is to get your roses to stop being trampled, it's probably easier to install a few pavers than to spend years petitioning to get a road built.
The ideal answer and the efficient answer are not usually the same.
Yeah, the real problem with scraping is that it's often done very haphazardly and bluntly. Sometimes it's very difficult to tell the difference between a scraper and someone trying to DoS your site.
I'm in a similar job. We block people from scraping if they break a threshold, but we also refer them to the reporting system, which can get all of the information that they are collecting in a variety of formats.
I wonder if something like this would be allowed: if all the public information was available in a well-collated format, then can scrapers be blocked? I imagine that will eventually be fought in court as well.
We did figure out who the scraper probably is, but only after several years. For a long time they used an untraceable botnet, but after blocking that they eventually switched to a corporate network we traced to a data aggregation company. But we don't know for sure who's doing the scraping; it could be the company, a rogue employee, or a botnet that got loose on their network.
At work we have all of our data available publicly as easy-to-parse XML files, but no matter what we do the bot owners refuse to use it. They'd rather hammer our search engine with sequential searches instead.
We've done that, but it's tough to rate-limit a botnet because of the IP address spread. Also, their crappy scraper software doesn't even bother to check if requests are successful; it spews them just as fast no matter how our site responds.
No. The botnets work through multiple regions on multiple cloud providers - that's how they achieve such high throughput. For any single IP address, the load is reasonable, but for the whole botnet it's absurd.
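One way to catch that pattern is to budget on an aggregate key rather than single addresses. The sketch below groups by /24 subnet; a real defense against a multi-provider botnet might aggregate by ASN or behavioral fingerprint instead, and all the numbers here are invented:

```python
import ipaddress
from collections import Counter

SUBNET_LIMIT = 100  # illustrative per-/24 request budget

# Simulated botnet: 500 requests spread thinly across 200 addresses in one /24.
requests = ["203.0.113." + str(i % 200) for i in range(500)]

def subnet(ip, prefix=24):
    """Collapse an address to its network so distributed load aggregates."""
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))

per_ip = Counter(requests)
per_subnet = Counter(subnet(ip) for ip in requests)

# Each single IP looks harmless (at most 3 requests), so per-IP limits never
# trip -- but the subnet as a whole is far over budget and gets flagged.
flagged = {net for net, n in per_subnet.items() if n > SUBNET_LIMIT}
```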
Currently bot traffic accounts for 2/3 of my load, meaning that the cost of providing my service is 3x what it would be without these persistent bots.
Just put an option to download the raw CSVs buried somewhere. Someone who is putting in the effort to build a scraper bot will find that link, and save your server the load.
People don't want their data to be public. People want other people's data to be public. One's own data everyone thinks should be private and tightly controlled. This applies to people and businesses equally.
In this case, LinkedIn users kind of do want their “public profiles” to be public. They’re online CVs; by definition, if you make one, your goal is to get it into the hands of anyone who asks for it!
LinkedIn, likewise, has built its business model on an implicit contract with its users that it’s going to show their CV to anyone who asks for it.
I think LinkedIn users would be surprised that LinkedIn doesn’t let bots read (scrape) their public CV. A CV (individually, rather than in aggregate) is ultimately useful for only one thing: marketing the CV’s author’s skills. Why wouldn’t I want my marketable skills scraped into some private “talent matchmaking” agency’s databases, such that someone could find me—and hire me—when I show up as a result of some fancy OLAP query they paid that agency to run on their scraped data? It’s more roundabout than them just finding my CV on LinkedIn, but I’m still glad they found it!
>I think LinkedIn users would be surprised that LinkedIn doesn’t let bots read (scrape) their public CV.
LinkedIn is really really clear that:
- They won't share your information with 3rd parties
- You're not allowed to use information on LinkedIn for commercial purposes without their permission
- Other users can view your personal data
So, why would I expect random third party companies to be able to scrape and sell my personal information?
My personal information is there for the individual use of others, and for authorised use by recruiters (who are vetted/managed by LinkedIn).
I've chased down the convention spam mail I get using my GDPR rights, and surprise surprise, they got my details by scraping LinkedIn. That is absolutely not expected nor acceptable use of my data...
There’s a difference between “my information” and “the public webpage that I went through a publishing workflow to create from a curated selection of my information.”
Let me put it this way: if I have a Wordpress blog, I’d certainly be miffed if Wordpress let bots see my drafts... but I’d also be miffed if Wordpress didn’t let bots (Google, for one!) see the published blog itself. It’s a blog; a public website! Anyone or anything with the URL is supposed to be able to retrieve the page! It’s not “my information” any more†; it’s been broadcast!
† You might want to mentally analogize to copyright, but I don’t think it’s the right model for the intuition people have here. Instead, try mentally analogizing to confidentiality. When a classified document is published in the public sphere (e.g. as evidence in a trial, as testimony before congress, etc.), this forcibly declassifies it. No matter how much the originator of the document might want to still keep it a secret, the legal protections of confidentiality don’t apply to it any more: it’s out there now. Anyone who reads it could plausibly have just read the public-sphere copy, so there’s no longer any way to charge people who have knowledge of the previously-classified information with any crime.
> Would you be OK with a company scraping your blog and selling it?
Selling it how? If they put my blog posts in a book and try to sell that book, that’s copyright infringement. If they put my blog posts in an ML model corpus to train a translation service, and they then charge pay-per-use access to the resulting service... I don’t think I’d care, nor do I think there’s anything morally or legally wrong with that. If they scrape my name and phone number and generate a Yellow-Pages-like index from them? That’s explicitly allowed by law; and heck, that’s why I embedded the information onto my site in vCard microformat in the first place!
To put my philosophy succinctly: if web.archive.org can scrape your data without you having an explicit relationship with them granting them that right, then bad.evil.com can too. You can allow both (= publicizing your information), or neither (= protecting your information), but you can’t allow one but not the other. “Third parties you don’t have a relationship with, who access your data through the public sphere without entering into a specific licensing arrangement with you” are legally one big amorphous blob. You can’t make a law that splits that blob up, because it’s an opaque blob; in the ACL system that is contract law, all entities you don’t have contracts with are just one entity—“the public.” If you want some specific entities to have access to your information, that’s what protecting your data (= setting an ACL “the public = disallow”) and then explicitly licensing it out by entering into contracts (= setting an ACL “entity X = allow”) is for.
Why can't I have terms on my website that say how you can use my information?
Examples where this is allowed:
- Images/media (Creative commons)
- Code (Open source licenses)
You say it isn't allowed for:
- Personal data
Unless I'm misunderstanding your philosophy (which seems to say copyright is OK, but public information must be public to all): You believe that it's morally OK for me to prevent a company selling my book, but not morally OK for me to prevent a company selling my name, job title and employer as a marketing bundle?
Edit: An aside, it's really confusing that you seem to be editing your previous replies minutes after I responded. I thought HN only let users edit during the "no replies" period?
>Why can't I have terms on my website that say how you can use my information?
>- Personal data
So there's a couple of things in play here. You can't (generally) copyright facts - "Cthalupa is a Rocket Surgeon for the Space Force since 2001", if true, would not be something that I could get a copyright on.
The second thing is that terms have to be agreed upon by both parties. If you give me information without us coming to an agreement on terms, I can't be bound by them. If you just put a link to a TOS on your website and don't require people agree to it before giving them access to data on your website, we did not enter into a contractual agreement.
> Why can't I have terms on my website that say how you can use my information?
Neither Creative Commons nor copyleft (nor copyright in general!) can assert anything about private use. IP rights are commercial rights; they affect sellers of your IP. They don’t affect end-consumers of your IP.
Note that even the GPL can’t force someone to publish the source of their GPLed-library-containing program, if they never publish the program itself, but only build it for their own private use.
Why? Because, by broadcasting the code of your GPLed library, you granted people an implicit use-right to it! Not a redistribution right; not a derivative-works right; but a use right. (If this wasn’t true, then people would be breaking the law by reading “common” newspapers in a cafe, or by listening to the radio, since they never entered into any explicit contract with the distributor/broadcaster.)
How does software licensing work, then? Mostly by 1. companies installing software on computers for their employees to use being considered IP redistributors; and 2. attachment of copyright through sampling when asset samples [e.g. brushes/textures in Photoshop] are distributed through the program. Other than that, there’s really no law forcing end-users to pay for software licenses. This is why e.g. WinRAR would never have been able to sue anybody. They published their shareware binary (without gating it behind a contractual relationship, like Adobe’s Creative Cloud installer); so now you have a use-right to it!
> You believe that it's morally OK for me to prevent a company selling my book, but not morally OK for me to prevent a company selling my phone number?
Copyright exists because your ability to make money from your own creative works hinges on your ability to exclusively license those works. If a publisher can get a redistribution license to your manuscript for free from a third party, why would they buy it from you?
You having exclusive access to your phone number does not make you money; others having access to your phone number does not deprive you of money you could have made by keeping that information private. Thus, there’s no advantage to introducing IP law into this domain (the domain of facts.)
There was a recent court case about someone creating a subway map by copying the raw data from existing subway maps, where the comments went deeper into this.
Right, sure, I hadn't really thought about how copyright isn't really enforceable against individuals. That's very interesting.
However, I'm really not sure how this is relevant to your moral stance on commercial use of "public" personal information.
Why do you believe it's reasonable to prevent unauthorized commercial exploitation of creative works, but not unauthorized commercial exploitation of personal information?
The former simply affects the small percentage of people who sell their works.
The latter affects the vast majority of the population who receive targeted spam, have their information collated and sold for profiling, are victims of identity fraud when those databases are inevitably leaked, etc.
For what it's worth, as I mentioned in my first comment, the GDPR absolutely gives me rights to control how my personal information is used. And the GDPR has a near total exemption for individual use.
What benefits do you see of commercial use against the wishes of the person that published it that outweigh the risks? (making money isn't a benefit)
I think you got the wrong idea if you were thinking I was saying copyright isn’t enforceable “against individuals.”
My example of copyleft was specifically about the thing the Affero GPL tries to avoid (to unknown success): the possibility of someone using GPLed libraries to set up a commercial web service. Because they never release the binary, but only have people interact with it over the Internet, there’s no derivative work being made available in the commercial domain. So copyright doesn’t apply. Even though you’re a company making money off GPLed libraries!
I have a linkedin so that I can point people at it. I also want human recruiters who have actually read the thing to send me relevant jobs. If my profile ended up affecting my credit report, I'd be pissed. I expect you would be too.
People put data places for specific purposes (to show recruiters) and want the ability to limit use to that purpose. How that's accomplished is just a technicality most people don't care about.
Not so sure about that. Messages on LinkedIn are mediated in a single place, and you have a measure of control over how your profile shows up in searches. If your CV is scraped, you could end up anywhere and now you're getting recruiter spam from all over when you're not interested.
> If your CV is scraped, you could end up anywhere and now you're getting recruiter spam from all over when you're not interested.
I would point out that this is still possible (even probable!) without any bots being involved at all. Back before CVs were online, humans working for recruitment agencies would “scrape” information from local, physical job boards by hand into their company’s databases (where “database” here could just mean a filing cabinet.)
IMHO, the real solution to that is a spam filter (or an “agent”, in the old world.) Just because a lot of people want to talk to you, and most of them aren’t very interesting, doesn’t mean they need to be prevented from accessing you—they just need to be prioritized by interesting-ness, which is something you can do yourself, or hire a service to do for you.
I think the GP in this context means, e.g. LinkedIn wants their information to be public in the cases where it benefits them as a business. But then they want it to not be public when it doesn't benefit them. There is no such thing as "public information, except ..." - information is public, or it's not. If none of LinkedIn's data was public, they would have a much harder time getting people to sign up, and having as many users signed up as possible is part of their business model.
From a copyright perspective (since that's what LinkedIn's lawyers claimed): imagine if a newspaper sued another newspaper, saying that - not just the content of its paper - the information in the newspaper was copyrighted and could not be accessed by "unauthorized" third party companies. Either you print it, or you don't!
I want to be able to use LinkedIn to network with colleagues and people in my industry. If someone wants to scrape my profile to make a report on industry trends, I’m fine with it. What I don’t want is hiQ vacuuming up my data so they can snitch to my employer if they think I’m job hunting.
How is this a paradox? Tech — the web in particular — is supposed to be an equalizing force, but HiQ is clearly trying to give my employer more power over me. We are an industry that prides itself on solving difficult problems — how is our response here to just throw up our hands and say “it’s all or nothing”?
Wow, I just looked up what hiQ does and have to say it's pretty scummy in my opinion. Why do people create stuff like this? Don't they know it will likely come back to bite them one day?
For reference:
"There is more information about your employees outside the walls of your organization than inside it. hiQ curates and leverages this public data to drive employee-positive actions.
Our machine learning-based SaaS platform provides flight risks and skill footprints of enterprise organizations, allowing HR teams to make better, more reliable people decisions."
The thing is, GDPR has theoretically solved this in the EU. The UK's ICO is about to publish guidance prohibiting scraping public user information for marketing (where the user would not expect it to be used for that).
It's a really easy solution, because companies need to prove how they got your data when asked.
When you track the source of the mailing list you're getting spam from and they say "We scraped it from LinkedIn", they get fined.
We haven't solved that, in the same way we haven't "solved" encryption not having a magical good-people-only door despite spook tantrums. There fundamentally isn't a possible mechanism, and really wanting it doesn't change that.
It is a result of equality - not of outcome but of rules. Open for everyone except those whose applications you don't like isn't open. On a technical level, trying to prevent it is like the "evil bit" as a solution to malware.
Of course there are possible mechanisms. There are heuristics to detect bots. The whole reason for this lawsuit is that LinkedIn blocked hiQ from scraping their website.
I'm also not necessarily talking about a technical defense against unwanted scraping. Write a law that makes it illegal to do something like "scraping personally identifiable information and storing or presenting it non-anonymized", and prosecute companies who break it. I'm sure there are loopholes in that particular example, but the point is we can absolutely add shades of gray here.
> Open for everyone but those whose applications you don't like isn't open.
Openness should be a means, not an end. If we make something "not open" but it prevents 95% of undesirable uses and only 5% of desirable ones, is that not a tradeoff worth discussing?
"public" is not the right concept here I think. E.g. imagine a composer conducting a public airing of some work of music (e.g. on some festival). That you were able to hear the music in public doesn't mean the {composer,artists,...} give up their copyrights.
I think copyright as you mention here is the right concept, or at least a lot closer. In particular, the limits on copyright. If someone is reciting a list of facts in public, they can’t expect people not to record those facts, because copyright doesn’t apply to that. Reciting the list in public using computers shouldn’t change that.
Agree with this, and I would also add:
- I want to be in control of my data and change the setting.
- I want to be able to delete my data.
If scraping on LinkedIn is banned (and LinkedIn is enforcing it), then I do have control of my data, since I can change the setting and it will no longer be public (it's not perfect, since some might have already scraped it, but the extent would be much smaller). Also, if I decide to delete my data, LinkedIn can do that for data it controls, but not for scraped data.
Scraping information is not the same as posting. There are a number of bots that scrape Hacker News and people here generally consider them pretty cool.
Isn't that a different paradigm, though? A posting bot set loose on a forum/platform will (normally) degrade service in a much more visible and impactful way than a scraping bot. And in either case, writing (and running!) a bot that posts on HN is not illegal behaviour in itself.
I'm on Hackernews to see interesting articles and read interesting conversation. If a bot can post interesting articles and make interesting conversation, I'm not sure I care that it's a bot. And if a human can't do those things, I'm not sure I care whether or not they're 'real'.
That focus on "we don't care if you self-promote, we don't care why you're here, we just want you to be a good citizen" is part of why I like HN.
It's not completely black and white, but in general I believe that users online have the Right to Delegate[0]. That right should only be legally taken away if there's a really, unbelievably compelling social justification for doing so. I am pretty skeptical that banning web scraping has that kind of justification.
I don't really care if the comments are from bots, per se. I care if they are quality comments or not. Whether or not the comments are from bots is just a proxy for whether or not they are actually good.
Humans cannot directly access websites - a machine is always involved. I know perfectly well what you mean, but on a deeper level the distinction is fundamentally imaginary.
The best is some sort of heuristic like captchas, and even those can be outsourced so that the human doing them isn't actually viewing the content.
The thing about a bot which bothers people is the behavior anyway. A bottish acting human would get people just as upset.
Yeah, I have nothing at all against scraping per se, it's more about the huge bot traffic the commercial scrapers generate, which would ALSO be fine, but 1) it can be hard to tell scrapers from malicious DDoS bots sometimes and 2) the person being scraped literally pays for that scraping traffic.
Yeah, but where do you draw the line, legally, between 'oh sorry, we dropped your request because of rate limiting' (or whatever mitigation strategy) and 'oh, we dropped your request because u scraping us bro'? IANAL, but this lawsuit seems to indicate that putting barriers in front of scraping attempts is a no-no.
There's plenty of grey here. For example, scrapers that try to check people in for flights to get better seats - some even tried to charge for that. That creates problems, where some customers benefit at the expense of others, high load on a "locking type" piece of code, etc. Similar for ticket sales for concerts, and probably other spaces.
There are also companies that provide added value by compiling and correlating "public info" in a useful way that creates value. If Google let me scrape their search and remove ads, it would be popular, but is it "legal"? Or maybe Google Maps?
I would think, and of course could be wrong, it would be as legal as Google scraping all of the web sites that they do in order to create their search engine in the first place. In particular, Google provides cached versions of web pages. That's pretty hardcore scraping.
Another problem with web scraping is a B2B website offering services for B2C companies to better reach consumers. It can be a tricky thing to do without basically giving your clients list to all your competitors.
Sure. My data is still my data, and if I publish it on my platform for free, that still shouldn't automatically give you the right to copy the data and provide on your platform.
It's basically the same as a TV broadcasting a film for free, and then going after you legally if you recorded that film and uploaded it to your website.
This does not legalize theft; it says sites cannot respond to suspected scrapers differently than they respond to non-suspected scrapers. You can still rate limit, as long as you do it universally to all site users.
Copyright law is unchanged. If someone scrapes your blog and then re-uses your posts on their own blog, you still have a possible copyright infringement claim.
This is a bad analogy, as scraped but copyrighted works are still protected by copyright. What's in question is whether you should have exclusive rights to information you have shared but did not copyright. Seems people also think this non-copyrighted data should be protected even if it's munged and added to an original work by the scraper.
Billions are being made collecting data and content for free and publishing it along with targeted advertisements. Web scrapers can collect that aggregated data and redistribute it or create competing services. This is terrifying to certain tech giants and threatens their moats and lock-in/network advantage. They will lobby hard against it with PR campaigns citing everything from security, privacy, and copyright to all sorts of other exaggerated bogeymen. At the end of the day it's mostly about preserving their monopolies, though.
I can understand why some do not want scrapers - increased traffic (with practically zero benefits to the owners) is one obvious reason.
(Some people will then say "But why not just offer APIs", but that's a lot of extra work and maintenance).
It's like with Instagram and other social media platforms. The content creators put in the hard work, while the leeches are stealing content for their own benefit, giving zero credit to the original content creators.
Let's not conflate a) a person's personal data, and b) a business's dataset. The GP and the article are clearly referring to the latter. Preventing web scraping won't protect users from businesses collecting their data.
What I'm trying to say is: For every popular content creator on IG, there are tens and hundreds of (more or less automatic) content curators that do nothing more than scrape content with lots of likes, and re-post on their own channels. Then when they get sufficient followers, they make money through paid product placements, account flipping, pay-to-play sharing, and what not. More often than not, there's no linking to the original pages / creators.
> People want their data to be public and all of the benefits that come with public data, but then they want to choose who gets to see it.
By "people" do you mean businesses or actual people? Because I don't think people want everything to be public; many in fact use various networks to avoid oversharing, and even then many don't want their old bosses or exes looking at their profiles - there just don't exist tools to limit access that granularly.
I don't think the direction would ever be clear, even if the legality were clearly established. The arms race would intensify, and detecting/blocking/deceiving scrapers might become a lucrative field.
Companies want to provide some information to some people; but providing all information to all people is analogous to allowing customers to make a meal of free food samples, on a recurring basis.
> The arms race would intensify, and detecting/blocking/deceiving scrapers might become a lucrative field.
It already is. There are entire companies, like Distil Networks, that exist solely to protect companies from bots/scrapers/etc. Actually, it looks like Distil got acquired and is now part of Imperva, but anyway, the idea is the same. This is definitely an existing field.
Disclosure: former Distil employee, but I have no financial stake in this discussion, and have mixed feelings about scraping. Clearly it can be beneficial in some situations, but when I think about having to pay exorbitant prices to scalpers for tickets to an event, because they used a bot to buy up all the tickets, that is less appealing.
> People want their data to be public and all of the benefits that comes with public data but then they want to chose who gets to see it
Can you really not imagine a world where a person accidentally or in poor judgment uploads something private to their own site (their real name, home address, credit card#, or any piece of highly damaging information that could cost them their careers) and wishes for it to be removed? (but can't because many of these scraping sites never respond to takedown requests)
People make mistakes and post things they shouldn't. A mistake from someone many years in the past that they've made amends for shouldn't haunt someone for the rest of their lives.
But it does when we decide that every single line of text ever uttered online must be preserved and easily accessible by anyone for all eternity.
Blocking scrapers is an arms-race escalation because these sites refuse to remove content, and it's used as a tool for character assassination by bad actors. It's a proactive defense.
It sounds like the same old issue - lack of understanding of the fundamentals of the underlying technologies.
Otherwise they would realize that what they demand is contradictory and incoherent, like demanding to be both viewable by all and not viewable. DRM is one fundamental example of this.
they want their data to be public for a specific use; I feel like that's pretty easy to understand.
LinkedIn: they want their info to be public -> to get jobs
> This precedent doesn't really mean much but is definitely a step in the right direction.