No ultimate decision was ever made, and no, this doesn't make web scraping 100% legal. Wake me up when there's a new announcement, because anyone interested in this already knows this old news.
This is a really big deal. Currently (IMHO) the US Supreme Court is a wholly-owned subsidiary of multinational corporations due to the shenanigans that happened with Obama, McConnell and Garland, so it will likely side with LinkedIn since it's the larger corporation.
I feel like siding with LinkedIn here would open up the web to extortion though, like troll companies that would send cease and desist letters to all scrapers (even search engines). I think it could be argued that letting one company scrape when another is denied is discrimination.
Then again, I don't know how conservative and republican-leaning courts decide corporate law. Maybe in this case since so much money is at stake, they might worry that banning scraping would infringe on something like free speech and ruffle the feathers of some of the wealthier contributors in their base. Especially on the media side since I imagine they use bots in one form or another to find newsworthy stories.
IANAL (obviously!), I just find it entertaining/dismaying to ponder these things in these times.
The thing is, this is still all about a preliminary injunction. Even if the injunction is found to be without merit, that still doesn't provide a final answer as to whether LinkedIn can successfully sue HiQ under the CFAA to force HiQ to stop scraping.
> Sign up now for free access to this content
> [...]
> Email (NOTE: Free email domains not supported)
I think this sort of antinomy[1] sounds ironic, especially in this thread. It's practically a legal dependency injection pattern: you call something free, then administer antinomies, catch-22s, etc.
"HiQ only takes information from public LinkedIn profiles. By definition, any member of the public has the right to access this information. Most importantly, the appeals court also upheld a lower court ruling that prohibits LinkedIn from interfering with hiQ’s web scraping of its site."
Surely I'm not reading this correctly. This would seem to suggest that websites are not legally allowed to prevent bots from crawling their sites. Lots of sites have ToS preventing such things, are those legally void now? Are captchas on public pages illegal, even if you request the page 8000 times in a second?
"In this case, hiQ argued that LinkedIn’s technical measures to block web scraping interfere with hiQ’s contracts with its own customers who rely on this data. In legal jargon, this is called” malicious interference with a contract”, which is prohibited by American law."
This is almost weirder. If LinkedIn wanted to force users to sign in to view profile info, would they be not allowed to do that because some company had signed a contract that implicitly assumed access to that data? If someone writes a web scraper for my site, and I unknowingly change my site in a way that breaks that scraper, can a court force me to revert the change?
Seems to imply that every business is somehow beholden to every contract signed by anyone.
> Lots of sites have ToS preventing such things, are those legally void now? Are captchas on public pages illegal, even if you request the page 8000 times in a second?
ToS are subservient to the law; you can (probably) terminate a service account from a user that breaks your ToS, but if the user does not have a service account (as is the case for HiQ, it doesn't seem they were using accounts for it), then your ToS does not apply, since you've technically not entered a binding legal contract with them.
> This is almost weirder. If LinkedIn wanted to force users to sign in to view profile info, would they be not allowed to do that because some company had signed a contract that implicitly assumed access to that data? If someone writes a web scraper for my site, and I unknowingly change my site in a way that breaks that scraper, can a court force me to revert the change?
IANAL, but I believe that'd fall on intent, and intent is often difficult to prove at a personal level, but not necessarily at a company level. If your intent for putting up barriers that happen to impact scraping, whatever they may be, was indeed to knowingly prevent scraping from a particular company, then you may be liable under this decision. This is the only part of the decision I'm torn on, since it's a bit messy to really prove such things. I'd be much more comfortable with allowing companies to take whatever measures they feel necessary to prevent scraping, and also allowing scrapers to legally circumvent those measures without threat of prosecution, assuming they didn't actually hack into anything.
> but if the user does not have a service account (as is the case for HiQ, it doesn't seem they were using accounts for it), then your ToS does not apply, since you've technically not entered a binding legal contract with them.
Are you sure about this? I am not a lawyer, but I believe that the Terms of Service applies to all users, not just those that explicitly set up a user account.
I have interpreted the LinkedIn ruling to mean that scraping public data is no longer criminal activity but it still leaves you open to civil lawsuits for violating the ToS of the website you are scraping.
> Are you sure about this? I am not a lawyer, but I believe that the Terms of Service applies to all users, not just those that explicitly set up a user account.
How would that even work? If I browse to any random public page of your website, it's served to me before you've even transmitted the terms of service. How could I be bound by those terms of service when I haven't even seen them?
As an engineer, I agree with what you are saying, but I think normal people and the courts disagree.
I think these sorts of contracts are called Adhesion Contracts (https://www.investopedia.com/terms/a/adhesion-contract.asp) and we interact with them all the time. For example, if you valet your car, the valet will hand you a piece of paper with a number printed on it to retrieve your car. On that paper you will find an adhesion contract that is valid and real (although not as powerful as the types of contracts that you sign).
This does not work, at least for software licensing, based on precedents for shrink-wrap contracts, so it likewise would not work for licensing the use of data.
A paper handed to you by the valet is not an immediate contract, since you can refuse to agree to it and the service simply doesn't happen.
You cannot do that with a publicly visible website, unless you show ToS and require agreement before first use.
If you only allow a non-transferable license, then said data cannot be used by a search engine. If it's transferable, you've just pushed the problem to scraping a different bot.
(Well, you could have a direct agreement with a few major search engines.)
IANAL, but it seems like ToS could still govern your use of the data which you viewed. Sure, it seems like you couldn't claim any violation based on visiting a random page. But if the ToS is clearly identified on the page and you do something with the data that violates them, perhaps the owner of the site has a case.
Except it sounds like the owner doesn't. If the information is on the page made public, the owner of the page can't place terms on what is done with the data downstream. They'd have to implement some real binding system such as authentication where CFAA would apply. (IANAL)
Correct, but all of that is void if the data presented is any sort of protected information (copyright, IP, etc.). You can't, for example, scrape Yahoo Finance for pricing and dividend history and republish on your own stock tools website. They have a license to redistribute that data and publish on their own website. Similar story for copyrighted text and things of that nature.
That would require at least showing that ToS on first use. A link on a page is insufficient.
And said ToS would have to force copyright reassignment rather than a general licence, making LinkedIn culpable for any unlawful content published by users of its site.
I am a lawyer, and there isn't really an easy answer to these questions.
TOS are a lot like EULAs. If they look like contracts of adhesion, then they're going to get more scrutiny and skepticism. A TOS that you claim applies to every single random visitor to your site, even though they never affirmatively agreed to the terms, is potentially going to look more like a contract of adhesion. That's a lot harder to enforce.
If they are used more for CYA so that you can ban undesirable accounts from your website which people explicitly agreed to when they signed up for it, or so that you can just up and alter your entire business model without having to give all of your customers refunds, then they're easier to defend.
Just my general opinion, of course. Every jurisdiction is different.
Also not a lawyer, but you cannot force me to accept your terms of service. Contract law requires both parties agree to enter it.
When you create an account, etc., you are agreeing to those terms. If I browse a public webpage that just has a terms of service link on the bottom of it, I've not agreed to anything.
> Are you sure about this? I am not a lawyer, but I believe that the Terms of Service applies to all users, not just those that explicitly set up a user account.
Typically you'll see TOS say something along the lines of "by continuing to access this site you agree..." or "if you do not agree with these terms you may not access this site..."
Whether that's enough to create a binding contract depends on the jurisdiction and who you ask.
It can also depend on the terms themselves. I can put "by using this site you agree to bake me a chocolate cake" on my website all day, but that doesn't mean I will be able to force you to bake me a chocolate cake.
From the article, the LinkedIn decision was that scraping data does not violate the Computer Fraud and Abuse Act. Violating that act was considered to be criminal activity. (https://en.wikipedia.org/wiki/Computer_Fraud_and_Abuse_Act)
But the claim of a violation was only a claim as part of a civil trial. The law has both civil and criminal elements to it, and this is about the tort part of the law.
LinkedIn made threats accusing hiQ of criminal behavior, but that doesn't mean there's any criminal precedent being set here, as far as I can tell. And no one was criminally charged.
Separately, part of the ruling states that for the purposes of authorization, defying a cease and desist letter does not constitute illegal access, which might have some criminal implications. They imply some sort of technical authorization system must be bypassed, which didn't happen, since the data is "public."
(Which doesn't square well, imho, with existing meatspace law. If a public serving business banned someone from their store, the door being unlocked isn't an excuse to ignore that ban and trespass. But I digress.)
With the overlapping areas of law, it's admittedly beyond my understanding. But the law is generally viewed, like the DMCA, as being overreaching, if not at least partly unconstitutional.
The CFAA is overreaching, and used often as a catch all. 'Reply All' has a good episode which explores this. This is actually what was used against Aaron Swartz when he was charged for downloading academic journals from MIT, and why his charges were unjustly severe.
There's a long, long history (probably hundreds, if not thousands, of years old) of selling aggregated or processed publicly-available information.
I'm not particularly thrilled with it, but enough people think of it as a valuable enough service to pay for; even if they know they could get it themselves, for free.
LinkedIn users (as opposed to the company) might actually like what HiQ is doing, as it may help their own prospects.
That's true in the current situation, though I would prefer we ensure that free data stays free. In that case, buyers of data would be incentivized to pressure providers of free data to improve the data quality.
The data does remain free, as long as LinkedIn still provides it for free.
The data without the noise is what you're paying for. The service of winnowing out what you care about from what you don't care about.
Considering how big an effort it is, and that the source from which it came is still available, why should the cleaned data be free? If I collect fallen trees from public land and chop them into usable firewood, should my bundles of firewood also be free? Or if I collect solar power with my own solar cells, should I have to give you the electricity for free?
I think this is especially relevant when it comes to things that fall under disclosure & transparency requirements - a lot of information that is legally required to be made available isn't legally required to be convenient. So, as a patient, you may have the absolute right[1] to a free copy of the charge master[2] of a hospital you're admitted to, but it could be required that you pick it up in person or that it is only supplied in microfiche form... so a company that's aggregated this and is reselling it can deliver real value.
1. This specific example is BS but plausible - I just wanted something more specific than the vagaries around things like FOIAs or shareholder reports which both have specific facts that can be rendered useless unless you have the context.
I'm thinking of processed GIS data. If you have ever tried using the various formats that are supplied by government sites, you know what a huge pain it is.
I'm happy to pay a reasonable price for an interpreted and bowdlerized version.
I actually have! I had to import a huge file of all of the culverts around storm drains in a state, and each culvert was multiple pieces of geometry, none of them grouped together in any logical way. It was just a huge list of rectangles that looked like culverts when viewed visually, but there was no way to identify them as belonging to one culvert without heuristics on how close each rectangle was to the others. A massively long process that should not have been necessary.
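The proximity heuristic described here can be sketched quite simply. Below is a minimal, illustrative Python version (the `(x, y, w, h)` rectangle format and the gap threshold are assumptions, not the actual layout of the state's GIS export):

```python
def rects_close(a, b, gap):
    """True if two axis-aligned rectangles (x, y, w, h) lie within `gap` units."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return (ax <= bx + bw + gap and bx <= ax + aw + gap and
            ay <= by + bh + gap and by <= ay + ah + gap)

def group_rectangles(rects, gap=5.0):
    """Cluster rectangles whose edges lie within `gap` of each other,
    using union-find, so pieces of one feature (e.g. a culvert) group together."""
    parent = list(range(len(rects)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Union every pair of nearby rectangles (O(n^2), fine for a sketch).
    for i in range(len(rects)):
        for j in range(i + 1, len(rects)):
            if rects_close(rects[i], rects[j], gap):
                parent[find(i)] = find(j)

    groups = {}
    for i, r in enumerate(rects):
        groups.setdefault(find(i), []).append(r)
    return list(groups.values())

# Two nearby pieces plus one distant piece -> two groups.
groups = group_rectangles([(0, 0, 2, 2), (3, 0, 2, 2), (100, 100, 2, 2)])
```

The pairwise check is O(n²) in the number of pieces; for a statewide file you'd want a spatial index, but the idea is the same.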
The data is free, but the aggregated, formatted data has been worked on and processed. Are you saying the resulting aggregated data should also be free? That isn't going to happen; why would anyone do that work for free?
Or are you afraid LinkedIn and others will make everything private? It's completely up to LinkedIn and individual LinkedIn users what they want to make private vs. public. Maybe more data would be made private if they don't want it scraped. I don't think that's inherently a good or bad thing.
I'm trying to puzzle out how this works in practice. So if LinkedIn has truly public data (no login required to view) then it can be scraped no problem.
But if it's only accessible with a login, then it falls under TOS and they can be blocked?
> Surely I'm not reading this correctly. This would seem to suggest that websites are not legally allowed to prevent bots from crawling their sites. Lots of sites have ToS preventing such things, are those legally void now? Are captchas on public pages illegal, even if you request the page 8000 times in a second?
This is just a preliminary injunction. This wasn't an actual ruling on the case. This just says that until there is a ruling they can't stop the scraping to make sure the company isn't put under while waiting for an actual ruling.
You don’t understand what a preliminary injunction is then.
It’s a very, very strong indication that they will win. Courts don’t issue preliminary injunctions unless it’s extremely likely the side who won the preliminary injunction will win.
Huh, I thought in the USA they also issued them to avoid the eventual judgment being rendered irrelevant. So, where the case is not clear cut, the injunction could prevent one party acting to 'kill' the other (and so avoid judgment) in the meantime?
Could you cite something on this that indicates this (my understanding here) is wrong?
It only requires a “substantial” likelihood that side will win (not an “extreme” one), which basically means there’s a substantive dispute. The more difficult criterion is a substantial likelihood that irreparable harm will occur if the injunction isn’t granted (irreparable harm is supposed to be a pretty extreme thing — it means you can’t fix it with any amount of money).
> This is almost weirder. If LinkedIn wanted to force users to sign in to view profile info, would they be not allowed to do that because some company had signed a contract that implicitly assumed access to that data? If someone writes a web scraper for my site, and I unknowingly change my site in a way that breaks that scraper, can a court force me to revert the change?
LinkedIn has long wanted to have their cake and eat it too - they advertised that data as being publicly accessible and allow google to index specific user pages but then attempt to restrict other bots from crawling it.
If you have private data behind a login there isn't an issue here - if you have public data but want some people to login before viewing it (or not be able to view it) then that's where this ruling comes up. So, this mostly hits sneaky SEO folks and dark UX patterns that rely on tempting someone with accessible data and then pulling the rug out from under them at the last minute.
If your website places data outside of authentication then everyone should be able to see that data... I'm curious to see the specifics around
> Surely I'm not reading this correctly. This would seem to suggest that websites are not legally allowed to prevent bots from crawling their sites. Lots of sites have ToS preventing such things, are those legally void now? Are captchas on public pages illegal, even if you request the page 8000 times in a second?
though - DoS attacks are clearly illegal, but with this precedent there's going to be a lot of back and forth to see where the line between DoS and scraping falls... and I think that makes this precedent a lot weaker than the headline would have you believe. A company can still threaten to drag you through a lot of litigation by accusing you of malicious page requests, it'll take a few cases to define where that line needs to fall.
This reminds me about Twitter, when I click to see a thread for a tweet it asks me to login, but if I open the link in a new tab it loads the thread just fine.
LinkedIn wants their data to be scraped by bots, so they have to keep it public; otherwise you wouldn't find people's profiles from Google. They just don't want bots from their competitors like hiQ to scrape it.
To me, this is crucial. If it's public and available for google, it's public and available for everyone. If you want content to be private, then make it private and accept that you won't get search engine traffic. Otherwise, don't be surprised when your publicly accessible content is accessed by gasp the public.
In other words, the judges said that LinkedIn couldn't use the US legal system to force HiQ to stop. Judges didn't say that LinkedIn was barred from using technical measures.
The court did allow a preliminary injunction against LinkedIn, due to the possibility of "monopolies" (to be determined in Court later), pending resolution of that latter question.
LinkedIn might still win their claim to their right to block scrapers via technical means.
LinkedIn can't prevent HiQ from attempting to scrape their site through force of law.
LinkedIn can rate limit requests, make their site hard to scrape, change their format, whatever. LinkedIn is in no way responsible for how HiQ fulfills its contract to its customers. HiQ is attempting to say that if I sign a contract to provide you with a Tesla, then it would be illegal for Tesla to stop me from just taking one from them to give to you. If that sounds stupid, that's because it is.
The court document says "... refrain from putting in place any legal or technical measures with the effect of blocking hiQ's access to public profiles." on page 11. I wonder if they mean targeted measures specifically blocking hiQ but allowing others such as Google.
> hiQ also asked the court to prohibit LinkedIn from blocking its access to public profiles while the court considered the merits of its request. hiQ won a preliminary injunction against LinkedIn in district court, and LinkedIn appealed.
Whether LinkedIn is the good guy or bad guy here doesn't matter when the decision creates precedence for the rest of us.
Surely a healthier precedent is that we can respond arbitrarily to requests and have no obligation to the requester. So what if I want to randomize the html structure on every request or block requests from Tor because 100% of them are abuse? Can someone take me to court on the grounds that either is effectively "blocking" their scraping syndicate? Why not?
I feel like once CFAA is off the table (which I do agree with), the cat and mouse game is a fair middle ground. Keep web scraping a sport!
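For what it's worth, the cat-and-mouse moves I mentioned are cheap to implement. A toy sketch of per-request markup randomization (illustrative only: the function and class names are made up, and a real implementation would have to rewrite the matching CSS too):

```python
import secrets

def randomize_classes(html, class_names):
    """Swap stable CSS class names for per-request random tokens, a simple
    anti-scraping measure: scrapers keyed on fixed class names break on
    every request, while human visitors see an identical page."""
    mapping = {name: "c" + secrets.token_hex(4) for name in class_names}
    for original, randomized in mapping.items():
        html = html.replace(f'class="{original}"', f'class="{randomized}"')
    return html, mapping

page = '<div class="profile"><span class="name">Ada</span></div>'
scrambled, mapping = randomize_classes(page, ["profile", "name"])
# A scraper selecting on class="name" now finds nothing.
```

Naive string replacement like this is fragile on real HTML (an HTML parser would be safer), but it shows how little effort the defending side needs per round of the game.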
There is a large banner next to the highway that shows some weather information which, if properly organized (let's say as a monthly almanac), people would pay money for. The banner owner does not make money this way - he asks you to go to his website and sign up for an account. But you drive the highway (internet) every day, look at the banner, write down the weather updates, and then offer them for sale on your website. The owner gets angry and sues you. The court decides you are free to drive by the highway and free to put your eyeballs on their weather banner, especially given the banner is available to everyone (LinkedIn profiles are available to view without needing an account), and you are free to use the information you obtained for free, without interference, in the form of a monthly almanac that you sell. At the end of the day, the banner owner does not own the weather information that someone else put there (for example a meteorologist).
Personally, I think it's a healthy decision. Otherwise it would be similar to prejudice about who should be allowed to enter and browse a street store that by law is open to everyone.
This would mostly mean that you cannot start interfering with webscraping you previously allowed merely because you learned that they're making money with the scraped data.
It seems absurd if the 'interference' only directly affects their own property. Like, if my neighbors start monetizing livestreaming my backyard, suddenly I can't put up a fence? Except worse because in actuality, this third-party contract is costing them money through server load and bandwidth.
Your analogy doesn't hold. Your backyard is private property. The data that LinkedIn publishes is intended for the public. That's why Google can index the pages and give you results from LinkedIn.
It does, in the US. You're likely making an inconsistent comparison.
Property ownership has nothing to do with visual access.
You cannot legally be barred from casually (involuntarily) perceiving something. It's reasonable to put up physical barriers to reduce what is casually perceived. It's a very good analogy.
However it doesn't hold - as your neighbor I can't bar you from putting up a fence because it'll intrude on my view of your property... granted people try to do that _all the time_ but I think it's commonly understood that putting up a fence for privacy is allowed.
It's also not a great analogy for this case because another party is given continued easy access to view my backyard while the first party is denied - and the analogy breaks down here because, as a neighbor, I have no inherent right to view your private life at least as much as any of your other neighbors.
It's trivial to fix that - the exterior of GP's house then. That's available for public viewing; is intended for it, but is private property. If you monetise livestreaming it and describe it in your ToS, GP can't repaint the front door, or get new windows?
Or perhaps slightly less contrived:
If I publish a monthly lowlights reel of my favourite sports team as a podcast discussion on where they can improve in all their lost games, and then they suddenly go on a winning streak for over a month, so my USP is gone and I have nothing to talk about..?
Those examples don't fit because they are contracts not made in good faith. They aren't things you can control.
In this case, it was ruled that the public data is available. It was a good-faith contract on the part of HiQ to assume they could collect public data from a public website.
It would not be a good faith contract to assume you could control the paint colors on a property you don't own.
It seems to me that the interference ruling was wholly independent of deciding whether what hiQ was doing is legal.
Does that mean that if a grocery store offers free samples, I can go in every day and take all the samples, and the grocery store is not allowed to selectively prevent my access?
It means that if they're offering free samples and refuse to offer you the same service they're offering to other customers, they might be in hot water - which is consistent with what a lot of folks consider ethical. Offering an item for free to some folks and not to others is a form of discrimination - usually not a particularly troubling form, but in this case Google is allowed to walk up and take all the samples while the grocery store manager just smiles and nods, yet when you (hiQ in this example) try to get one, you're hit with an injunction and barred from entry.
I mean, anyone can be sued for anything. I can file a lawsuit with basically zero legitimacy to it. It'll probably get thrown out, but you were still sued.
If the question is could someone win, potentially. The argument would basically have to be that the removal of that open source project is akin to other cases of negligent interference.
If this is a specific concern, consult a lawyer - 'cause I'm not one.
Exactly. Your backyard is of course yours. But you are not at liberty to use it to damage others. There are lots of rules about this. For example, opening a brothel on your own land is definitely not legal without considering how it affects the neighborhood.
Are you doing it just to spite scrapers, i.e. with "malicious intent"? If you have some other reason, you won't be guilty of intentional tortious interference.
They want search engines to index their profiles and provide organic search results links to their site, but then those same sites will require you to sign in when clicking a link to another public profile. You can search for that 2nd profile in Google and then view it without signing in, but not by clicking internal links. I've experienced this with Quora, LinkedIn, Instagram, FB and others. They want to have their cake and eat it too.
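A speculative sketch of how that behavior can be implemented server-side: gate on the `Referer` header, so search-engine visitors get the public page while internal navigation hits a login wall. (This is a guess at the mechanism behind what I've experienced, not any named site's actual code.)

```python
SEARCH_ENGINES = ("google.", "bing.", "duckduckgo.")

def should_gate(headers, session):
    """Decide whether to show a login wall for a 'public' profile page.

    `headers` is a dict of request headers and `session` a dict of session
    state; both are stand-ins for whatever the web framework provides.
    """
    if session.get("logged_in"):
        return False  # signed-in users always get through
    referer = headers.get("Referer", "")
    # Visitors arriving from a search engine see the public page...
    if any(se in referer for se in SEARCH_ENGINES):
        return False
    # ...while internal clicks (or no referrer) trigger the sign-in prompt.
    return True

gated_from_search = should_gate({"Referer": "https://www.google.com/search?q=x"}, {})
gated_internal = should_gate({"Referer": "https://example.com/profile/1"}, {})
```

Since the `Referer` header is entirely client-controlled, a scraper defeats this by simply sending a search-engine referrer - which is part of why the cake-and-eat-it-too posture is so flimsy.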
As a user of LinkedIn, I can pick which portions of my profile information I would like to be publicly available. This is not by default, so most people do not have it public. You can try seeing my profile without logging in. :-)
Your second point is interesting. I suspect the contract between hiQ and some company is that hiQ provides info on public profiles, and if LinkedIn removes all public profiles by requiring a login the contract would become moot. Just the same if I was to change my profile settings from public to private, hiQ wouldn't be in breach of their contract (nor would I).
Scraping should either be legal or not. The fact that you have a contract to sell content you assumed was legal to scrape should not matter. Too bad if you lose money.
They were pretty much legally void even before this precedent was established. They are only valid when they don't violate any existing U.S. law. Any authority assumed beyond that is completely false.
I wonder if it has anything to do with the fact that the data is actually owned by LinkedIn users, and they expressed that they want their data to be publicly available?
Unlikely. Under the license granted to LinkedIn, the user retains ownership of the information, but that retention of ownership doesn't compel LinkedIn to affirmatively do things with the data (i.e. LinkedIn isn't forced to vend the data to a given consumer just because the user says so).
The license further goes on to clarify that LinkedIn will vend public data to search engines, but the definition of "search engine" is almost certainly assumed (by LinkedIn, at least) to be up to them.
The toxicity towards web-scraping is really what makes me lose hope in the current web.
People want their data to be public, and all of the benefits that come with public data, but then they want to choose who gets to see it - it's a complete and utter paradox.
This precedent doesn't really mean much but is definitely a step in the right direction.
The issue here for some, if not many, is a matter of scale. It is one thing if an end-user, whom I am trying to service, comes to my site and gets my publicly available data. Maybe I monetize with ads, maybe not. It doesn't matter, that is the audience I am trying to service, regardless of size.
But when you scrape it my load goes up dramatically. A load I have to pay for.
It is analogous to the privacy debates going on, with one side saying "hey, don't track everywhere I go and tag me with facial recognition" and the other side saying "hey, you are in public and people can see you." The issue is not complete privacy, but one of scale. And of intent.
I believe society is soon going to have to come to grips with the scale of things and legislate what are acceptable scales of action, as it seems to be becoming a large issue in a growing number of areas.
So you throttle your users. We have an HTTP status code for "too many requests," and all scraper software comes with a delay setting by default. Everybody who does scraping is supposed to know that it's rude to blast a thousand requests per second.
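The polite-scraper side of that convention is a few lines of code: delay between requests and back off on HTTP 429 (Too Many Requests). A minimal sketch, where the `fetch` callable is a stand-in so the example stays self-contained (a real scraper would swap in an actual HTTP client):

```python
import time

def polite_fetch(fetch, url, max_retries=3, base_delay=1.0):
    """Fetch a URL, backing off whenever the server answers 429.

    `fetch(url)` must return (status_code, headers, body); a real
    implementation would wrap urllib.request or similar.
    """
    delay = base_delay
    for _ in range(max_retries + 1):
        status, headers, body = fetch(url)
        if status != 429:
            return status, body
        # Honor the server's Retry-After hint if present, else back off exponentially.
        time.sleep(float(headers.get("Retry-After", delay)))
        delay *= 2
    return status, body

# A stand-in server that throttles the first two requests.
calls = {"n": 0}
def fake_fetch(url):
    calls["n"] += 1
    if calls["n"] <= 2:
        return 429, {"Retry-After": "0"}, ""
    return 200, {}, "<html>profile</html>"

status, body = polite_fetch(fake_fetch, "https://example.com/profile")
# Succeeds on the third try after two throttled attempts.
```

The Retry-After header and exponential backoff are the standard politeness mechanisms; a server that sends them and a scraper that honors them rarely need lawyers.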
This ruling has left open a big question of how much you need to spend to support scrapers and where the line between scraping and a DoS attack lies - and that's going to be a weird line. If my site is producing a big report off of data that changes quarterly then re-downloading that report every 20 minutes is possibly excessive and might wander into the realm of an attack - while if we looked at the same frequency with twitter it seems a lot more reasonable - maybe even a bit on the slow side.
Entirely feasible. Also reasonable for you to pay me for the service as it is taking my development efforts to meet your business model. The advantage to you is you have a defined interface that I won't prevent.
I guess you missed the comment I was replying to: it may cost you more money, in bandwidth and per page resources, to not provide an API than it does for you to provide one.
So no, I won’t pay you for the privilege of you saving money.
Who pays for that API and the bandwidth? What's in it for the data provider? On LinkedIn, viewing the data now shows ads or at least prompts the viewer to join the network. With scraping and free API access, how exactly does LinkedIn benefit from their work of hosting the data?
Unless the costs exceed their current operational costs. Don't forget the time spent redeveloping on the new API, which includes validating everything is there, testing and cleaning up and removing the old (working) code.
This isn't a great analogy here - getting the data delivered via API is simply more useful than having to re-assemble that data out of fragments parsed off of different web calls.
Could I suggest:
"Why buy the cheese when you get the milk for free?"
Look, if you build a product that relies on providing free information to the public, then you don't get to select a segment of that public and charge them for it. You can't hang a billboard on a highway, but then get upset when some people look at it the wrong way.
Now, if you want to have a walled garden and charge for entry to some and let others in free then that is fine.
> People want their data to be public and all of the benefits that comes with public data but then they want to chose who gets to see it - it's a complete and utter paradox.
That's a complete misconception. Of course you can manufacture inconsistent ideologies if you combine ideas from different people, but I think you'd have a difficult time finding one person who believes what you just described.
What I want is, put simply, organizational transparency, personal privacy. I believe humans have a right to privacy, but I don't believe organizations have rights, period, and I believe radical transparency within an organization prevents organizations from trampling the rights of individuals.
Organizations in this case include corporations, governments, and nonprofits.
I'm interested in hearing your take on "organizational transparency". Like please push the concept / idea to its 'full' realization and tell me that picture, even if it implies a little bit of "sci-fi"¹.
Digging this because I think that domain / paradigm will see unparalleled evolution in the next few decades.
[1]: I mean, don't stop at current law / values / behaviors; like people from the 1940s wouldn't have dared speak about their idea of the 1970s because they'd think their belief "impossible". No flying cars though (Clarke-tech), because that's not a decision of the individual.
I don't think that looking too far ahead is useful: this is just a matter of pragmatics. Revolutionary change in a peaceful society happens via a long sequence of small, incremental changes, and that's a good thing, because you get to see how each of the changes plays out. I think the best sci-fi persuades you that it's looking at the distant future when in fact it's only using the future as a foil to provide deep insight into the present.
The short-term, the small, incremental changes I'd like to see are:
1. Reversal of the default privacy setting of government docs. Instead of documents being default-private and citizens having to make FOIA requests to make those documents public, documents should be default-public, and government workers should have to apply through an adversarial system (similar to courts) to classify documents, proving to a court why the document needs to be classified.
2. Classified documents should have a short (1 year max) timeframe after which they are declassified, or government workers should have to reapply to justify why the documents need to remain classified.
3. Political party documents should be public, without any provision for classifying them.
4. Tax-exempt organization documents should be public, without any provision for classifying them.
5. IPO'ed organization documents should be public, without any provision for classifying them.
6. Body cams on all police and military while on duty (when they are acting on behalf of an organization). 1 and 2 would apply to the footage from these cams as well.
7. Exceptions to 1-6 should be made for the personally-identifiable information of people who are not in the organization.
8. Organizations should be required to maintain a list of all the personally-identifiable information they have on a person (including employees), and provide that data to that person on demand by that person or their legal guardian, as well as a list of all people with whom that data has been shared, and be required to delete that information upon request by that person or their legal guardian.
9. Research which receives public funding should be forced to publish its results publicly.
10. All software which receives public funding should be forced to publish its source publicly.
11. Government documents should be published in open, machine-readable formats suitable for computer analysis (e.g. CSV, plain text, or some XML format--no PDFs).
1, 2, 6, 11 should be no-brainers if people were educated IMHO — but this is 1920 relative to electricity or cars; still a long way to go before the mainstream masses get it (which very much includes political figures). I would think 2030-2040 for the emergence of ethical consensus and concern (the kind that pervades political parties and social classes).
That is assuming the needle doesn't move too much farther in the authoritarian direction until then (the 20-year trend is really not looking that way currently).
3, 4, 5, 9, 10 are/would be met by strong opposition from interest groups, I'm sure you see that too. Everything I know about 3 tells me it's never going to happen with current parties / politicians. It's at least 1 generation away and I'm not sure the concept itself isn't utopian. 9 and 10 as well, I think it largely depends on the cultural paradigm (and this world's in 2020 is really not aligned with that, nor does it trend or even look that way). 4, 5 likewise, complex topics, lots and lots and lots of gatekeepers and lobbyists.
My take on these is they're very costly in terms of political capital; and they are largely debatable (politically, legally, philosophically, etc., you'll find passionate captains on both sides); thus there are 'better' (more consensual, with direct net positive effect) lower hanging fruits imho.
7 and 8 are hard problems, notably because of scale and the need for automation — it's part of a much bigger domain, automation of compliance and building "trustable" systems etc.; the kind that bridge or plane engineers must build, and probably software engineers too, but you know we're far from that if you read this forum.
I'd say 1 2 6 11 and 7 8 on the way to scale/automation already paint a whole different regime and degree of maturity for a 21st century State. I'd like to think we're now ~1 generation away from enactment of such norms.
What if the organization is one person in an LLC? Do they get rights? If so then a big company can hire a bunch of little LLCs to act as rights-having proxies for any task that requires them.
I'm going to assume you're asking in good faith and try to address the confusion here.
The human does get rights, the organization doesn't.
In some cases, believing that humans have rights and believing that organizations have rights might lead one to the same action. In those cases, I'd take the action. I wouldn't want to violate a human's rights out of some vindictive dislike of organizations: that's not the point. The point is that I'd take that action because I believe in human rights, not because I believe that the organization has rights.
And with organizational transparency: the entire point of organizational transparency is to protect human rights. In cases where organizational transparency would trample human rights to privacy, I would go with human rights every time. Violating the privacy of humans to achieve organizational transparency would defeat the entire purpose.
Let's say that individual humans have the right to keep secrets. Let's also say that they have the right to keep secrets with their associates, and to tell them to whom they please. Now, doesn't that make it legal for a group of people to keep secrets about you? What about selling them? I just don't see what doing away with the legal fiction of corporate personhood would do about Facebook.
It may not be your intent, but you're using some very vague, inapplicable terminology to make some screwed up behavior sound normal.
If you can tell secrets to who you please and sell them on the internet, they aren't secrets. Somewhere in the middle of what you're saying, the secrets stopped being secrets, but you kept using the word as if it still applied.
Facebook isn't a group of associates trading anecdotes about their friends: the server guy has never met Mark Zuckerberg, and they are not "associates" in any meaningful way. They're not friends, or even really allies: Facebook certainly has shown inconsistent concern for the well-being of its workers. So let's also drop the "associates" terminology: these aren't "associates", they're employers and employees. Employees aren't acting as individual humans on their own behalf, they're acting on behalf of an organization.
Putting aside the rights conversation for a second, let me ask you a question: if you tell your friend a secret in confidence, and they turn around and sell it to anyone on the internet who will pay a low fee, that would be pretty screwed up, no? We don't even have to talk about rights here: this is just screwed up behavior, regardless of the rights conversation.
Are contracts allowed in your worldview? Contracts must be signed by individuals, but when they act as representatives of a company, they are legally binding for that company. If they only have individual rights, then all individuals who didn't physically sign a contract cannot be held to it.
At a small scale, it makes sense for people to appoint another person whom they trust to represent them in negotiation. But in a lot of cases, that's not how representatives are chosen. Particularly in corporations, the leadership of a corporation was not chosen by the employees to represent them, and in fact often doesn't even have the best interest of the employees in mind. A lot of the largest problems in our society arise from this fact.
Consider the case of a company that agrees to sell to a larger corporation, under the condition that they lay off half their workforce in advance of the sale. Surely we can agree that the laid-off workers were not fairly represented by the person signing the contract.
One could argue that the workers agreed to give up some of their rights in their employment contract, but I'd argue that they did so under duress: their options are to sign the contract and work for the company, or to starve and let their families starve. Sure, they can go work for another company, but other companies will require them to similarly sign away their rights.
This shouldn't be taken as a recommendation to blithely break contract law. Corporations don't have rights, but they do have power, and it would be unwise to behave as if they can't make your life miserable if you decide to cross them.
It's a bad deal for corporations, but I do not care. Lack of liability is the cause of a ton of problems in our society.
Just to pick two stories of corporate sociopathy: Probably the reason people at State Farm are unconcerned about forging signatures[1] is that they know that the worst case scenario is that State Farm loses some business and maybe gets a fine: they are unlikely to go to jail for forgery or to have fines exacted from their personal bank accounts. Similarly, when Practice Fusion literally killed people[2], their execs had little to fear: nobody went to jail, nobody was fined: shareholders who had no visibility into the decision paid the fines.
When banks tanked the economy with irresponsible lending most were bailed out and gave their workers bonuses, while the people who were unable to pay their mortgages were ignored.
A little more liability for destructive behavior would be great for most people.
Forgery is criminal regardless of whether a private or official document is concerned. Even in a military setting, forgery of business-related documents is illegal.
> A little more liability for destructive behavior would be great for most people.
Why not full rights, full liability? Replace imprisonment and death by temporary and permanent suspension of company (including re-establishment of a sequel organisation out of a subset of stakeholders) respectively; and voila.
Making organization membership public would trample on personal privacy quite effectively in some respects, such as with disease support groups or PACs; medical privacy is taken seriously, but is there such a thing as political affiliation being private? Is it a violation of someone's privacy to reveal they give to the ACLU?
This is where pseudonymous identities make a lot of sense.
In a world where organizations are radically transparent and individuals have radical privacy, before you join the disease support group or donate to the ACLU, you already know the organization's records about that transaction will be public information.
You can restrict your activities to organizations that do not keep personally identifiable information. Or you could join the support group with a pseudonymous identity or donate cryptocurrency to the ACLU.
The entire point of organizational transparency is to prevent organizations from trampling the rights of individuals, so in cases where organizational transparency would trample the rights of individuals, the rights of individuals supersede the need for organizational transparency.
I'm gonna need some context there. To begin with you can't just donate to support a bill. What you probably mean is that he donated to an organization or politician who supported Prop 8, so say that.
The important detail is whether the funds were his personal funds, or whether they belonged to his company: i.e. was he acting on his own behalf, or on behalf of an organization?
It's an inane question, though, because the inane point you're trying to make is that individual privacy might protect homophobes. That's true, but it would also protect gays and allies who donated to oppose proposition 8. The fact that privacy allows people to secretly donate to political campaigns is the point. Part of the reason gay marriage is legal today is that people donated to people like Harvey Milk, at a time when donating to the campaign of a gay governor was risking your job and social standing.
Human rights still apply to humans who do bad things. If you are willing to give up human rights to fight bad people trying to do bad things, then those rights won't be there to protect good people trying to do good things, either.
So a person spends money to support an odious cause, and that person gets money by being associated with an organization. You hate the odious cause, so you want to avoid giving your money to support people who will then give money to promoting that odious cause. How does that work? Do you no longer have the right to not indirectly support things in your ideal system?
Since you're just openly ignoring the post you're "responding to", I'll just copy-paste my response to what you have just said, with some minor changes:
> So a person spends money to support an odious cause, and that person gets money by being associated with an organization. You hate the odious cause, so you want to avoid giving your money to support people who will then give money to promoting that odious cause. How does that work? Do you no longer have the right to not indirectly support things in your ideal system?
Yes, but it would also protect people who donated to a virtuous cause. The fact that privacy allows people to secretly donate to political campaigns is the point. Part of the reason virtuous causes have had any success at all is that people donated to support them, at a time when donating to those virtuous causes was risking your job and social standing.
Human rights still apply to humans who support odious causes. If you are willing to give up human rights to fight bad people trying to support odious causes, then those rights won't be there to protect good people trying to support virtuous causes, either.
> Since you're just openly ignoring the post you're "responding to", I'll just copy-paste my response to what you have just said, with some minor changes:
I'm attempting to clarify my question by removing irrelevant details, which it looked like you got hung up on last time.
> Human rights still apply to humans who support odious causes. If you are willing to give up human rights to fight bad people trying to support odious causes, then those rights won't be there to protect good people trying to support virtuous causes, either.
So you are willing to make it impossible for people to boycott organizations as a means of social change, as long as those organizations had an arms-length relationship with anything "political".
> So you are willing to make it impossible for people to boycott organizations as a means of social change, as long as those organizations had an arms-length relationship with anything "political".
No. Please try to respond to what I actually say instead of making stuff up; this is a straw man argument.
There are plenty of other ways we could find out about organizations supporting odious causes and boycott those organizations, without violating the privacy of their members. In fact, the whole point of "organizational transparency" is to make it hard to hide when organizations do bad things.
In addition to accusing me of saying things I didn't say, you're ignoring what I actually did say. Are you willing to make it impossible for people to privately donate to virtuous causes as a means of social change, when donating to those causes publicly would be a risk to their careers and reputations? I'm not going to continue this conversation further if you won't respond to this point.
> I don't believe organizations have rights, period
So a group of people, joining together in a common cause, don’t have rights as members of that group?
You are contradicting yourself. Organizations are simply groups of people with a shared cause. To deny rights to the organization, you necessarily have to deny personal rights. Example: John and Sam form an organization for comic book collecting. They have a secret meeting to agree on a price to offer a potential seller of a rare comic book. Since John and Sam, Inc. “have no rights,” the seller is allowed to sit in the room during their meeting. However this violates the right of Sam to freely associate with John and his right to not have to associate with the interloper. It also violates the privacy of Sam and John as individuals, since their private conversation — even as members of their two-man organization — is now being shared with anyone who wants to sit in since, in this scenario, their organization doesn’t have rights.
> So a group of people, joining together in a common cause, don’t have rights as members of that group?
Emphasis added. No, they don't. They have rights as individual people.
> [...] this violates the right of Sam to freely associate with John and his right to not have to associate with the interloper. It also violates the privacy of Sam and John as individuals [...]
Yes, that. "Sam" and "John" would have cause for civil (and probably criminal) action against the interloper. "John and Sam, Inc" has no say in the matter.
It might be useful to grant John and Sam, Inc the privilege to own property, but even that isn't actually a right except insofar as Sam and John have a right not to have the value of their assets (i.e. 50% ownership of John and Sam, Inc - functioning as a proxy for ownership of various comics) actively sabotaged/vandalized.
> Organizations are simply groups of people with a shared cause.
No, they're not. As soon as you have two people in an organization, you've got two different causes. The shared elements of those causes allow them to collaborate, but each of them has slightly different views of what they're working toward. And even when you have a very small, well-specified goal, every individual in the organization has different levels of investment, and other values and boundaries they're not willing to cross to achieve that goal. And the larger organizations get, the wider the variety of disparate goals that can occur within the organization, because individuals in the organization may not even interact directly with one another.
Example: John and Sam work for Facebook. John and Sam want to make money to feed themselves and their families, and don't want to have PTSD. But Mark Zuckerberg wants John and Sam to look at an endless stream of horrific PTSD-inducing images so that he can maintain the reputation of the Facebook platform and get incredibly rich. Where is the shared cause here, exactly?
> To deny rights to the organization, you necessarily have to deny personal rights.
Just because an organization doesn't have rights doesn't mean we have to go out of our way to take away their rights.
> Example: John and Sam form an organization for comic book collecting. They have a secret meeting to agree on a price to offer a potential seller of a rare comic book. Since John and Sam, Inc. “have no rights,” the seller is allowed to sit in the room during their meeting. However this violates the right of Sam to freely associate with John and his right to not have to associate with the interloper. It also violates the privacy of Sam and John as individuals since their private conversation — even as members of their two-man organization — is now being shared with anyone who wants to sit in since, in this scenario, their organization doesn’t have rights.
It sounds like you've figured out that John and Sam, as individuals, each have a right to have a private conversation. John and Sam, Inc. doesn't have rights, but that doesn't suddenly remove John and Sam's individual rights.
The entire point of organizational transparency is to protect the rights of individuals, so obviously if organizational transparency would violate the rights of individuals, the individual rights to privacy supersede the need for organizational transparency.
It's telling that your example of an organization has two people in it. At a small scale, organizations tend to protect the rights of the individuals in the organization fairly well. It's at larger scales that the non-rights of an organization come in conflict more often with the rights of individuals.
You see this inconsistency in individual people all the time. Like how almost everyone wants to be found easily on LinkedIn, Twitter, etc, so they add a public profile photo. Then they freak out because someone starts scraping these photos to build a facial recognition model.
One of my clients is involved in property tax collection and reporting. Property tax records are public info, and their website allows looking up the records for any property without a login. However, the data behind this website is the _source_ of the public records, not the public records themselves (which would be local government databases).
For years now we've been in an arms race with someone using a botnet to scrape all of the account information for a particular county. My client doesn't care so much about the data; it's the server load that's a problem. Normal activity for this site is a few dozen account searches per minute, but when the botnet gets through our blockade it sends hundreds of search requests per second, overwhelming the site. The operator of the botnet has NEVER tried to contact my client to ask for an efficient API to access the data, which they'd probably provide for a minimal fee.
I work on a fully open data repository. The website has the API linked in 3 places, so when I find inappropriate scraping I block it with "HTTP 420 ... see <API link> or contact <email>".
Some people probably switch to using the API, but no-one has ever contacted us. They either give up, or run their scraper on a different computer -- I've seen the same scraper move between university computers, departments, then (in the evening) to a consumer broadband IP.
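The "block over a threshold and point at the API" approach described above can be sketched as a small sliding-window rate limiter (a minimal illustration, not the repository's actual code; the class name, limits, and API URL are all hypothetical — and note the site above used HTTP 420, while 429 "Too Many Requests" is the standard code):

```python
import time
from collections import defaultdict, deque

API_URL = "https://example.org/api"  # hypothetical documented API endpoint


class ScraperGate:
    """Per-client sliding-window rate limiter that answers excess
    traffic with an error status plus a pointer to the real API."""

    def __init__(self, limit=60, window=60.0):
        self.limit = limit      # max requests allowed per window
        self.window = window    # window length in seconds
        self.hits = defaultdict(deque)

    def check(self, client_ip, now=None):
        """Record one request; return (status_code, extra_headers)."""
        now = time.monotonic() if now is None else now
        q = self.hits[client_ip]
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        q.append(now)
        if len(q) > self.limit:
            # Over budget: refuse, and advertise the API instead.
            return 429, {
                "Retry-After": str(int(self.window)),
                "Link": f"<{API_URL}>; rel=\"service-doc\"",
            }
        return 200, {}
```

A real deployment would sit behind the web server and would also need the "contact <email>" escalation path the comment mentions; the point here is only that the block response itself carries the pointer to the better interface.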
I really don't understand why anyone would bother writing and using a web scraper when an API exists. Does the API not provide all the same data/functions as the website? Scrapers are a big PITA compared to just using an API: they're much harder to write to be reliable, and they can break at any time, whenever the site makes even the smallest change. APIs avoid all that mess, and make performance far better too (on both sides), since you're only downloading the data you want, not a ton of Javascript and HTML that you don't.
APIs are often not as complete as the web interface, since the customer sees the web interface and normally the customer is what drives the revenue model of the company.
If pages are driven via an API, then the API is preferable, but publicly facing websites are often a mix of server-side HTML generation and API enrichment, for caching if nothing else.
In that case it seems that the webmasters complaining about scraping need to make sure their APIs actually provide access to all the same data, if they want people to use the APIs instead of scraping.
If the scraper contacted the client, said what they need the data for, and (probably) paid for api access, then my client would probably go for it.
My client is under no obligation to make access to this data easier. It's not really their data either; the information is property addresses, owner names and addresses, and tax assessments and payments. My client wouldn't want to make it easier for scammers to get that data. So they're not going to do anything unless they know the scraper is legit. If that's the case, the API would require authentication, and any fees would be for the server load, not the data.
For what purpose? That’s like suggesting that if people keep jumping your fence and trampling your roses because it’s a shortcut to a public park (in this case, the county records office) that already has public access roads, that you should be obliged to build a sidewalk through your garden, at your own expense, when the real answer should be that the public road should be improved.
If your goal is to get your roses to stop being trampled, it's probably easier to install a few pavers than to spend years petitioning to get a road built.
The ideal answer and the efficient answer are not usually the same.
Yeah, the real problem with scraping is that it's often done very haphazardly and bluntly. Sometimes it's very difficult to tell the difference between a scraper and someone trying to DOS your site.
I'm in a similar job. We block people from scraping if they break a threshold, but we also refer them to the reporting system, which can get all of the information that they are collecting in a variety of formats.
I wonder if something like this would be allowed: if all the public information was available in a well-collated format, then can scrapers be blocked? I imagine that will eventually be fought in court as well.
We did figure out who the scraper probably is, but only after several years. For a long time they used an untraceable botnet, but after blocking that they eventually switched to a corporate network we traced to a data aggregation company. But we don't know for sure who's doing the scraping; it could be the company, a rogue employee, or a botnet that got loose on their network.
At work we have all of our data available publicly as easy-to-parse XML files, but no matter what we do the bot owners refuse to use it. They'd rather hammer our search engine with sequential searches instead.
We've done that, but it's tough to rate-limit a botnet because of the ip address spread. Also, their crappy scraper software doesn't even bother to check if requests are successful; it spews them just as fast no matter how our site responds.
No. The botnets work across multiple regions on multiple cloud providers - that's how they achieve such high throughput. For any single IP address, the load is reasonable, but for the whole botnet it's absurd.
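That blind spot can be shown with a toy model (all class names and numbers are hypothetical): a per-IP limiter happily admits a thousand bots sending one request each, while a global budget on the expensive endpoint trips on the aggregate load.

```python
from collections import defaultdict


class PerIpLimiter:
    """Allows each IP up to `per_ip_limit` requests.
    (Per-window counter reset is omitted for brevity.)"""

    def __init__(self, per_ip_limit):
        self.per_ip_limit = per_ip_limit
        self.counts = defaultdict(int)

    def allow(self, ip):
        self.counts[ip] += 1
        return self.counts[ip] <= self.per_ip_limit


class GlobalBudget:
    """Caps total traffic to an expensive endpoint regardless of source IP."""

    def __init__(self, budget):
        self.budget = budget
        self.used = 0

    def allow(self, ip):
        self.used += 1
        return self.used <= self.budget


# A 1000-node botnet sending one request each: the per-IP limiter sees
# nothing wrong, but the global budget refuses everything past request 100.
botnet = [f"10.{i // 256}.{i % 256}.1" for i in range(1000)]
per_ip = PerIpLimiter(per_ip_limit=10)
global_budget = GlobalBudget(budget=100)
assert all(per_ip.allow(ip) for ip in botnet)            # every bot admitted
assert sum(global_budget.allow(ip) for ip in botnet) == 100
```

The trade-off, of course, is that a global cap throttles legitimate users along with the botnet, which is why in practice it would only make sense on the expensive endpoints (like the search engine being hammered here).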
Currently bot traffic accounts for 2/3 of my load, meaning that the cost of providing my service is 3x what it would be without these persistent bots.
Just put an option to download the raw CSVs, buried somewhere. Anyone putting in the effort to write a scraper bot will find that link, and save your server the load.
People don't want their data to be public. People want other people's data to be public. One's own data everyone thinks should be private and tightly controlled. This applies to people and businesses equally.
In this case, LinkedIn users kind of do want their “public profiles” to be public. They’re online CVs; by definition, if you make one, your goal is to get it into the hands of anyone who asks for it!
LinkedIn, likewise, has built its business model on an implicit contract with its users that it’s going to show their CV to anyone who asks for it.
I think LinkedIn users would be surprised that LinkedIn doesn’t let bots read (scrape) their public CV. A CV (individually, rather than in aggregate) is ultimately useful for only one thing: marketing the CV’s author’s skills. Why wouldn’t I want my marketable skills scraped into some private “talent matchmaking” agency’s databases, such that someone could find me—and hire me—when I show up as a result of some fancy OLAP query they paid that agency to run on their scraped data? It’s more roundabout than them just finding my CV on LinkedIn, but I’m still glad they found it!
>I think LinkedIn users would be surprised that LinkedIn doesn’t let bots read (scrape) their public CV.
LinkedIn is really really clear that:
- They won't share your information with 3rd parties
- You're not allowed to use information on LinkedIn for commercial purposes without their permission
- Other users can view your personal data
So, why would I expect random third party companies to be able to scrape and sell my personal information?
My personal information is there for the individual use of others, and for authorised use by recruiters (who are vetted/managed by LinkedIn).
I've chased down the convention spam mail I get using my GDPR rights, and surprise surprise, they got my details by scraping LinkedIn. That is absolutely neither expected nor acceptable use of my data...
There’s a difference between “my information” and “the public webpage that I went through a publishing workflow to create from a curated selection of my information.”
Let me put it this way: if I have a Wordpress blog, I’d certainly be miffed if Wordpress let bots see my drafts... but I’d also be miffed if Wordpress didn’t let bots (Google, for one!) see the published blog itself. It’s a blog; a public website! Anyone or anything with the URL is supposed to be able to retrieve the page! It’s not “my information” any more†; it’s been broadcast!
† You might want to mentally analogize to copyright, but I don’t think it’s the right model for the intuition people have here. Instead, try mentally analogizing to confidentiality. When a classified document is published in the public sphere (e.g. as evidence in a trial, as testimony before Congress, etc.), this forcibly declassifies it. No matter how much the originator of the document might want to still keep it a secret, the legal protections of confidentiality don’t apply to it any more: it’s out there now. Anyone who reads it could plausibly have just read the public-sphere copy, so there’s no longer any way to charge people who have knowledge of the previously-classified information with any crime.
> Would you be OK with a company scraping your blog and selling it?
Selling it how? If they put my blog posts in a book and try to sell that book, that’s copyright infringement. If they put my blog posts in an ML model corpus to train a translation service, and they then charge pay-per-use access to the resulting service... I don’t think I’d care, nor do I think there’s anything morally or legally wrong with that. If they scrape my name and phone number and generate a Yellow-Pages-like index from them? That’s explicitly allowed by law; and heck, that’s why I embedded the information onto my site in vCard microformat in the first place!
To put my philosophy succinctly: if web.archive.org can scrape your data without you having an explicit relationship with them granting them that right, then bad.evil.com can too. You can allow both (= publicizing your information), or neither (= protecting your information), but you can’t allow one but not the other. “Third parties you don’t have a relationship with, who access your data through the public sphere without entering into a specific licensing arrangement with you” are legally one big amorphous blob. You can’t make a law that splits that blob up, because it’s an opaque blob; in the ACL system that is contract law, all entities you don’t have contracts with are just one entity—“the public.” If you want some specific entities to have access to your information, that’s what protecting your data (= setting an ACL “the public = disallow”) and then explicitly licensing it out by entering into contracts (= setting an ACL “entity X = allow”) is for.
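The ACL framing above can be sketched as a toy model (all names and the `DataAcl` class are hypothetical illustrations, not any real API): the key property is that every entity you have no contract with resolves to the single "public" principal, so archive.org and bad.evil.com are indistinguishable.

```python
# Toy model of "contract law as an ACL system": entities without an
# explicit contract all collapse into one principal, PUBLIC.
PUBLIC = "public"


class DataAcl:
    def __init__(self, default_public=False):
        # The one amorphous blob: a single rule covering everyone
        # you have no contract with.
        self.rules = {PUBLIC: default_public}

    def license_to(self, entity, allowed=True):
        """An explicit contract: a per-entity ACL entry."""
        self.rules[entity] = allowed

    def can_access(self, entity):
        # No specific contract -> the entity is just "the public".
        return self.rules.get(entity, self.rules[PUBLIC])


# Published data: allowing web.archive.org necessarily allows bad.evil.com.
blog = DataAcl(default_public=True)
assert blog.can_access("web.archive.org") and blog.can_access("bad.evil.com")

# Protected data: deny the public, then license out specific entities.
private = DataAcl(default_public=False)
private.license_to("recruiter-x")
assert private.can_access("recruiter-x")
assert not private.can_access("bad.evil.com")
```

The model makes the comment's claim concrete: there is no rule you can write against `PUBLIC` that admits one uncontracted stranger while excluding another, because the ACL has no way to tell them apart.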
Why can't I have terms on my website that say how you can use my information?
Examples where this is allowed:
- Images/media (Creative commons)
- Code (Open source licenses)
You say it isn't allowed for:
- Personal data
Unless I'm misunderstanding your philosophy (which seems to say copyright is OK, but public information must be public to all): You believe that it's morally OK for me to prevent a company selling my book, but not morally OK for me to prevent a company selling my name, job title and employer as a marketing bundle?
Edit: An aside, it's really confusing that you seem to be editing your previous replies minutes after I responded. I thought HN only let users edit during the "no replies" period?
>Why can't I have terms on my website that say how you can use my information?
>- Personal data
So there's a couple of things in play here. You can't (generally) copyright facts - "Cthalupa is a Rocket Surgeon for the Space Force since 2001", if true, would not be something that I could get a copyright on.
The second thing is that terms have to be agreed upon by both parties. If you give me information without us coming to an agreement on terms, I can't be bound by them. If you just put a link to a TOS on your website and don't require people agree to it before giving them access to data on your website, we did not enter into a contractual agreement.
> Why can't I have terms on my website that say how you can use my information?
Neither Creative Commons nor copyleft (nor copyright in general!) can assert anything about private use. IP rights are commercial rights; they affect sellers of your IP. They don’t affect end-consumers of your IP.
Note that even the GPL can’t force someone to publish the source of their GPLed-library-containing program, if they never publish the program itself, but only build it for their own private use.
Why? Because, by broadcasting the code of your GPLed library, you granted people an implicit use-right to it! Not a redistribution right; not a derivative-works right; but a use right. (If this wasn’t true, then people would be breaking the law by reading “common” newspapers in a cafe, or by listening to the radio, since they never entered into any explicit contract with the distributor/broadcaster.)
How does software licensing work, then? Mostly by 1. companies installing software on computers for their employees to use being considered IP redistributors; and 2. attachment of copyright through sampling when asset samples [e.g. brushes/textures in Photoshop] are distributed through the program. Other than that, there’s really no law forcing end-users to pay for software licenses. This is why e.g. WinRAR would never have been able to sue anybody. They published their shareware binary (without gating it behind a contractual relationship, like Adobe’s Creative Cloud installer); so now you have a use-right to it!
> You believe that it's morally OK for me to prevent a company selling my book, but not morally OK for me to prevent a company selling my phone number?
Copyright exists because your ability to make money from your own creative works hinges on your ability to exclusively license those works. If a publisher can get a redistribution license to your manuscript for free from a third party, why would they buy it from you?
You having exclusive access to your phone number does not make you money; others having access to your phone number does not deprive you of money you could have made by keeping that information private. Thus, there’s no advantage to introducing IP law into this domain (the domain of facts.)
There was a recent court case about someone creating a subway map by copying the raw data from existing subway maps, where the comments went deeper into this.
Right, sure, I hadn't really thought about how copyright isn't really enforceable against individuals. That's very interesting.
However, I'm really not sure how this is relevant to your moral stance on commercial use of "public" personal information.
Why do you believe it's reasonable to prevent unauthorized commercial exploitation of creative works, but not unauthorized commercial exploitation of personal information?
The former simply affects the small percentage of people who sell their works.
The latter affects the vast majority of the population who receive targeted spam, have their information collated and sold for profiling, are victims of identity fraud when those databases are inevitably leaked, etc.
For what it's worth, as I mentioned in my first comment, the GDPR absolutely gives me rights to control how my personal information is used. And the GDPR has a near total exemption for individual use.
What benefits do you see of commercial use against the wishes of the person that published it that outweigh the risks? (making money isn't a benefit)
I think you got the wrong idea if you were thinking I was saying copyright isn’t enforceable “against individuals.”
My example of copyleft was specifically about the thing the Affero GPL tries to avoid (to unknown success): the possibility of someone using GPLed libraries to set up a commercial web service. Because they never release the binary, but only have people interact with it over the Internet, there’s no derivative work being made available in the commercial domain. So copyright doesn’t apply. Even though you’re a company making money off GPLed libraries!
I have a linkedin so that I can point people at it. I also want human recruiters who have actually read the thing to send me relevant jobs. If my profile ended up affecting my credit report, I'd be pissed. I expect you would be too.
People put data places for specific purposes (to show recruiters) and want the ability to limit use to that purpose. How that's accomplished is just a technicality most people don't care about.
Not so sure about that. Messages on LinkedIn are mediated in a single place, and you have a measure of control over how your profile shows up in searches. If your CV is scraped, you could end up anywhere and now you're getting recruiter spam from all over when you're not interested.
> If your CV is scraped, you could end up anywhere and now you're getting recruiter spam from all over when you're not interested.
I would point out that this is still possible (even probable!) without any bots being involved at all. Back before CVs were online, humans working for recruitment agencies would “scrape” information from local, physical job boards by hand into their company’s databases (where “database” here could just mean a filing cabinet.)
IMHO, the real solution to that is a spam filter (or an “agent”, in the old world.) Just because a lot of people want to talk to you, and most of them aren’t very interesting, doesn’t mean they need to be prevented from accessing you—they just need to be prioritized by interesting-ness, which is something you can do yourself, or hire a service to do for you.
I think the GP in this context means, e.g. LinkedIn wants their information to be public in the cases where it benefits them as a business. But then they want it to not be public when it doesn't benefit them. There is no such thing as "public information, except ..." - information is public, or it's not. If none of LinkedIn's data was public, they would have a much harder time getting people to sign up, and having as many users signed up as possible is part of their business model.
From a copyright perspective (since that's what LinkedIn's lawyers claimed): imagine if a newspaper sued another newspaper, saying that - not just the content of its paper - the information in the newspaper was copyrighted and could not be accessed by "unauthorized" third party companies. Either you print it, or you don't!
I want to be able to use LinkedIn to network with colleagues and people in my industry. If someone wants to scrape my profile to make a report on industry trends, I’m fine with it. What I don’t want is hiQ vacuuming up my data so they can snitch to my employer if they think I’m job hunting.
How is this a paradox? Tech — the web in particular — is supposed to be an equalizing force, but HiQ is clearly trying to give my employer more power over me. We are an industry that prides itself on solving difficult problems — how is our response here to just throw up our hands and say “it’s all or nothing”?
Wow, I just looked up what hiQ does and have to say it's pretty scummy in my opinion. Why do people create stuff like this? Don't they know it will likely come back to bite them one day?
For reference:
"There is more information about your employees outside the walls of your organization than inside it. hiQ curates and leverages this public data to drive employee-positive actions.
Our machine learning-based SaaS platform provides flight risks and skill footprints of enterprise organizations, allowing HR teams to make better, more reliable people decisions."
The thing is, GDPR has theoretically solved this in the EU. The UK's ICO is about to publish guidance prohibiting scraping public user information for marketing (where the user would not expect it to be used for that).
It's a really easy solution, because companies need to prove how they got your data when asked.
When you track the source of the mailing list you're getting spam from and they say "We scraped it from LinkedIn", they get fined.
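The "prove how you got the data when asked" duty only works if provenance is stored alongside every record. As a minimal sketch of what that might look like (the `ContactRecord` shape and field names are illustrative, not from any real compliance product):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class ContactRecord:
    """A marketing contact plus the provenance you'd need to disclose."""
    email: str
    source: str          # e.g. "direct signup" or "scraped from example.com"
    collected_at: datetime
    lawful_basis: str    # e.g. "consent", "legitimate interest"

def provenance_report(records, email):
    """Answer a data-subject request: where did you get my data?"""
    return [
        (r.source, r.collected_at.isoformat(), r.lawful_basis)
        for r in records
        if r.email == email
    ]

contacts = [
    ContactRecord("a@example.com", "direct signup",
                  datetime(2020, 1, 5, tzinfo=timezone.utc), "consent"),
]
```

If the report for a complaining user comes back "scraped from LinkedIn" with no valid lawful basis, that's the fine.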
We haven't solved that, in the same way we haven't "solved" encryption having a good-people-only door despite spook tantrums. There fundamentally isn't a possible mechanism, and really wanting one doesn't change that.
It is a result of equality - not of outcome, but of rules. "Open for everyone except those whose applications you don't like" isn't open. On a technical level, trying to prevent it is like the "evil bit" as a solution to malware.
Of course there are possible mechanisms. There are heuristics to detect bots. The whole reason for this lawsuit is that LinkedIn blocked hiQ from scraping their website.
I'm also not necessarily talking about a technical defense against unwanted scraping. Write a law makes it illegal to do something like "scraping personally identifiable information and storing or presenting it non–anonymized", and prosecute companies who break it. I'm sure there are loopholes in that particular example, but the point is we can absolutely add shades of gray here.
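For what a "heuristic to detect bots" might look like in the simplest case, here is a toy sketch combining a declared user-agent check with a request-rate threshold (the keyword list and the 120-requests-per-minute cutoff are made up for illustration; real detection is far more involved):

```python
def looks_like_bot(user_agent: str, requests_last_minute: int) -> bool:
    """Toy heuristic: flag a client by its declared user agent or by
    its request rate. Keywords and threshold are illustrative only."""
    bot_keywords = ("bot", "crawler", "spider", "scrapy", "curl")
    ua = user_agent.lower()
    if any(keyword in ua for keyword in bot_keywords):
        return True
    # Sustained rates far above human browsing speed are suspicious.
    return requests_last_minute > 120
```

Of course, a scraper that lies about its user agent and throttles itself sails straight through, which is exactly why this becomes an arms race.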
> Open for everyone but those whose applications you don't like isn't open.
Openness should be a means, not an end. If we make something "not open" but it prevents 95% of undesirable uses and only 5% of desirable ones, is that not a tradeoff worth discussing?
"public" is not the right concept here I think. E.g. imagine a composer conducting a public airing of some work of music (e.g. on some festival). That you were able to hear the music in public doesn't mean the {composer,artists,...} give up their copyrights.
I think copyright as you mention here is the right concept, or at least a lot closer. In particular, the limits on copyright. If someone is reciting a list of facts in public, they can’t expect people not to record those facts, because copyright doesn’t apply to that. Reciting the list in public using computers shouldn’t change that.
Agree with this, and I would also add:
- I want to be in control of my data and change the setting.
- I want to be able to delete my data.
If scraping on LinkedIn is banned (and LinkedIn enforces it), then I do have control of my data, since I can change the setting and it will no longer be public. (It's not perfect, since some may have already scraped it, but the extent would be much smaller.) Also, if I decide to delete my data, LinkedIn can do that for data it controls, but not for scraped copies.
Scraping information is not the same as posting. There are a number of bots that scrape Hacker News and people here generally consider them pretty cool.
Isn't that a different paradigm, though? A posting bot set loose on a forum/platform will (normally) degrade service in a far more visible and impactful way than a scraping bot. And in either case, writing (and running!) a bot that posts on HN is not illegal behaviour in itself.
I'm on Hackernews to see interesting articles and read interesting conversation. If a bot can post interesting articles and make interesting conversation, I'm not sure I care that it's a bot. And if a human can't do those things, I'm not sure I care whether or not they're 'real'.
That focus on "we don't care if you self-promote, we don't care why you're here, we just want you to be a good citizen" is part of why I like HN.
It's not completely black and white, but in general I believe that users online have the Right to Delegate[0]. That right should only be legally taken away if there's a really, unbelievably compelling social justification for doing so. I am pretty skeptical that banning web scraping has that kind of justification.
I don't really care if the comments are from bots, per se. I care if they are quality comments or not. Whether or not the comments are from bots is just a proxy for whether or not they are actually good.
Humans cannot directly access websites - a machine is always involved. I know perfectly well what you mean but the distinction on a deeper level is fundamentally imaginary.
The best is some sort of heuristic like captchas and even they can be outsourced so that the human doing them isn't actually viewing the content.
The thing that bothers people about a bot is its behavior anyway. A human acting bot-like would get people just as upset.
Yeah, I have nothing at all against scraping per se; it's more about the huge bot traffic the commercial scrapers generate. Even that would be fine, except that 1) it can be hard to tell scrapers from malicious DDoS bots, and 2) the person being scraped literally pays for that scraping traffic.
Yeah, but where do you draw the line, legally, between 'oh sorry, we dropped your request because of rate limiting' (or whatever mitigation strategy) and 'oh, we dropped your request because u scraping us bro'? IANAL, but this lawsuit seems to indicate that putting barriers in front of scraping attempts is a no-no.
There's plenty of grey here. For example, scrapers that try to check people in for flights to get better seats - some even tried to charge for that. That creates problems: some customers benefit at the expense of others, high load on a "locking type" piece of code, etc. Similar for ticket sales for concerts, and probably other spaces.
There are also companies that provide added value by compiling and correlating "public info" in a useful way that creates value. If Google let me scrape their search and remove ads, it would be popular, but is it "legal"? Or maybe Google Maps?
I would think, and of course could be wrong, it would be as legal as Google scraping all of the web sites that they do in order to create their search engine in the first place. In particular, Google provides cached versions of web pages. That's pretty hardcore scraping.
Another problem with web scraping is a B2B website offering services for B2C companies to better reach consumers. It can be a tricky thing to do without basically giving your clients list to all your competitors.
Sure. My data is still my data, and if I publish it on my platform for free, that still shouldn't automatically give you the right to copy the data and provide it on your platform.
It's basically the same as a TV broadcasting a film for free, and then going after you legally if you recorded that film and uploaded it to your website.
This does not legalize theft, it says sites cannot respond to suspected scrapers differently than they respond to non-suspected scrapers. You can still rate limit, as long as you do it universally to all site users
Copyright law is unchanged. If someone scrapes your blog and then re-uses your posts on their own blog, you still have possible copyright infringement claim
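"Rate limit, as long as you do it universally" is mechanically straightforward: give every client the same limiter with the same parameters. A minimal token-bucket sketch (the rate and capacity numbers are arbitrary):

```python
import time

class TokenBucket:
    """Minimal token-bucket limiter. Applying one per client, with
    identical parameters for everyone, is rate limiting that does
    not single any particular visitor out."""
    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A human browser never notices a generous bucket; a scraper hammering the site drains it immediately - no client-specific discrimination required.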
This is a bad analogy, as scraped but copyrighted works are still protected by copyright. What's in question is whether you should have exclusive rights to information you have shared but did not copyright. Some people seem to think this non-copyrighted data should also be protected even if it's munged and added to an original work by the scraper.
Billions are being made collecting data and content for free and publishing it along with targeted advertisements. Web scrapers can collect that aggregated data and redistribute it or create competing services. This is terrifying to certain tech giants and threatens their moats and lock-in/network advantages. They will lobby hard against it with PR campaigns citing everything from security, privacy, and copyright to all sorts of other exaggerated bogeymen. At the end of the day, though, it's mostly about preserving their monopolies.
I can understand why some do not want scrapers - increased traffic (with practically zero benefits to the owners) is one obvious reason.
(Some people will then say "But why not just offer APIs", but that's a lot of extra work and maintenance).
It's like with instagram and other social media platforms. The content creators put in the hard work, while the leeches are stealing content for their own benefit, giving zero credits to the original content creators.
Let's not conflate a) a person's personal data, and b) a business's dataset. The GP and the article are clearly referring to the latter. Preventing web scraping won't protect users from businesses collecting their data.
What I'm trying to say is: For every popular content creator on IG, there are tens and hundreds of (more or less automatic) content curators that do nothing more than scrape content with lots of likes, and re-post on their own channels. Then when they get sufficient followers, they make money through paid product placements, account flipping, pay-to-play sharing, and what not. More often than not, there's no linking to the original pages / creators.
People want their data to be public, and all of the benefits that come with public data, but then they want to choose who gets to see it.
By "people" do you mean businesses or actual people? Because I don't think people want everything to be public; many in fact use various networks to avoid oversharing, and even then many don't want their old bosses or exes looking at their profiles - there just don't exist tools to limit access that granularly.
I don't think the direction would ever be clear, even if the legality were clearly established. The arms race would intensify, and detecting/blocking/deceiving scrapers might become a lucrative field.
Companies want to provide some information to some people; but providing all information to all people is analogous to allowing customers to make a meal of free food samples, on a recurring basis.
> The arms race would intensify, and detecting/blocking/deceiving scrapers might become a lucrative field.
It already is. There are entire companies, like Distil Networks, who exist solely to protect companies from bots/scrapers/etc. Actually, looks like Distil got acquired and are now part of Imperva, but anyway, the idea is the same. This is definitely an existing field.
Disclosure: former Distil employee, but I have no financial stake in this discussion, and have mixed feelings about scraping. Clearly it can be beneficial in some situations, but when I think about having to pay exorbitant prices to scalpers for tickets to an event, because they used a bot to buy up all the tickets, that is less appealing.
> People want their data to be public and all of the benefits that comes with public data but then they want to chose who gets to see it
Can you really not imagine a world where a person accidentally or in poor judgment uploads something private to their own site (their real name, home address, credit card#, or any piece of highly damaging information that could cost them their careers) and wishes for it to be removed? (but can't because many of these scraping sites never respond to takedown requests)
People make mistakes and post things they shouldn't. A mistake from someone many years in the past that they've made amends for shouldn't haunt someone for the rest of their lives.
But it does when we decide that every single line of text ever uttered online must be preserved and easily accessible by anyone for all eternity.
Blocking scrapers is an arms-race escalation because these sites refuse to remove content, and it's used as a tool for character assassination by bad actors. It's a proactive defense.
It sounds like the same old issue: a lack of understanding of the fundamentals of the underlying technologies.
Otherwise they would realize that what they demand is contradictory and incoherent, like demanding to be both viewable by all and not viewable. DRM is one fundamental example of this.
They want their data to be public for a specific use; I feel like that's pretty easy to understand.
LinkedIn: they want their info to be public -> to get jobs
The title of this post is misleading. The Court's decision related to hiQ's attempt to obtain a preliminary injunction. It's clearly an initial victory for hiQ, in that the Court affirmed the grant of an injunction based on a significant likelihood that hiQ would ultimately prevail and would suffer irreparable damage if the injunction were not granted. However, the Court never actually reached the merits of the case and, accordingly, stating that the case has precedential value is misleading. The Court itself noted:
"At this preliminary injunction stage, we do not resolve the
companies’ legal dispute definitively, nor do we address all
the claims and defenses they have pleaded in the district
court. Instead, we focus on whether hiQ has raised serious
questions on the merits of the factual and legal issues
presented to us, as well as on the other requisites for
preliminary relief."
German copyright has the concept of a "Datenbankwerk" (since the 90s).
E.g. the telephone book contains lots of boring facts that are each in themselves not copyrightable. However the collection in itself is copyrightable, as it required substantial effort to create.
It seems odd that US copyright law wouldn't have a similar provision, or that it doesn't apply here?
Also known as "sweat of the brow." The US Supreme Court rejected such protections in the 90s. I'm sure it's complicated, but copyrights are not presumed.
It does, and the result is that phone books and maps get fictional entries inserted in order to prove copying -- because it is perfectly legal to do your own work to amass the same data set.
Like a paper town, or that time Genius caught Google lifting their lyrics: they wrote a tool that watermarked songs by interchanging straight and curly apostrophes on the site. It was brilliant.
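The general trick is easy to sketch: hide a bit pattern in which apostrophe glyph each quote mark uses. (This is a toy reconstruction of the idea, not Genius's actual tool, which reportedly encoded a Morse-code message.)

```python
STRAIGHT, CURLY = "'", "\u2019"  # ASCII apostrophe vs U+2019

def watermark(text: str, bits: str) -> str:
    """Rewrite successive apostrophes as straight (0) or curly (1)
    to smuggle a bit pattern into otherwise-identical text."""
    out, i = [], 0
    for ch in text:
        if ch in (STRAIGHT, CURLY) and i < len(bits):
            out.append(CURLY if bits[i] == "1" else STRAIGHT)
            i += 1
        else:
            out.append(ch)
    return "".join(out)

def extract(text: str) -> str:
    """Recover the bit pattern from a suspected copy."""
    return "".join("1" if ch == CURLY else "0"
                   for ch in text if ch in (STRAIGHT, CURLY))
```

The watermarked text renders identically to most readers, but a verbatim copy carries the pattern along with it.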
If you scroll down to the "Legal Actions" section on Wikipedia, you'll find that these copyright traps have generally failed to serve their purpose when attempted in courts.
There is a concept in US copyright known as "thin" copyright. Collections of uncopyrightable information can in fact be copyrighted, but the arrangement of the information must have some spark or minimal creative energy. Mere "sweat of the brow" is insufficient to confer copyrightability on the work.
Such "thin" copyright tends to mean that there is a strong presumption against infringement. You generally need to demonstrate that the work has been copied virtually in its entirety to find infringement; partial borrowing is insufficient.
IANAL, but I think the US does have something similar. In the US, you can take works from the public domain, and then perform some sort of work, ex: restoration of a film, and then copyright that work. The difference being that someone else can use/release their own version of the same content, but they cannot use the work you did just because the content comes from the public domain.
But I don't think that's relevant here, as a) this isn't a copyright case and b) HiQ are not attempting to recreate the entire "compilation" of LinkedIn
I'm pretty sure the US has the same thing, basically. You can't copyright facts, but you can copyright presentation. So someone can't copy your map directly and sell it, because there's an artistic component to it. But they can make their own map with the same data, with a different style to it, and sell that.
> Now many site owners are trying to put technical obstacles to competitors who completely copy their information that is not protected by copyright. For example, ticket prices, product lots, open user profiles, and so on. Some sites consider this information “their own”, and consider web scraping as “theft”. Legally, this is not the case, which is now officially enshrined in the US.
Does this mean we can now scrape e.g. YouTube videos, Amazon reviews, IMDB reviews, Facebook events ... ?
Yes, you can scrape them; no, you cannot republish them. Everything you listed is protected by copyright, and this ruling doesn't let you infringe on copyrights.
> hiQ argued that LinkedIn's technical measures to block web scraping interfere with hiQ's contracts with its own customers who rely on this data. In legal jargon, this is called "malicious interference with a contract", which is prohibited by American law
Does this mean that Google's random recaptcha check is interference?
I think any ruling that says LinkedIn can't put in protectionary measures against automated requests is doomed to be overturned, as long as they're not doing it discriminately. Captcha, rate limiting, user agent testing, etc are all common tools to protect against malicious/unintentional denials of service. The question is what was LinkedIn doing, and did it specifically target hiQ while permitting others of the same class of traffic.
Why would it be an issue if it is discriminatory? LinkedIn can use its servers any way it likes, unless it has promised its users that their data can be scraped indiscriminately.
See, that's where I have a problem with this. Isn't data just _data_?
Let's draw some parallels to real life. If I go to a public space like a town square, can't I take pictures, notes, and recordings, then go home and draw my analytics from them? If I read something in a book I bought, can't I quote it?
The same should hold for web resources, even creative ones: as long as I don't publish them, I should be able to scrape whatever public resources I want and use them in my analytics, machine learning, or whatever.
This is why I strongly prefer the Dutch term 'auteursrecht' (author's rights) as opposed to copyright. Copyright has this annoying incorrect connotation that it has anything to do with copying when it's really publishing that it should be limiting.
Downloading publicly available data should (by definition of public) not be a violation of someone's rights. However it's easy to see why it wouldn't be desirable for someone to republish creative works as their own, so it's reasonable to give the author control over how their work should be published.
And in the case of price data or similar you would be hard pressed to deem anyone the 'author' of it, hence it would be weird to enforce the author's rights.
> Copyright has this annoying incorrect connotation that it has anything to do with copying when it's really publishing that it should be limiting.
Copyright does make _copying_ tortious. Broad personal-use exceptions in the USA, for example, make this appear not to be true, but it is the act of copying - even without publication - that is restricted in general.
Ripping a CD in the UK, for example, is copyright infringement; there is no general personal-use exception. (There are exceptions under Fair Dealing, but whatever you're doing almost certainly doesn't fall into them.)
See e.g. UK CDPA 1988, Chapter II, section 16(1)(a); or 17 USC, Chapter 1, §106(1).
You are discussing the fair use provisions of copyright law.
Not a lawyer, but:
You can do all of that, but:
You cannot scan the book you bought and put it on your website, for sale or even for free - unless its copyright has expired or you are given permission by the copyright holder.
You cannot take a picture of someone's painting in high detail and then sell prints of it - unless its copyright has expired or you are given permission by the copyright holder.
Your examples are really wanting greater freedom to copy rather than about the distinction between data and creative work. Copyright is supposed to encourage people to make creative work, not encourage people to record existing facts. I think this distinction is important because creative work isn't actually necessary to anyone else - they could create their own different one if they wanted. But data might only have one correct value and if that was locked away by copyright, it would limit other people's ability to do things that can't be done with some different data.
Think of Law around data as using dependent types. The legal protections depend on the type of the data, and the type depends on the content (among other things). You have to determine the type BEFORE you can tell what the law says about it, since the law only cares about the type. You could probably encode the law nicely with something like Idris, but any "code as law" type governance system without dependent types won't be able to express existing law.
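Even without dependent types, the "classify first, then look up the rule" flow is easy to illustrate; here is a plain-Python sketch (the classification rules are invented purely for illustration, and nothing here is legal advice):

```python
from enum import Enum, auto

class DataKind(Enum):
    CREATIVE_WORK = auto()   # copyright applies
    PERSONAL_DATA = auto()   # data-protection law applies
    FACT = auto()            # largely unprotected

def classify(record: dict) -> DataKind:
    """The rule that applies depends on what the content *is*, so
    classification has to happen before any rule can be looked up."""
    if record.get("expressive"):
        return DataKind.CREATIVE_WORK
    if record.get("identifies_person"):
        return DataKind.PERSONAL_DATA
    return DataKind.FACT

def applicable_regime(record: dict) -> str:
    return {
        DataKind.CREATIVE_WORK: "copyright",
        DataKind.PERSONAL_DATA: "data protection",
        DataKind.FACT: "none (facts are free)",
    }[classify(record)]
```

What dependent types would add is making that ordering unrepresentable to violate: you couldn't even write a function over "data" without first naming its kind.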
No. At the risk of just repeating the comment you didn't understand, creative works are not "just data" - they are copyrightable works that the owner has control over who can use them, not just for profit, but for any reason with few exceptions.
You don't just get to drop someone else's work product into your algorithm without their permission.
I don't think using data as input to an algorithm necessarily breaks copyright law.
I can read a book and post my impression of it somewhere, right? I can read it and say "it was beautiful" on Twitter.
I can then automate my "taste meter" through machine learning: it reads a given book character by character and spits out what I'd think of it if I had actually read it, then posts "it was beautiful" on Twitter.
You can't take something copyrighted by someone else and re-distribute it without their permission. However, I suspect you can capture it freely if you don't re-distribute it.
Neat straw man but you're actually proving my point. There are scenarios under which they can't do that (fair use) but there are also many scenarios where they would be entirely within their right to do so.
Probably not since each star review is a separate "work" by a separate author. Mechanically combining multiple non-copyrightable things into one doesn't make it copyrightable. If Amazon arranged their users' star reviews into an infographic that would be copyrightable.
I think perhaps the layout, cover letter, and maybe any flourishing notes are copyrightable, but the actual details of work experience and education are not.
There's some huge datasets of Amazon reviews available. Stanford has a big scrape out there, plus there's one from Amazon themselves in the AWS datasets.
I believe South Park used a video clip from YouTube, and YouTube's ContentID system then removed the original video South Park had used, because YouTube considered it a violation of South Park's copyright.
Just because YouTube gets it wrong doesn't mean it's just theory. YouTube is not the only site that has automated content scanning for copyright violations. Getty and other photo sites have gotten this wrong in the same way by sending C&D letters for violations to the actual copyright holders.
I'm glad scraping isn't criminal. Applying the CFAA here is ridiculous. But saying LinkedIn can't put technical measures in place to prevent scraping seems like a huge stretch to me. Why should they have to pay server costs for persistent scraping, particularly from a company that is actively trying to harm them?
So many ideas start to come to mind if scraping is legal.
Can we start to scrape Google Search in order to bootstrap building an alternative to Google Search? Search is a really hard problem (that somebody should tackle), but if we can leverage what Google has already scraped from the web and associated with popular search terms, we can use that to help train and validate our search model.
Can we scrape Reddit, Twitter, or Facebook in order to stand up a competing service that strips out all the ads? It's hard to bootstrap a social media website, but if you can import all the content from the existing giants, your site is no longer a wasteland.
Can we finally scrape and get rid of IMDB? I'd love to put all of their content on a wiki and be done with it.
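For anyone experimenting with scraping ideas like these, the polite baseline is to at least honor a site's robots.txt before crawling. A minimal sketch using only Python's stdlib (the rules below are hypothetical and parsed offline; a real crawler would fetch the site's actual robots.txt first):

```python
from urllib import robotparser

# Hypothetical robots.txt rules, parsed offline for illustration.
rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 10
""".strip().splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Public pages are allowed, the disallowed prefix is not.
print(rp.can_fetch("mybot", "https://example.com/profiles/alice"))  # True
print(rp.can_fetch("mybot", "https://example.com/private/admin"))   # False
```

Honoring `Crawl-delay` (available via `rp.crawl_delay("mybot")`) also keeps you out of the "imposing server costs" argument that comes up later in this thread.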
Seems like a hard problem to legally solve. I can see so many valid use cases for bots to scrape pages. But in all of your examples, I'm inclined to say that it shouldn't be allowed.
Maybe it falls into a "fair use" situation? Obviously copying an entire website would not be considered fair use, but something like scraping a bunch of public profiles on Steam to get aggregate data on what games are played the most seems totally valid.
Hopefully it doesn't end up with everything gated behind a sign-in and a TOS.
> Can we scrape Reddit, Twitter, or Facebook in order to stand up a competing service that strips out all the ads?
Even if web scraping was definitively legal (this preliminary injunction doesn't mean that), that doesn't mean you can bypass the content creator's copyright. Non-copyrightable functional data is one thing, but copying all of Reddit, for example, would include copying https://www.reddit.com/r/WritingPrompts/ and that would definitely be violating the rights of the authors.
> Just because you can scrape the content legally does not mean you can also republish it on your own website.
Except IMDB copied all of its data by scraping publicly available data posted to Usenet back in the day. And they still rely on volunteer contributions. [1]
>>Most importantly, the appeals court also upheld a lower court ruling that prohibits LinkedIn from interfering with hiQ’s web scraping of its site. This fundamentally changes the balance of power in dealing with such cases in the future.
This, I don't agree with. I agree that it's not fraud to send a bot to scrape public info, but the site should have every right to block a bot or person.
This doesn't sit well with me either. However, LinkedIn are trying to redefine a fundamental principle of the web, i.e. easy access to publicly available information, and they're doing it simply to protect their commercial interests.
It would be terrible to see the web compromised by a spat between two (in my opinion) scummy companies, and I think we could do with some hard push-back against attacks on the web generally.
One question for those who dislike web scraping because they deem it to infringe copyright law.
Given that Google scrapes LinkedIn public profiles and adds data from them to its index and search results, is it not discrimination that Microsoft tries to block hiQ and not Google?
1. Split the public setting into two settings: public for everyone, including scrapers, and a second option that makes profiles public only to LinkedIn users (i.e., banning scraping [1]).
2. Everyone on LinkedIn whose current setting is public would go through a dialog to choose one, with the default set to ban scraping (which is hard to criticize, since that option protects privacy more).
3. If you allow scraping, the form should warn you about the consequences and make you acknowledge them (which is again good practice if you want to expose your information to EVERYONE on the internet).
4. Now you have users' consent to protect their data from scraping.
Sounds like a dark pattern? Kind of, yes, but for a good cause, and it's hard to argue against since it increases privacy.
[1] The idea here is that I can allow anyone on LinkedIn to see my profile, whether it's a potential employer or someone who met me at a conference (i.e., to improve my LinkedIn experience). But I do not want that data harvested by a third party for any reason (especially because they would then use it to send me spam / advertising / prefill my profile on a different website / ...).
Legal: It will be hard to argue in court that you have the right to create fake accounts. That is definitely against the ToS, and at that point LinkedIn can sue the company instead, so I believe the scraper would not be successful in court.
Technical: This is actually what you want, as limiting scraping per user is easier than per IP, since an IP can be shared by many users, so your rate limits have to be higher per IP address than per individual user. Additionally, creating fake accounts is more work, which in the end makes scraping more expensive.
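To make the per-user vs. per-IP point concrete, here's a minimal sketch of a keyed rate limiter (hypothetical limits, stdlib Python only). The same class can enforce a tight per-account budget and a looser per-IP budget, since many legitimate users may share one IP behind a NAT:

```python
import time
from collections import defaultdict

class RateLimiter:
    """Sliding-window limiter; the key can be an account id or an IP address."""
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(list)

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        # Drop hits that fell out of the window, then check the budget.
        self.hits[key] = [t for t in self.hits[key] if now - t < self.window]
        if len(self.hits[key]) < self.max_requests:
            self.hits[key].append(now)
            return True
        return False

# A per-account limit can be much tighter than a per-IP one.
per_user = RateLimiter(max_requests=2, window_seconds=60)
assert per_user.allow("user:alice", now=0.0)
assert per_user.allow("user:alice", now=1.0)
assert not per_user.allow("user:alice", now=2.0)   # over budget
assert per_user.allow("user:bob", now=2.0)          # separate key, fresh budget
```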
This is phenomenal timing for Clearview AI, who in the last week was exposed by the NYTimes for working with law enforcement to identify suspects via their database of web-scraped images of individuals.
So, if a company sells a product based on scraping Google search results, would Google trying to block the scraping constitute "malicious interference with a contract"?
I ran a company based on web scraping for a couple years and I heard never ending comments about how what we were doing was illegal. Thank god that conversation is over.
Just wow. Don't create throwaways to comment, especially ridiculous comments that are objectively false.
Not every website is selling something, and there is no good-faith way in the world to think of saving the result of an HTTP call, from a service designed to serve HTTP calls, as "abuse."
Let’s not pretend this is a pure win. There are good uses of web scraping, like Archive.org trying to preserve the web. But what HiQ is doing is looking at public LinkedIn profiles and then snitching to employers if they think an employee is searching for a new job.
It’s easy to blanket say “web scraping is legal, do what you will“. The tricky part is protecting people’s public data while not giving a huge moat to giant corporations who control it.
That's the thing. Web scraping isn't really the problem here. It's what companies are doing with personal information. If LinkedIn started doing the same thing as HiQ, it would be just as bad (probably worse), but the legality of web scraping is irrelevant to that.
That's a good point, and we certainly should write our data protection laws to prevent LinkedIn from doing the same thing — but there's a crucial difference between the two.
I've consented to give my data to LinkedIn, and I can withdraw my consent and data if they start doing something I don't like. On the other hand, hiQ has vacuumed up my data without my consent, and there's really no way for me to stop them other than retreating from my public profile. Certainly the legality of web scraping is relevant to that.
I guess that's the issue. We need laws to control what companies can do with personal information, even if it happens to be publicly available. I don't think scraping itself is really the issue. If you used Mechanical Turk to hire a bunch of people to go look at user profiles and write down information about them, you'd have the same problem.
This case does not make web scraping legal or illegal; it just sets a precedent that the Computer Fraud and Abuse Act (CFAA, 1986) cannot be directly applied to web scraping of public data for fair use. Web scraping can be legal with caveats: as long as you are just scraping public information and do not resell it for profit without significant transformation and value addition, i.e., the fair use doctrine applies.
If you use bots to log in with a username and password and then scrape the information, it's still wrongful infringement, as the act of logging in binds you to an implicit contract with the website, and by logging in you accept those terms. The ruling in hiQ vs. LinkedIn is quite nuanced, and if you are crawling and then repackaging that information and selling it somehow, there is a good chance the precedent in this case won't apply.
Technical barriers like rate limiting and captchas are a legitimate way to guard against not just web crawlers but denial-of-service (DoS) attacks, so a lot depends on how it's framed by the lawyers. In general, websites can still continue to block scrapers, as Google, Facebook, Amazon, and all the big sites do; that's not illegal. I found a deeper understanding of this case in an explanation from the EFF. [1] [2]
Anyone ever read any stories about the concept of "agents", where everyone had their own computer agents that did all the work and talked to other agents to handle all the steps: book your flights and tickets, order your food, collect/collate data for your purposes, etc.?
We need to make the internet "Agent" friendly.. we should stop assuming the end user (end human?) will ever see any webpage on the internet.
Having an agent proxy for you won't last long. If that agent has connections to your friends' agents, your place of work, etc., then the protection provided by the proxy whittles away quickly.
To find you, all anyone would need is a photo of you, match it to an existing photo tagged as your agent, and your agent's online presence is clearly identified.
You could rotate to a new agent every time you go online, or randomly while online, but forget about making purchases, maintaining friendships or building "cred".
The only sure way to separate you, the person, from your online presence is to not have an online presence. For the company in question above, the only way to isolate yourself from their "running to tell Mom" business model is to simply not put your resume up on LinkedIn.
This ruling is troubling in many ways. HiQ can scrape the web, and now, apparently, so can every other personal information brokers out there. (Ref: https://en.wikipedia.org/wiki/Information_broker, which naively suggests this information is only important to advertisers; it is also of interest to governments, police, employers, even the parents of the S.O. you intend on marrying, or the S.O. themselves.)
Is there a company that will track your online posts while you are supposed to be working, and "run and tell Mom" that you were doing something other than work? As an employer, would you pay for such a service?
I'm not even convinced that air-gapping your person from the internet would work in this new world. A lack of online presence, while not damning in itself (yet), could indicate "something to hide."
> The CFAA is adopted to prevent deliberate intrusion on someone else’s computer — in particular, computer hacking
Hacking is commonly used to mean different things, and we infer from context what the author really meant.
Do any lawyers know if computer hacking has a legal definition, or if this decision will lead to a specific legal definition of the term?
From what I understand it is very loose and can mean "you accessed our systems legitimately, but we didn't want you to".
I think it was last year that Facebook went after fake-likes / fake-followers companies who were logging in via the login page and then liking / following. Companies in China / New Zealand and I think NY? got threatened with the CFAA (unsure if it went further), but it made the tech news at the time.
I don't think anyone here would call writing a bot that logged in with real credentials and performed an action (the same as a real person could) hacking, but Facebook was saying that since it broke the ToS, it was unauthorized, thus hacking laws apply. With their budget, I guess they get to decide what hacking is.
So, the whole of Google's search is based on scraping publicly accessible information from public websites. It's no surprise that the practice is declared legal; otherwise, the demand would effectively be for Google to liquidate Google Search unless it signed an agreement with every single site in the index.
IANAL, but IMO LinkedIn goofed by using the CFAA. eBay and BiddersEdge had a similar case, and eBay won, preventing scraping of its website using trespass:
I can see both sides of this, and am not sure which is better. If a site is not allowed to block traffic, does that mean blocking a DDOS attack is illegal? What if the company doing the scraping is so aggressive about it that they are impacting performance?
One fair way (to me) would be to figure out how many pages a typical user looked at per month, and if a company is scraping the site, they are limited to accessing it like one customer; they can't launch 100 threads to scarf down data as fast as possible. This probably wouldn't work for most scrapers' businesses, but it seems reasonable from the origin web site's POV.
Another possibility would be to require web sites that accept and then publicize user submissions to give users a choice about whether they want to make their information public. If they do, the hosting site has to include the information in a publicly-accessible feed, like a daily compressed download. This seems reasonable to me, because the hosting site is getting all kinds of information for free from users. It's not like it would take a lot of effort to do a streaming JSON dump of the stuff that changed every day, and if necessary, they could throttle the download rate, but not so low that the data for a day couldn't be downloaded in a day. Ie, a competitor might be a day behind, but no more. Of course, the competitor would have to publish a feed too, allowing other people to lunch off of them just like they are lunching off other sites.
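As a sketch of how cheap such a feed would be to produce, here's a hypothetical daily dump as gzipped JSON Lines (made-up records, stdlib only; a real site would write to a file and serve it over HTTP):

```python
import gzip
import io
import json

# Hypothetical day's worth of changed public profiles.
changed = [
    {"id": 1, "name": "Alice", "headline": "Engineer"},
    {"id": 2, "name": "Bob", "headline": "Designer"},
]

# Producer: one JSON object per line, gzip-compressed (JSON Lines).
buf = io.BytesIO()
with gzip.open(buf, "wt", encoding="utf-8") as f:
    for record in changed:
        f.write(json.dumps(record) + "\n")

# Consumer (e.g. a competitor mirroring the feed) streams it back,
# one line at a time, without loading the whole dump into memory.
buf.seek(0)
with gzip.open(buf, "rt", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

assert records == changed
```

Line-oriented output is what makes throttling easy: the server can meter bytes per second on the download without the client losing the ability to resume or process partial data.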
With LinkedIn, all of the information they obtain and publish is submitted by users, so for LI to say they "own" this information makes no sense. At least with eBay, the current auction price is not something the user submitted, assuming it has bids.
Ugh, this is yet another poorly written coverage of a court case that is more complicated than simple snippets and one liners allow.
This absolutely does not say that web scraping is legal (the ruling specifically notes other ways that it could be illegal).
It doesn't even say for sure that web scraping isn't illegal under the CFAA. This is a ruling about a preliminary injunction, and as such all it determines is that hiQ has a good enough argument and valid enough concerns that a preliminary injunction is warranted until LinkedIn wins a suit against hiQ to force them to stop scraping.
Take the time, and read the linked ruling. Don't trust poorly written and researched articles from companies that stand to gain from a particular interpretation.
I have a question about the company in the article, HiQ. Their website says they identify flight risks. Does that mean if your company has a contract with them, then they will notify your boss if your LinkedIn looks like you are job hunting? Haters.
I guess this means the web is lawfully seen as belonging to the public which seems both good and bad, because while it might be publicly accessible as a whole, each individual site is privately owned.
I'm pretty sure with any brick and mortar business, as a business owner you are legally allowed to disallow service to anyone without having to provide a reason.
That would be the equivalent of allowing to block bots / unwanted traffic on your site, except the brick and mortar case is way easier to prevent. If the person won't leave your physical business, you can have them removed by the police but to prevent a bot the burden is fully on you as a site owner.
> I'm pretty sure with any brick and mortar business, as a business owner you are legally allowed to disallow service to anyone without having to provide a reason.
It depends. You aren't allowed to restrict access for protected reasons (e.g., banning all black people from your business is a clear case of discrimination and is specifically illegal), but otherwise you have a lot of latitude to refuse persons entry. That "without having to provide a reason" is only apparent because it is very infrequently challenged: if someone feels they were wrongly refused service, they absolutely have the right to sue and require you to state why the ban took place. But, as your business is owned by you, a ban made by you is legal until overturned, either early (as in a preliminary injunction) or as a result (after the ruling on a case).
It's hard to pull real world analogies for this case, as well, since Amazon is never worried about other customers being dissuaded from a purchase because I'm shopping in my boxers, or am shouting and creating an unsafe environment - so there's a limit to how well you could tune a real world example to this scenario... but I might give it a try with this:
I run an absolutely 'grammable business, there's bubbly pink champagne and gold everywhere - including the stoop. I like to use that fancy stoop to attract new customers but I've been a bit concerned lately that a city tour guide has regularly been bringing around their tour and they're taking selfies of my business from across the street - that tour guide is profiting from the decoration I've put into my stoop and some of those tourists might've been willing to pay the modest $300 fee to walk in and take selfies inside! So I try to get the city to restrict the tour guide from being able to lead his tourists to take selfies from across the street. It's very possible I could remove the glitzy decorations from my stoop but that would impair my advertising, which I don't want to do - instead I want to legally restrict people from using my stoop in a manner I did not intend.
I think that's pretty apt in a few ways but it also misses the fact that LinkedIn is technically paying a bit of money every time someone loads a page - and if the tour guide was noticeably damaging my stoop I would be able to sue him for damages and probably win, but the cost of that page load is so utterly negligible that maybe we can ignore it.
> as an owner you are legally allowed to disallow service to anyone without having to provide a reason
as a website owner you absolutely have the right to limit who does or doesn't visit your site, but that's up to you to enforce. Scraping is literally equivalent to reading a giant public billboard and writing what it says somewhere else. How could that be illegal?
> Scraping is literally equivalent to reading a giant public billboard and writing what it says somewhere else. How could that be illegal?
If you treat the website you're visiting as a privately owned business then you're trespassing on private property by scraping their site, assuming that site is trying to prevent that behavior.
If you don't treat the website as a privately owned business, then what do you classify it as? It can't be considered in the public domain because someone owns, operates and pays for the resources to make the site work. It is the site owner's private property IMO. By having the site public, they are inviting the world to check it out but they should have the right to disallow service.
Also, if the internet is supposed to be distributed and each site is an independent node on the system, how is that any different than an independently owned brick and mortar business operating in some location? In this case the physical world is "the internet".
It's not a clear-cut thing, and I hate the idea of censorship, but I can't see this case's outcome becoming the norm. There are too many loopholes. Like, is hitting your site 50 thousand times a second, hoping to get new information from the public billboard, a legal move?
Or, to put it another way, I'm pretty sure if you were able to mind control people and you commanded a billion people to flood a physical business so that it could not operate and serve its customers this would be quickly seen as an unlawful move in the physical world. You would probably get shut down by the state or government too for disrupting service for neighboring businesses and citizens.
You treat it as a public-facing building storefront, with you standing outside and taking pictures.
As long as there are big glass windows, the patrons and the information gained from looking inside are not subject to privacy, and are thus freely accessible to anyone walking by.
What you're not entitled to is the backend workings of the website - the application/code/databases, and credentials. It would be akin to going into a restaurant and forcing yourself into the food prep area and stealing the owner's keys, and contaminating the food that patrons eat.
Scraping too often would be something akin to setting up a camera tripod right in the building entrance, hindering the influx of new customers. While not explicitly illegal, it does hinder the operation of the business, and you will probably be physically removed at some point (this is banning abusive bots, etc.)
> Yeah, but this is slightly different. This is about a business owner trying to control who can look through the windows.
Is it any different tho?
I don't think there's a clean analogy for window shopping on the web.
If you visit example.com to take a look at the site then you are really inside of example.com's business / store front. Their servers are spending resources to serve your traffic. The doors have been opened.
I also think a brick and mortar business has the right to prevent you from (literally) window shopping too because if you enter the land that they own (or are renting) and they do not want to serve you then you are trespassing on private property.
This makes me think of Aaron Swartz. If he had the strength to keep from harming himself, his whole sentence would have likely been vacated. This is a reminder to those of us with suicidal thoughts: things can always get better.
Please note that I am personally more pro-scraping than anti. However, this article's interpretation of this ruling (which is from September 2019) is overblown and suspect:
1) This is the 9th Circuit only. The 9th Circuit is known for being very liberal in these kinds of rulings. There is a high likelihood that this ruling could be interpreted differently in other circuits and when it goes to the Supreme Court.
2) Saying "it makes it legal" is more of a technicality with this ruling and NOT a practicality. There is still a LOT open to interpretation with this ruling (see 1, above), and a lot of what the article says just isn't found in the case. And again, it doesn't apply everywhere.
3) This is actually a much better summary of the case: https://www.natlawreview.com/article/data-scraping-survives-... In fact, there's a number of better summaries are out there by people who actually understand both technology and law from when it happened (again, back in September 2019). Here's some others:
Honestly, it comes back to a lot of existing issues, like whether facts can be copyrighted (e.g., this is why recipes alone can't be), whether logins imply privacy, etc.
I personally think the ruling is in the right direction but again, I don't think this source deals with any of the complexity or the reality of what the ruling does (or still doesn't) mean.
Finally some common sense. I’m also somewhat impressed that it “only” took 3 years from the initial lawsuit to reaching this precedent. I expected this to go on for a lot more time.
This only affects the ninth circuit—which includes the tech hubs San Francisco, Seattle, LA, and Portland. It would only apply to the rest of the country if the Supreme Court affirmed it. Even then, a well-funded company or zealous prosecutor could say that it doesn’t apply in your case because of some technicality. In that case you would need hundreds of thousands or millions of dollars and a few years to litigate the issue with no guarantee of the outcome.
> This only affects the ninth circuit—which includes the tech hubs San Francisco, Seattle, LA, and Portland.
It is only binding precedent in the Ninth Circuit, but it is less accurate to say it only affects the Ninth Circuit, since decisions have effects other than as binding precedent.
Yes, that's why it is not binding elsewhere, which doesn't mean it has no effect. Particularly, if another Circuit has not issued a conflicting ruling, the Ninth Circuit ruling can be cited in and relied on by trial courts in that circuit as persuasive, rather than binding, precedent, so it can have an impact from the very earliest stages of the process.
A circuit split is also a reason for the Supreme Court to take a case, so the Ninth Circuit decision, without being binding, makes it less likely that any conflicting decision by another circuit will be the final resolution of the case in which that conflicting decision is issued, which is an important effect at the other end of the process.
Across the land within their circuit, over matters within their jurisdiction. Elsewhere the ruling is merely advisory in nature.
It gets tricky with nationwide actors though.
Besides some specialized topics like patents and international trade, nationwide orders and injunctions are the sort-of exception, which are based on a courts local jurisdictional power over a non-local nationwide actor.
I'm not sure what the generalized name of the principle (beyond injunctions) is called (I've heard it described as a type of jurisdictional overreach), but the presumption as applied here is that LinkedIn, being in the 9th Circuit's jurisdiction, also adheres to the ruling outside the circuit, absent a contradictory ruling by a different circuit.
(One of the usual requirements for the Supreme Court to even hear a case is that different circuits have conflicting rulings on a matter.)
But it looks like the Supreme Court is about to seriously rein that nationwide power in soon, at least for judges issuing orders to departments/actors of the executive branch.
> One of the usual requirements for the Supreme Court to even hear a case is that different circuits have conflicting rulings on a matter.
That's not a “usual requirement”, it's one of many factors that can weigh in favor of the Supreme Court exercising discretionary appellate jurisdiction (and it's one that weighs very heavily in favor of it, even when no other favorable factors are present, since federal law meaning the same thing everywhere is an important principle.)
Circuit court rulings are usually only binding precedent within their own circuit. However, the Court of Appeals for the Federal Circuit has exclusive appellate jurisdiction over certain subject matters (e.g., patents), so it's supposed to follow the appropriate precedent for stuff outside its remit, and its precedent is binding on everybody for stuff inside its remit.
That said, it is not unusual for a court to look to rulings in other jurisdictions to decide a matter if there is no binding precedent in place. They are not required to, however.
The article's title gives me cognitive dissonance. Glad it was changed for HN. To add my own interpretation: in "US court appears to legalize X and technically prohibits it", what is this "it" the court prohibits, and why do I want to assume it's the title's subject, which doesn't make sense at all? You don't "legalize X and technically prohibit X". Was the originally intended title truncated somehow?
I always thought scraping was a fun idea, I just couldn't find the right use-case for it. I'm not a sports guy, and the big sites have pretty extensive API's. Something music related would be of interest to me, but I can already get updates on events like concerts (since the monopolies make a fortune selling tickets). I'm not sure what could be useful.
I once scraped job sites to get a mile high sense of the tech industries in cities I was considering moving to.
EDIT: and to add, I've always dreamed of a website that gathered job, rent, entertainment, etc data on cities you're interested in to help figure out where are good places to move to. A site like that would have to be pretty much entirely built from scraping.
I built https://awardfares.com together with a friend; it scrapes airlines' award seat availability. Airlines' websites are horrible from a UX perspective, so scraping the data and presenting it in a better way was a pretty obvious use case.
In the past the CFAA has been wielded haphazardly and even maliciously [1]. As noted by others these cases are far from over, but hopefully we’re shifting towards a healthier balance between open web and private information.
Native apps are generally a lot easier to scrape since they rely on an API which can't be changed willy nilly without breaking compatibility with older apps. Also you can't do captcha etc on API calls in the same way you can on websites. And of course the data is neatly formatted to be machine readable.
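To illustrate (with made-up payloads, not any real app's API), here's the same datum extracted from a JSON response versus from rendered HTML. The API path is one stable line; the HTML path is a stateful parser that breaks whenever the markup changes:

```python
import json
from html.parser import HTMLParser

# The same datum, as an app's API might return it vs. as a page renders it.
api_payload = '{"user": {"name": "Alice", "title": "Engineer"}}'
html_page = '<html><body><h1 class="name">Alice</h1></body></html>'

# API: stable keys, machine-readable by design.
name_from_api = json.loads(api_payload)["user"]["name"]

# HTML: scraping logic coupled to presentational details (tag, class name).
class NameParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_name = False
        self.name = None
    def handle_starttag(self, tag, attrs):
        if tag == "h1" and ("class", "name") in attrs:
            self.in_name = True
    def handle_data(self, data):
        if self.in_name:
            self.name = data
            self.in_name = False

parser = NameParser()
parser.feed(html_page)
assert name_from_api == parser.name == "Alice"
```

If the site redesigns that `<h1>` into a `<div class="profile-header">`, the HTML scraper silently returns `None`, while the API client keeps working: that's the compatibility constraint the comment above describes.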
Why do we have this editorialized title, @Dang? Why not the actual, neutral, title: "US court fully legalized website scraping and technically prohibited it"
BTW: This title is still not factual, but at least it's more neutral. It seems to be an injunction, which doesn't really mean "fully legalized" but what do I know....
Anyone know the laws around scraping news article meta-data (image, headline, publisher, URL) directly from a publisher's website? As long as I'm not scraping the full article, could I have my users share the scraped meta-data with each other?
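Not a lawyer either, but on the technical side, publishers usually already expose exactly that meta-data via Open Graph `<meta>` tags so that link previews work. A sketch of pulling it with stdlib Python (the page below is hypothetical):

```python
from html.parser import HTMLParser

# A hypothetical article page exposing headline, image, publisher, and URL
# via Open Graph tags, as most publishers do for social link previews.
page = """
<html><head>
<meta property="og:title" content="Court Rules on Web Scraping">
<meta property="og:image" content="https://example.com/lead.jpg">
<meta property="og:site_name" content="Example News">
<meta property="og:url" content="https://example.com/article/1">
</head><body>...</body></html>
"""

class OGParser(HTMLParser):
    """Collects og:* properties from <meta> tags into a dict."""
    def __init__(self):
        super().__init__()
        self.meta = {}
    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            if a.get("property", "").startswith("og:"):
                self.meta[a["property"]] = a.get("content")

og = OGParser()
og.feed(page)
assert og.meta["og:title"] == "Court Rules on Web Scraping"
assert og.meta["og:site_name"] == "Example News"
```

That these tags exist specifically so other sites can republish the snippet is an interesting wrinkle for the legal question: the publisher is arguably inviting exactly this kind of reuse.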
Dangerously misleading headline. There is a circuit split on the interpretation of the CFAA, including on what counts as “unauthorized access”. If you are sued outside of the Ninth Circuit, you might get very different results.
Reading this while heading in to work on a LinkedIn shuttle brought a smile to my face. I've always been embarrassed my company took this position given how much scraping LinkedIn does. Do as I say, not as I do I guess?
This wouldn't supersede any violations of the ToS, though. I guess it would mean they can't sue you (maybe?) but they could still kick you off their service. Could they still go after you for breach of contract?
ToS aren't signed. They are terms of service, not license agreements.
ToS merely need to be disclosed, and the site can decline service if you violate the terms. A license agreement may subject you to penalties for violating the agreement.
If your site has a substantial amount of crawlers, that’s an opportunity to offer an API subscription.
No one wants to “crawl” your data, they just want your data.
No software developer or PM ever said “it’d be great if this broke every couple of days and we had to scramble to figure out how to fix it” or “I’d love to jump through a bunch of hoops to read that H1”
Bot detection and crawling is an arms race. Bots will always win. I’ve droned on about this here in the past. [1]
Anecdotal evidence: Previous co-founder and software architect of a “crawler” that crawled ~17000 store and deal sites a day (like BestBuy, Gap, RMN). We circumvented all bot detection and were only busted and c&d’d once. By CouponCabin of all people.
I have a feeling this precedent makes Google happy, if not that they helped hiQ. While scraping should be legal, nobody should be preventing server owners from using any countermeasures they like
Not sure why bots are guaranteed access as if they are a normal user? Can I bar a mechanical bot from entering my open-to-the-public bodega? I think I might.
Just because something is possible technically doesn't make it OK legally. I think there are still various issues though; per GDPR, I don't think another company can just copy that data from LinkedIn. That it's easily visible doesn't matter for GDPR.
Of course it's legal. Anyone holding other ideas has a fundamental misunderstanding of how the web and public spaces work. If you want something to be private, then you make it private. If you make it public for the whole world to see, explicitly going out of your way to send the data to anyone who requests it, it's all on you. Just because someone isn't using Chrome doesn't make them unethical.
[0]: Clean-room reverse-engineered server-side replacement for Flutter, written in WebAssembly, which compiles source files to an .EXE using a .NET-enabled WebAssembly precompiler.
From this post, it doesn't look like web scraping has been established as a legal activity; it's more about what can be scraped by a bot and what cannot.
Of course it's legal. I don't know how you can make the argument that, if you have a public-facing website, you can arbitrarily make rules concerning exactly how that website is viewed.
Their only API, as far as I can tell, is their “custom search engine”, which doesn't appear to match the results of their public search and is ungodly expensive.
If it's not behind a login, it's fair game for scraping. The CFAA only explicitly applies once you (the service provider) enter into an explicit relationship of use with someone else.
I.e., if LinkedIn made an account necessary to even browse the currently "public" data, and set limits as a condition of being a user, that would be fine, and might give them a leg to stand on for future claims of misuse of their systems, since the behavior would already be covered by the default terms of use agreed to in creating an account.
As for making currently public data more difficult to access in response to scraping, I foresee business evolving even further in a direction where everything gets locked behind an initial contract step, so that companies reserve the right to refuse access.
I'm very wary of the whole malicious-interference-with-a-contract reasoning that hiQ employed. Relying on a public data source being accessible in a way convenient to you, in order to deliver a service to an unrelated third party, should not obligate the owner of that data source to keep delivering the data in that convenient way. That kind of reasoning would lead to ridiculousness like being able to sue the makers of the Yellow/White Pages for changing things up.
Further, I still see grounds for remedy for LinkedIn in that they do still have a right to refuse service to anyone, regardless of outstanding contracts entered into by that person. The only leg to stand on that I can see for hiQ is if the court limited the precedent being set to the case where LinkedIn identified them, notified them and requested them to cease and desist, hiQ refused (or hiQ offered to enter into an alternate business arrangement and LinkedIn refused to accommodate), and then LinkedIn implemented the interfering measure. In that circumstance, I can see a malicious-interference claim maybe being upheld, but I still have very little sympathy for hiQ, because whether the pages are public or not, they are imposing a significant cost on LinkedIn to serve that data if hiQ is systematically enumerating the entire public LinkedIn dataset.
To me, hiQ is a guy OCR'ing the White Pages/recording a storefront 24/7. The details of how the Internet operates changes that a bit, but nevertheless, it is there.
It is a fundamental continuation of a trend I've noticed in tech: it's a field where practitioners (and the businesses they write code for) assume, for whatever reason, that the platform running/servicing the code they write fundamentally "belongs" to them, and that any data their program can get access to is fair game.
I don't know where the disconnect is, or if maybe I'm just a freak in that I treat someone else's hardware as if I were entering their home. Even with the code I write.
You don't just waltz into someone else's space and start using all the amenities without asking. It's rude, and grounds to have admittance refused in the future. You don't do things to the system without asking.
It's like... Imagine most people are blind. You're a salesman walking into their space talking a good story or providing some service while completely ransacking their domicile for every shred of info you can possibly take with you. Filming the inside, the layout, rifling through mail and rolodexes, so on and so forth, while all they are aware of is explicitly what you tell them is going on.
As one who writes software, and automates tasks with computers, I feel it to be our duty to not facilitate that sort of behavior. I just wish I could figure out a way to spread that ethos and keep the bills paid.
Otherwise I fear we'll lose any benefit of an assumption of integrity we've managed to build up. Hell, maybe it's too late for that given the way things are going.
I'm seeing some confusion about how this affects people outside of the West Coast's 9th Circuit. It certainly affects you if you are in other circuits. There's no question that this precedent will be brought up in other courts, even if they aren't bound to uphold the decision.
IANAL, but a bit of a supreme court hobbyist. For those unfamiliar, a short lesson on how the federal court system works that I think would be useful for HN readers:
Generally speaking, a case first goes to District Court. There are 94 districts in the country and they tend to consist of small regions, for example, Northern California (note these are for federal courts... each state has its own state system, which operates differently in each state). So you first go to one of these courts when you bring a lawsuit.
Now, the lawsuit is decided and one side is unhappy with the verdict. In a lot of cases, they might say, "ok, I'm unhappy, but I'm also spending a lot of money on lawyers, so I'll accept the decision and move on". Or they have the right to appeal the decision. If they choose the latter, then the next layer of the court system, the Circuit Courts, comes into play.
The Circuit Courts consist of 11 "circuits" plus the DC [0] and Federal Circuits [1]. When you appeal a case to one of the circuit courts, they have to hear your case. It's your right to appeal. However, if they think the appeal lacks merit, they can affirm summarily, issuing an opinion without full briefing or oral argument. In other words, if they think the lower court issued the correct decision and that a full hearing would be a waste of time, they can look over the record of the case and decide without one. (Appeals courts never hold new trials in any case; they review the record from the court below.)
At the next level up, you have the Supreme Court [2]. Unlike at the circuit court level, you have no right to appeal to the supreme court. Generally speaking, they get to decide which cases they want to take, so if you think that the appeals court screwed up and the supreme court decides to not take your case, there's nothing you can do. Unlike appeals courts, they aren't even obligated to look over the facts of the case at all if they don't want to.
Instead, what happens is that you petition the Supreme Court to hear your case (a "petition for a writ of certiorari"). So you lose your case at the appeals court and you basically file paperwork with the SC saying "please please hear my case, here's why I think you should".
The SC turns down a lot more cases than it hears. So what makes them take a case on? One is if the case is super super important. Something like a case against Obamacare or something else of very high national significance. But typically, the SC takes a lot of boring cases too, and for the most part, this has to do with circuit splits. A circuit split is when two of the regional circuit courts issue conflicting rulings. So for example, if every circuit more or less rules the same way on a given issue, there's nothing for the supreme court to decide on. The system is working as intended. But if two circuits disagree, then the SC's job is to resolve the issue so that federal law is applied uniformly.
So in this case, the ruling was issued in the Ninth Circuit (which covers California and much of the West). Technically speaking, I can sue someone over scraping my site in Wisconsin (in the 7th Circuit) and the judge can rule in my favor (that scraping is illegal) since she isn't bound to follow 9th Circuit appeals rulings, whereas a district judge in Arizona or Montana (both in the 9th Circuit) is bound by the appeals court precedent. And then they appeal the case in Chicago, and the appeals court also isn't bound by the Ninth Circuit ruling. But you can be damn sure that the lawyers for the defense are going to bring up that Ninth Circuit case as precedent. And generally speaking, the precedent does matter (circuits don't want to create circuit splits).
Coming back to this case, does this mean that you have carte blanche to scrape websites anywhere in the country? No, case law is going to need to evolve more to get to the point where you can safely think that way. But, this is definitely an important step in that direction.
[0] Washington DC has its own circuit, even though it's just a city, not a region. This seems odd at first glance, but in fact, a lot of lawsuits against the federal government come through this circuit, which is why DC gets its own circuit, while for example, NYC does not.
[1] The United States Court of Appeals for the Federal Circuit is a special case. Most of these circuit courts have to do with regions. So if I commit a federal crime in Florida, it gets tried in a Florida District court and then it gets appealed in the seat of the 11th circuit, which is in Atlanta. However, certain cases, based on subject material, don't get appealed in Atlanta, but go instead to the Federal Circuit. These tend to be things with national ramifications. Patents are a prime example, where you really really don't want a patent being enforced in Iowa, but not in Alabama. So for these special cases, we set up a different appeals system.
[2] There are a few times when cases go straight to the supreme court. For example, in disputes between two states or cases involving ambassadors and other public ministers, a case might go straight to the SC and skip the lower courts. But this is the exception, not the rule.