
> Now many site owners are trying to put technical obstacles to competitors who completely copy their information that is not protected by copyright. For example, ticket prices, product lots, open user profiles, and so on. Some sites consider this information “their own”, and consider web scraping as “theft”. Legally, this is not the case, which is now officially enshrined in the US.

Does this mean we can now scrape e.g. YouTube videos, Amazon reviews, IMDB reviews, Facebook events ... ?



Yes, you can scrape them; no, you cannot republish them. Everything you listed is protected by copyright, and this ruling does not let you infringe on copyrights.

>hiQ argued that LinkedIn’s technical measures to block web scraping interfere with hiQ’s contracts with its own customers who rely on this data. In legal jargon, this is called “malicious interference with a contract”, which is prohibited by American law.

Does this mean that Google's random recaptcha check is interference?


I think any ruling that says LinkedIn can't put in protective measures against automated requests is doomed to be overturned, as long as they're not applying them in a discriminatory way. Captcha, rate limiting, user agent testing, etc. are all common tools to protect against malicious or unintentional denials of service. The question is what LinkedIn was doing, and whether it specifically targeted hiQ while permitting others of the same class of traffic.
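
To make "not discriminatory" concrete, here is a rough sketch of rate limiting that treats every client the same. This is only an illustration of the idea, not what LinkedIn actually runs; the `TokenBucket` class, its parameters, and the client names are all made up for the example.

```python
import time
from collections import defaultdict


class TokenBucket:
    """Per-client token bucket; every client gets the same rate and burst."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate      # tokens refilled per second (hypothetical value)
        self.burst = burst    # maximum tokens any client can hold
        self.clock = clock    # injectable clock, handy for testing
        # Each new client starts with a full bucket, whoever they are.
        self.state = defaultdict(lambda: (float(burst), clock()))

    def allow(self, client_id):
        tokens, last = self.state[client_id]
        now = self.clock()
        # Refill according to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.state[client_id] = (tokens - 1, now)
            return True
        self.state[client_id] = (tokens, now)
        return False
```

The client id only selects which bucket to use; it never changes the limit. A targeted block would look different: a special case keyed on one particular client.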


Why would it be an issue if it is discriminatory? LinkedIn can use its servers any way they like, unless they've promised their users that their data can be scraped indiscriminately.


Because of the court case. This is just an injunction pending an actual decision.


I'm curious how entities like https://www.omdbapi.com/ can continue their activity, get $$$ and not get shut down.


Yeah, what is the line here? Would it be against the rules to block known user agents or throttle traffic?


No, because what one side of a case argues is not the law. What judges decide is the law.


Probably not. Facts aren't copyrightable but creative works are.

So prices on Amazon.com are facts. User reviews are creative so probably copyrighted.

Similarly the videos on YouTube are copyrighted. However the number of views and the number of likes are probably scrapable.


See, that's where I have a problem with this. Isn't data just _data_?

Let's draw some parallels to real life. If I go to a public space like a town square - can't I take pictures, notes and recordings, then go home and draw my analytics from them? What if I read something in a book I bought, can't I quote it?

Same thing should be with web resources even if they are creative - as long as I don't publish them I should be able to scrape whatever public resources I want and use them in my analytics, machine learning or whatever.


This is why I strongly prefer the Dutch term 'auteursrecht' (author's rights) as opposed to copyright. Copyright has this annoying incorrect connotation that it has anything to do with copying when it's really publishing that it should be limiting.

Downloading publicly available data should (by definition of public) not be a violation of someone's rights. However it's easy to see why it wouldn't be desirable for someone to republish creative works as their own, so it's reasonable to give the author control over how their work should be published.

And in the case of price data or similar you would be hard pressed to deem anyone the 'author' of it, hence it would be weird to enforce the author's rights.


>Copyright has this annoying incorrect connotation that it has anything to do with copying when it's really publishing that it should be limiting. //

Copyright does make _copying_ tortious. Broad personal use exceptions in the USA, for example, make this appear not to be true, but it is the act of copying - even without publication - that copyright restricts in general.

Ripping a CD in the UK, for example, is copyright infringement because there is no general personal use exception (there are exceptions, under Fair Dealing, but whatever you're doing almost certainly doesn't fall into them).

See e.g. UK CDPA 1988, Chapter II, section 16(1)(a); or 17 U.S.C. Chapter 1, section 106(1).


You are discussing the fair use provisions of copyright law.

Not a lawyer, but:

You can do all of that, but:

You cannot scan the book you bought and put it on your website for sale or even for free - unless its copyright has expired or you are given permission by the copyright holder.

You cannot take a high-detail picture of someone's painting, then sell prints of it - unless its copyright has expired or you are given permission by the copyright holder.


In addition, there are some buildings and landmarks that you can't simply take photos of and then resell

https://www.rd.com/advice/travel/eiffel-tower-illegal-photos...
http://www.photographers-resource.co.uk/photography/Legal/Ac...


Your examples are really about wanting greater freedom to copy rather than about the distinction between data and creative work. Copyright is supposed to encourage people to make creative work, not to encourage people to record existing facts. I think this distinction is important because a creative work isn't actually necessary to anyone else - they could create their own different one if they wanted. But data might have only one correct value, and if that were locked away by copyright, it would stop other people from doing things that can't be done with some different data.


As far as law is concerned, data is not just data -- bits have colour:

https://ansuz.sooke.bc.ca/entry/23


Additionally, some public areas prohibit photography of architecture because of copyright.

https://www.diyphotography.net/10-famous-landmarks-youre-all...


> Isn't data just _data_?

Think of Law around data as using dependent types. The legal protections depend on the type of the data, and the type depends on the content (among other things). You have to determine the type BEFORE you can tell what the law says about it, since the law only cares about the type. You could probably encode the law nicely with something like Idris, but any "code as law" type governance system without dependent types won't be able to express existing law.
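
A toy sketch of the idea, in Python rather than Idris. Python has no dependent types, so the "type" here is computed at runtime. The `Kind` enum, the `has_creative_expression` flag, and both functions are invented for illustration; they only encode the facts-versus-creative-works distinction made elsewhere in this thread, not actual law.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    FACT = "fact"
    CREATIVE_WORK = "creative work"


@dataclass(frozen=True)
class Item:
    description: str
    has_creative_expression: bool  # hypothetical stand-in for a legal judgement


def classify(item):
    """Step 1: the 'type' is determined by looking at the content."""
    return Kind.CREATIVE_WORK if item.has_creative_expression else Kind.FACT


def protection(kind):
    """Step 2: the rule only ever sees the type, never the raw content."""
    if kind is Kind.CREATIVE_WORK:
        return "copyright applies"
    return "no copyright in the fact itself"
```

So `protection(classify(item))` is the only valid order: you cannot ask what the rule says until the content has been classified.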


> Isn't data just data?

No. At the risk of just repeating the comment you didn't understand, creative works are not "just data" - they are copyrightable works, and the owner controls who can use them, not just for profit but for any reason, with few exceptions.

You don't just get to drop someone else's work product into your algorithm without their permission.


There are cases where "dropping into your algorithm" would count as fair use such as a search engine of copyrighted content.


> You don't just get to drop someone else's work product into your algorithm without their permission.

Why not?


Because copyright law exists.


I don't think using data as input to an algorithm necessarily breaks copyright law.

I can read a book and post my impression of it somewhere, right? I can read it and say "it was beautiful" on Twitter.

I can then automate my "taste meter" through machine learning: it reads a given book character by character and spits out what I'd think of it if I actually read it. Then it posts that on Twitter, saying "it was beautiful".

Did I break copyright law? I don't think so.


You can't take something copyrighted by someone else and re-distribute it without their permission. However, I suspect you can capture it freely if you don't re-distribute it.


I think the fashion industry should exert their right to have their work removed from photographs.


Neat straw man but you're actually proving my point. There are scenarios under which they can't do that (fair use) but there are also many scenarios where they would be entirely within their right to do so.


> User reviews are creative so probably copyrighted.

I wonder if the number of stars is copyrighted. It's not creative, but a fact.


Probably not since each star review is a separate "work" by a separate author. Mechanically combining multiple non-copyrightable things into one doesn't make it copyrightable. If Amazon arranged their users' star reviews into an infographic that would be copyrightable.


Why would a review be a copyrightable creative work, while a LinkedIn resume wouldn't be?


I think perhaps the layout, cover letter, and maybe any flourishing notes are copyrightable, but the actual details of work experience and education are not.


Yeah, I would think the "description" section for each job would be copyrightable, but the simple "title", "company", "year" fields would not be.


There are some huge datasets of Amazon reviews available. Stanford has a big scrape out there, plus there's one from Amazon themselves in the AWS datasets.


YouTube videos are definitely protected by copyright, though.


In theory, right?

See the South Park WWITB issue.

I believe South Park used a video clip from YouTube, and YouTube’s ContentID system removed the video South Park had used, because YouTube considered it a violation of South Park’s copyright.


Just because YouTube gets it wrong doesn't mean it's just theory. YouTube is not the only site that has automated content scanning for copyright violations. Getty and other photo sites have gotten this wrong in the same way by sending C&D letters for violations to the actual copyright holders.


I was specifically discussing YouTube.


Shouldn't the copyright belong to the creator, not to YouTube? Basically, YouTube shouldn't be able to sue you; it should be up to the creator to do so.



