So, people feel some sympathy for the author of this submission because they can see work being taken without the author being paid for it.
> Defending copyrights as an individual isn’t an easy matter.
But the same people seem to fucking hate Article 11, which would forbid these sites from copying the entire article and would give the author of the material real options to take action.
I don't think Article 11 or 13 is particularly good, but this submission does a good job of showing why some people think they're needed.
Existing copyright law already forbids you from copying and redistributing the creative works of others. More laws won't stop anyone from violating the law. The problem with content scrapers is that it's so easy to hide your identity, so easy to scrape content, and so damned easy to be approved into advertising programs.
Fines should follow the money if they can't be levied on the party directly responsible. Advertisers can buy insurance and insurance companies can vet domains.
It means you can sue Google for damages, so it becomes Google's responsibility to prevent copyright theft. That way, you don't have to run around trying to chase down a phantom no-name dude on the internet.
And if you can keep the copyright-theft'd works off of Google, Bing, and other big name search engines... then you've basically won in this day and age.
--------
Yeah, it's an extreme measure. But it's pretty clear cut how Article 13 helps in this case. The link tax (Article 11) may also play a role, but it's less obvious IMO.
This isn't something that I've seriously looked into myself, but I thought that in the past, Google had a way for you to tell them about your new articles, to avoid this problem. Does that no longer exist/work?
You can “ping” them about new articles (which I still do), but if other websites ping them within a few seconds/minutes then you basically have a race condition over whose content gets indexed first. Notably, this is a bigger problem with Bing and Yandex, who don’t have the capacity to index the web as quickly as Google does.
For WordPress blogs, pings are sent immediately to ping servers. Google could reasonably crawl a new article within one second of the author clicking the publish button. So if you have automated pinging built into your blogging workflow, then there should be no problem. At least until high-frequency traders get involved in this market :)
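To make that concrete, here's a minimal sketch of what such a ping looks like under the hood: an XML-RPC weblogUpdates.ping call to a ping aggregator, which is the mechanism WordPress-style pinging builds on. The endpoint and blog details below are examples I've filled in, not anything from the article.

```python
import xmlrpc.client

PING_SERVER = "http://rpc.pingomatic.com/"  # example ping aggregator, not from the article

def ping_new_post(blog_name: str, post_url: str) -> bool:
    server = xmlrpc.client.ServerProxy(PING_SERVER)
    # weblogUpdates.ping(weblogName, weblogUrl) is the classic ping call
    result = server.weblogUpdates.ping(blog_name, post_url)
    # Ping servers conventionally answer with a struct containing an
    # 'flerror' flag; treat anything else as a failure
    return not result.get("flerror", True)

if __name__ == "__main__":
    ok = ping_new_post("Example Blog", "https://example.com/new-article")
    print("ping accepted" if ok else "ping rejected")
```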
Is this just not a problem in practice, or is there some technical barrier I'm not thinking of here? I'm imagining a system where you say "hey Google, I'm about to post content X" that you do right before it goes live on your site. No race condition.
If you ping Bing or Google before you publish, they’ll get a 404 and will take that as a sign that there is no content there. They also will wait longer before trying to reindex a page that previously returned a 404.
Make a visible but unlisted URL, ping search engines, wait 5 minutes, list / link the URL from the home page of your site, publish the URL to RSS. Solves the 404 problem, doesn't it?
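Roughly something like this, as a sketch of that sequence (all the helper functions are hypothetical stand-ins for whatever your CMS exposes; here they just print what they would do):

```python
import time

def make_url_live(url):        print(f"post is now reachable at {url} (200, but unlinked)")
def ping_search_engines(url):  print(f"pinging search engines about {url}")
def link_from_homepage(url):   print(f"linking {url} from the home page")
def add_to_rss_feed(url):      print(f"adding {url} to the RSS feed")

def publish_with_head_start(url, head_start_seconds=5 * 60):
    make_url_live(url)            # crawlers get a 200, not a 404
    ping_search_engines(url)      # they can index the original first
    time.sleep(head_start_seconds)
    link_from_homepage(url)       # only now is it discoverable on-site
    add_to_rss_feed(url)          # scrapers reading the feed see it last

publish_with_head_start("https://example.com/new-article")
```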
You can get them to crawl a new page in webmaster tools (or whatever name it has this week) and they will index it within a few seconds - only if it's an established site though; it doesn't work so well on new sites.
Of course, this doesn't prove your page has the original content though.
I’ve actually done something like this in the past. I was able to identify the IP address of a few content scraping websites, and then made my RSS feed return a huge number of gibberish pages when requested from these IPs. I’m not sure whether they eventually ran out of database space or what happened, but these sites did go down after a few days and stayed down.
https://www.ctrl.blog/entry/defacing-content-scrapers
(I didn’t attack these servers, by the by. They came to my servers and gobbled up all the auto-generated junk I served them all on their own.)
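For anyone curious what that looks like, here's a rough sketch of the idea, not the author's actual code; the scraper IPs, the junk generator, and the Flask app are all illustrative assumptions.

```python
import random
import string
from flask import Flask, Response, request

app = Flask(__name__)
SCRAPER_IPS = {"203.0.113.7", "198.51.100.23"}  # example addresses only

def junk_item() -> str:
    # One gibberish RSS item with a random title and a throwaway link
    title = "".join(random.choices(string.ascii_lowercase + " ", k=60))
    return f"<item><title>{title}</title><link>https://example.com/{random.randint(0, 10**9)}</link></item>"

def real_feed() -> str:
    # Placeholder for the normal, legitimate feed
    return "<rss><channel><title>Real feed</title></channel></rss>"

@app.route("/feed.xml")
def feed():
    if request.remote_addr in SCRAPER_IPS:
        # Known scrapers get thousands of auto-generated junk entries
        items = "".join(junk_item() for _ in range(10_000))
        body = f"<rss><channel>{items}</channel></rss>"
    else:
        body = real_feed()
    return Response(body, mimetype="application/rss+xml")
```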
However, in recent years it has become more and more difficult to identify the right IP address, as everything is hosted behind Cloudflare or the actual scraping is done from a short-term leased server with a unique IP.
Actually, shortly after I learned that Google switched to neural networks for search results, I noticed an old style of spam making a huge comeback: either Markov chain or neural network generated text, taken from related websites (I suspect they look at the top 10 websites for certain valuable terms).
This gibberish actually outranks legit content which refers to my content, sometimes even my own articles, especially when it is turned into a PDF.
Seems like it is easy to block ~250k webpages like:
> Seems like it is easy to block ~250k webpages like
I think it's unfair to look at these results and say "but it's so easy to block these". Google's time is best spent on solutions which will reliably and automatically block these, without going through fairly manual steps.
I did notice a decrease in quality in gmail's spam filter though. Increase in false positives and false negatives lately. I guess it's unlikely to be related...
To me it all seems to be the work of the same spammer(s). In such a case, apply some manual intelligence and wrap it up. It won't scale to all forms of spam, but if a simple regex can uncover 250k+ results in 10 minutes, a manual spam fighter can still block millions of pages a day (and warn the webhost, remove these flaky ads from their networks, etc.).
No doubt the recent machine learning hype has given spammers more advanced tools to avoid detection.
The problem is a matter of audience and missing communication, IMO.
On the old Usenet we communicated DIRECTLY with each other, so everyone had a reputation, we knew the original sources, and we valued them accordingly. In today's world we're disconnected: on one side only content producers, on the other side only consumers, and in the middle, at best, awkward platforms that are limited and that limit user-to-user communication, so it's harder.
IMVHO the medicine is going back to the communication era we lost; no other system has proved effective. Take a look at audio/video piracy as a good living example.
Never heard of this blog before. I really like the looks of the content. My only method of subscribing to anything is RSS. I'm definitely not adding it to my list after reading this. I wasn't going to make them any money by unblocking trackers but that Amazon tip jar[1] thing looks really interesting, I've never seen it before, I likely would have used it.
It's a horrible band-aid that hurts the functionality and reputation of RSS and shouldn't be encouraged by supporting the site setting a bad example. If they want to play the pointless cat-and-mouse game with the "thieves", fine, but crippling RSS in the process is not cool. They will just scrape the site instead and the RSS feed will be worse off for no gain.
Crippling? I'm as information addicted as the next guy, but unless I'm missing something fundamental about RSS, the only effect is that you as a user will notice new posts a few hours late. That seems more than reasonable to me.
I didn't look around at the rest of this blog, but I don't see anything here that's particularly time sensitive. I would have been just as interested in delaying RSS today or tomorrow.
That's not the point; it's making RSS a second-class citizen. That's bad enough on its own. It's doubly bad in this case because they went from treating it as special to treating it as second-class. If it had just been made equal, I wouldn't object. This is clearly a problem best solved by the search engines; if the world fixes it with this technique, there will be no push to fix it the right way, and RSS will be worse off than it already is.