So, people feel some sympathy for the author of this submission because they can see work being taken without the author being paid for it.
> Defending copyrights as an individual isn’t an easy matter.
But the same people seem to fucking hate Article 11, which would forbid these sites from copying the entire article and would give the author of the material real options to take action.
I don't think Article 11 or 13 is particularly good, but this submission does a good job of showing why some people think they're needed.
Existing copyright law already forbids you from copying and redistributing the creative works of others. More laws won't stop anyone from violating the law. The problem with content scrapers is that it's so easy to hide your identity, so easy to scrape content, and so damned easy to be approved into advertising programs.
Fines should follow the money if they can't be levied on the party directly responsible. Advertisers can buy insurance and insurance companies can vet domains.
It means you can sue Google for damages, so it becomes Google's responsibility to prevent copyright theft. That way, you don't have to run around trying to chase down a phantom no-name dude on the internet.
And if you can keep the copyright-theft'd works off of Google, Bing, and other big name search engines... then you've basically won in this day and age.
--------
Yeah, it's an extreme measure. But it's pretty clear cut how Article 13 helps in this case. The link tax (Article 11) may also play a role, but it's less obvious IMO.
This isn't something that I've seriously looked into myself, but I thought that in the past, Google had a way for you to tell them about your new articles, to avoid this problem. Does that no longer exist/work?
You can “ping” them about new articles (which I still do), but if other websites ping them within a few seconds/minutes then you basically have a race condition over whose content gets indexed first. Notably, this is a bigger problem with Bing and Yandex, who don’t have the capacity to index the web as quickly as Google does.
For WordPress blogs, pings are sent immediately to ping servers. Google could reasonably crawl a new article within one second of the author clicking the publish button. So if you have automated pinging built into your blogging workflow, then there should be no problem. At least until high-frequency traders get involved in this market :)
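To make that concrete, here's a minimal sketch of what such a ping looks like under the hood: an XML-RPC weblogUpdates.ping call to a ping aggregator, which is the mechanism WordPress-style pinging builds on. The endpoint and blog details below are examples I've filled in, not anything from the article.

```python
import xmlrpc.client

PING_SERVER = "http://rpc.pingomatic.com/"  # example ping aggregator, not from the article

def ping_new_post(blog_name: str, post_url: str) -> bool:
    server = xmlrpc.client.ServerProxy(PING_SERVER)
    # weblogUpdates.ping(weblogName, weblogUrl) is the classic ping call
    result = server.weblogUpdates.ping(blog_name, post_url)
    # Ping servers conventionally answer with a struct containing an
    # 'flerror' flag; treat anything else as a failure
    return not result.get("flerror", True)

if __name__ == "__main__":
    ok = ping_new_post("Example Blog", "https://example.com/new-article")
    print("ping accepted" if ok else "ping rejected")
```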
Is this just not a problem in practice, or is there some technical barrier I'm not thinking of here? I'm imagining a system where you say "hey Google, I'm about to post content X" that you do right before it goes live on your site. No race condition.
If you ping Bing or Google before you publish, they’ll get a 404 and will take that as a sign that there is no content there. They also will wait longer before trying to reindex a page that previously returned a 404.
Make a visible but unlisted URL, ping search engines, wait 5 minutes, list / link the URL from the home page of your site, publish the URL to RSS. Solves the 404 problem, doesn't it?
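Roughly something like this, as a sketch of that sequence (all the helper functions are hypothetical stand-ins for whatever your CMS exposes; here they just print what they would do):

```python
import time

def make_url_live(url):        print(f"post is now reachable at {url} (200, but unlinked)")
def ping_search_engines(url):  print(f"pinging search engines about {url}")
def link_from_homepage(url):   print(f"linking {url} from the home page")
def add_to_rss_feed(url):      print(f"adding {url} to the RSS feed")

def publish_with_head_start(url, head_start_seconds=5 * 60):
    make_url_live(url)            # crawlers get a 200, not a 404
    ping_search_engines(url)      # they can index the original first
    time.sleep(head_start_seconds)
    link_from_homepage(url)       # only now is it discoverable on-site
    add_to_rss_feed(url)          # scrapers reading the feed see it last

publish_with_head_start("https://example.com/new-article")
```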
You can get them to crawl a new page in webmaster tools (or whatever name it has this week) and they will index it within a few seconds - only if it's an established site though; it doesn't work so well on new sites.
Of course, this doesn't prove your page has the original content though.
I’ve actually done something like this in the past. I was able to identify the IP address of a few content scraping websites, and then made my RSS feed return a huge number of gibberish pages when requested from these IPs. I’m not sure whether they eventually ran out of database space or what happened, but these sites did go down after a few days and stayed down.
https://www.ctrl.blog/entry/defacing-content-scrapers
(I didn’t attack these servers, by the by. They came to my servers and gobbled up all the auto-generated junk I served them all on their own.)
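For anyone curious what that looks like, here's a rough sketch of the idea, not the author's actual code; the scraper IPs, the junk generator, and the Flask app are all illustrative assumptions.

```python
import random
import string
from flask import Flask, Response, request

app = Flask(__name__)
SCRAPER_IPS = {"203.0.113.7", "198.51.100.23"}  # example addresses only

def junk_item() -> str:
    # One gibberish RSS item with a random title and a throwaway link
    title = "".join(random.choices(string.ascii_lowercase + " ", k=60))
    return f"<item><title>{title}</title><link>https://example.com/{random.randint(0, 10**9)}</link></item>"

def real_feed() -> str:
    # Placeholder for the normal, legitimate feed
    return "<rss><channel><title>Real feed</title></channel></rss>"

@app.route("/feed.xml")
def feed():
    if request.remote_addr in SCRAPER_IPS:
        # Known scrapers get thousands of auto-generated junk entries
        items = "".join(junk_item() for _ in range(10_000))
        body = f"<rss><channel>{items}</channel></rss>"
    else:
        body = real_feed()
    return Response(body, mimetype="application/rss+xml")
```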
However, in recent years it has become more and more difficult to identify the right IP address, as everything is hosted behind Cloudflare or the actual scraping is done from a short-term leased server with a unique IP.
Actually, shortly after I learned that Google switched to neural networks for search results, I noticed an old style of spam making a huge comeback: either Markov chain or neural network generated text, taken from related websites (I suspect they look at the top 10 websites for certain valuable terms).
This gibberish actually outranks legit content which refers to my content, sometimes even my own articles, especially when it is turned into a PDF.
Seems like it is easy to block ~250k webpages like:
> Seems like it is easy to block ~250k webpages like
I think it's unfair to look at these results and say "but it's so easy to block these". Google's time is best spent on solutions which will reliably and automatically block these, without going through fairly manual steps.
I did notice a decrease in quality in gmail's spam filter though. Increase in false positives and false negatives lately. I guess it's unlikely to be related...
To me it all seems to be the work of the same spammer(s). In such a case, apply some manual intelligence and wrap it up. It won't scale to all forms of spam, but if a simple regex can uncover 250k+ results in 10 minutes, a manual spam fighter can still block millions of pages a day (and warn the webhost, remove these flaky ads from their networks, etc.).
No doubt the recent machine learning hype has given spammers more advanced tools to avoid detection.
The problem is a matter of audience and missing communication, IMO.
On the old Usenet we communicated DIRECTLY with each other, so everyone had a reputation, we knew the original sources, and we valued them accordingly. In today's world we're disconnected: on one side only content producers, on the other side only consumers, and in the middle, at best, awkward platforms that are limited and that limit user-to-user communication, so it's harder.
IMVHO the medicine is going back to the communication era we lost; no other system has proved effective. Take a look at audio/video piracy as a good living example.
Never heard of this blog before. I really like the looks of the content. My only method of subscribing to anything is RSS. I'm definitely not adding it to my list after reading this. I wasn't going to make them any money by unblocking trackers but that Amazon tip jar[1] thing looks really interesting, I've never seen it before, I likely would have used it.
It's a horrible band-aid that hurts the functionality and reputation of RSS and shouldn't be encouraged by supporting the site setting a bad example. If they want to play the pointless cat-and-mouse game with the "thieves", fine, but crippling RSS in the process is not cool. They will just scrape the site instead and the RSS feed will be worse off for no gain.
Crippling? I'm as information addicted as the next guy, but unless I'm missing something fundamental about RSS, the only effect is that you as a user will notice new posts a few hours late. That seems more than reasonable to me.
I didn't look around at the rest of this blog, but I don't see anything here that's particularly time sensitive. I would have been just as interested in delaying RSS today or tomorrow.
That's not the point; it's making RSS a second-class citizen. That's bad enough on its own. It's doubly bad in this case because they went from treating it as special to treating it as second-class. If it had just been made equal, I wouldn't object. This is clearly a problem best solved by the search engines; if the world fixes it with this technique, there will be no push to fix it the right way, and RSS will be worse off than it already is.