Hacker News

That reminds me of an old tweet from someone to Matt Cutts (at least I think it was Cutts). MC had tweeted a reply to someone about how they don't allow scrapers and sites that just reuse content from other sites. Someone else replied to him with a screenshot of Google's search results page that had scraped and reused something from somewhere and said "I think I found one of those sites for you". (Hoping I remember that at least somewhat close to how it happened! It's a somewhat faded memory from several years ago)


Funny, I guess. But pages that scrape short excerpts from other sites and link back to them in order to provide additional value to a user are absolutely allowed by Google. Sites that scrape large portions of text without crediting the source aren't allowed, and doing so may be illegal.

An example of an allowed page doing this would be a "recipe roundup" where a site lists 10 or so recipes in a particular theme (like Best Keto Breakfast Recipes) and each one links back to the original recipe with a short description.

I'm not surprised this sort of "gotcha" was popular on Twitter though (and is why I don't use Twitter)


Only, Google rarely credits the sources they scraped for infoboxes.


For things that are considered facts, they don't. But for Featured Snippets they do. If something is a widely recognized fact, you don't need to credit a source in your own writing either.


Google gets "facts" wrong pretty often because it gathers them by scraping. There are too many possible facts for them to have all been verified by a human.


Is some random celebrity's net worth a "widely recognized fact"?

https://theoutline.com/post/1399/how-google-ate-celebritynet...


You did not understand my comment - I said Featured Snippets are NOT necessarily facts and DO link to the source.

Featured Snippets drive a lot of traffic to sites, and as a publisher I instruct my writers on how to craft their sentences to make it more likely we will get a Featured Snippet. Yes, Featured Snippets probably reduced the CTR for pages that were previously ranking #1, but that comes with an improvement in user experience.

Part of the confusion here stems from the fact that you used an ambiguous term like "infobox," which has no precise meaning and which I interpreted as the box that appears in the sidebar when Googling certain topics (like somebody's name). That box typically pulls data from Wikipedia and Wikidata as FACTS.

So if by "infobox" you meant Featured Snippet, then your comment was incorrect, because Google does in fact always link back to the original site (maybe in the past they sometimes didn't? I can't speak to that, but I'm doubtful).

Also I'll just say this - the website in question in that article was relying on Google for all their traffic for a very particular type of query. That is just stupid. I'm a publisher and wouldn't be able to sleep at night if my site was like theirs.

Technically, net worth is a fact, and that's why Google was trying to put it in their Knowledge Graph. But since it's not typically public knowledge, it becomes more of an opinion. Hence, it's a Featured Snippet with credit to the source. That site is just mad they don't always get the Featured Snippet, so their traffic went down. They weren't complaining when Google was ranking them #1, were they?


> don't allow scrapers and sites that just reuse content from other sites.

I remember there used to be ranking penalties against scraper sites. I haven't been following Google's SEO rules in a while, so does anyone know what happened to them?


It's essentially the 'duplicate content' penalty/penalties. [1]

It's pretty ironic really... take a look at this scenario...

Let's say we have Website A, which we'll consider a "legitimate" website producing unique, quality content, and Website B, which is not, but which has some clever people running its scrapers/submissions.

The developer responsible for Website B has identified that Website A produces quality content they want to steal, so they set up a scraper to check for updates on Website A. When it finds one, it immediately scrapes the content and reposts it on Website B. Then they update their sitemap and submit it to Google (or use one of the various other ways to 'alert' Google of a new page or updated content), which triggers a crawl of that page/website.

If this crawl/index of Website B happens before Website A - who does Google see as the producer of that content and who is now the "duplicate"?
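The polling step described above could be sketched roughly as follows. This is a minimal illustration, not a real scraper: the `fetch` function is a hypothetical stand-in for an actual HTTP request, and the URLs are made up. The core trick is just hashing the page so the scraper only acts (repost, resubmit sitemap) when Website A actually publishes something new.

```python
import hashlib
from typing import Optional, Tuple

def fetch(url: str) -> str:
    """Hypothetical fetch; a real scraper would use urllib or requests here.
    Stubbed out for illustration: pretend Website A served this page."""
    return "<html><body>Unique quality article from Website A</body></html>"

def content_fingerprint(html: str) -> str:
    """Hash the page body so we can cheaply detect when it changes."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def check_for_update(url: str, last_seen: Optional[str]) -> Tuple[bool, str]:
    """One polling cycle: return (changed, fingerprint).
    On changed=True, Website B would repost the content and ping Google."""
    page = fetch(url)
    fp = content_fingerprint(page)
    return fp != last_seen, fp

# First pass: no fingerprint on record, so the content looks "new".
changed, fp = check_for_update("https://website-a.example/post/1", None)
# Second pass: same page, same hash, nothing to do.
changed_again, _ = check_for_update("https://website-a.example/post/1", fp)
```

The race the parent comment describes lives entirely in what happens after `changed` comes back `True`: if Website B resubmits its sitemap and gets crawled before Google recrawls Website A, Google may index the copy first and treat the original as the duplicate.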

Whack-a-mole.

1: https://backlinko.com/hub/seo/duplicate-content


The top search result was Google scraping from Wikipedia.


That did happen, as you describe it.



