> It tells bots what they should do while directly crawling your site. But if a search engine gets to a URL some other way—for example if it follows a link from somewhere outside your site—it will still index that page.
I must confess I've been sceptical of robots.txt for a very long time (if I want to stop bots, I serve them HTTP 403 Forbidden using .htaccess or similar).
Be that as it may, it appears I'm also confused about what robots.txt does and doesn't do.
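For context, the blocking I have in mind happens at the server, not via robots.txt. Here's a minimal sketch of the idea in Python's stdlib rather than my actual .htaccess rules, with the blocked user-agent strings purely illustrative:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Purely illustrative user-agent substrings to refuse
BLOCKED_AGENTS = ("EvilBot", "BadScraper")

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        if any(bad in ua for bad in BLOCKED_AGENTS):
            # Unlike robots.txt, this is enforced server-side:
            # the bot never receives the page content at all.
            self.send_error(403)
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.end_headers()
        self.wfile.write(b"Hello, polite visitor.\n")

HTTPServer(("", 8000), Handler).serve_forever()
```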
Assuming you're correct: let's say I run EvilBot, which scrapes sites, and I want to scrape your site example.com, but your robots.txt allows only Googlebot and disallows everyone else. Am I really OK to:
1. scrape the SERPs from google.com that mention your site ("site:example.com"), then
2. use that list of URIs to point my EvilBot at your site and scrape it, without needing to touch or respect your robots.txt, since I got the list of URIs on your site from Google, not by scraping example.com directly?
Your step 1 is enough for URLs to be indexed. Even a well-behaved search engine does not need to visit your site to index a URL, including whatever anchor text pointed at it.
If the crawler does then visit your site, it will see your robots.txt and (if well-behaved) obey it and not crawl the contents of the page at that URL. But this does not mean it will remove the URL itself from its index.
Again: robots.txt is intended to control crawler behavior, not search index visibility.
Google's documentation page gives a pretty good overview of this distinction.
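To make the distinction concrete, here is roughly what a compliant crawler's check looks like, sketched with Python's stdlib robotparser (the domain, path, and agent name are placeholders):

```python
from urllib import robotparser

# Placeholder site; a real crawler repeats this for each host it visits.
rp = robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

url = "https://example.com/some/page.html"
if rp.can_fetch("EvilBot", url):
    print("may crawl:", url)
else:
    # A well-behaved crawler skips the fetch -- but nothing here removes
    # the URL (or the anchor text pointing at it) from an existing index.
    print("must not crawl:", url)
```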
> Again: robots.txt is intended to control crawler behavior, not search index visibility.
I'm obviously not asking the question clearly: I want to stop bots from crawling (it's the scraping that annoys me), not stop search engines from listing URIs.
If I want to completely stop a bot from crawling my site (in the sense of "retrieving my content"), won't robots.txt achieve that, even when the bot has obtained a valid list of my URIs (but not the pages' contents) from a third-party source?
Let's say I email you a list of URIs on my site, and my robots.txt forbids all crawlers. Are you allowed to give the list of URIs to your bot and retrieve the content?
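In code terms, I'm asking whether a well-behaved bot is expected to run a check like the sketch below even for URLs it obtained out-of-band; the function and agent name are made up for illustration:

```python
import urllib.request
from urllib import robotparser
from urllib.parse import urlsplit

def polite_fetch(url: str, agent: str = "SomeBot"):
    """Fetch url only if the target host's robots.txt allows this agent.

    How the URL was obtained (an email, Google's SERPs, a sitemap) makes
    no difference: the check is against the host being fetched from, and
    only the bot's own good behavior enforces it.
    """
    parts = urlsplit(url)
    rp = robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    if not rp.can_fetch(agent, url):
        return None  # disallowed: a compliant bot stops here
    req = urllib.request.Request(url, headers={"User-Agent": agent})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```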