> It tells bots what they should do while directly crawling your site. But if a search engine gets to a URL some other way—for example if it follows a link from somewhere outside your site—it will still index that page.
I must confess I've been sceptical of robots.txt for a very long time (if I want to stop bots, I serve them HTTP 403 Forbidden using .htaccess or similar).
Be that as it may, it appears I'm also confused about what robots.txt does and doesn't do.
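For context, the blocking I have in mind happens at the server, not via robots.txt. Here's a minimal sketch of the idea in Python's stdlib rather than my actual .htaccess rules, with the blocked user-agent strings purely illustrative:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Purely illustrative user-agent substrings to refuse
BLOCKED_AGENTS = ("EvilBot", "BadScraper")

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        if any(bad in ua for bad in BLOCKED_AGENTS):
            # Unlike robots.txt, this is enforced server-side:
            # the bot never receives the page content at all.
            self.send_error(403)
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.end_headers()
        self.wfile.write(b"Hello, polite visitor.\n")

HTTPServer(("", 8000), Handler).serve_forever()
```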
Assuming you're correct: let's say I run EvilBot, which scrapes sites, and I want to scrape your site example.com, but your robots.txt allows only Googlebot and disallows everyone else. Am I really OK to:
1. scrape the SERPs from google.com that mention your site ("site:example.com"), then
2. use that list of URIs to point my EvilBot at your site and scrape it, without needing to touch or respect your robots.txt, since I got the list of URIs on your site from Google, not by scraping example.com directly?
Your step 1 is enough for URLs to be indexed. Even a well-behaved search engine does not need to visit your site to index a URL, including whatever anchor text pointed at it.
If the crawler does then visit your site, it will see your robots.txt and (if well-behaved) obey it and not crawl the contents of the page at that URL. But this does not mean it will remove the URL itself from its index.
Again: robots.txt is intended to control crawler behavior, not search index visibility.
Google's documentation page gives a pretty good overview of this distinction.
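To make the distinction concrete, here is roughly what a compliant crawler's check looks like, sketched with Python's stdlib robotparser (the domain, path, and agent name are placeholders):

```python
from urllib import robotparser

# Placeholder site; a real crawler repeats this for each host it visits.
rp = robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

url = "https://example.com/some/page.html"
if rp.can_fetch("EvilBot", url):
    print("may crawl:", url)
else:
    # A well-behaved crawler skips the fetch -- but nothing here removes
    # the URL (or the anchor text pointing at it) from an existing index.
    print("must not crawl:", url)
```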
> Again: robots.txt is intended to control crawler behavior, not search index visibility.
I'm obviously not asking the question clearly: I want to stop bots from crawling (it's the scraping that annoys me), not stop search engines from listing URIs.
If I want to completely stop a bot from crawling my site (in the sense of "retrieving my content"), won't robots.txt achieve that, even when the bot has obtained a valid list of my URIs (but not the pages' contents) from a third-party source?
Let's say I email you a list of URIs on my site, and my robots.txt forbids all crawlers. Are you allowed to give the list of URIs to your bot and retrieve the content?
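In code terms, I'm asking whether a well-behaved bot is expected to run a check like the sketch below even for URLs it obtained out-of-band; the function and agent name are made up for illustration:

```python
import urllib.request
from urllib import robotparser
from urllib.parse import urlsplit

def polite_fetch(url: str, agent: str = "SomeBot"):
    """Fetch url only if the target host's robots.txt allows this agent.

    How the URL was obtained (an email, Google's SERPs, a sitemap) makes
    no difference: the check is against the host being fetched from, and
    only the bot's own good behavior enforces it.
    """
    parts = urlsplit(url)
    rp = robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    if not rp.can_fetch(agent, url):
        return None  # disallowed: a compliant bot stops here
    req = urllib.request.Request(url, headers={"User-Agent": agent})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```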