God... I literally just coded up three different honeypots for this exact problem on my site, https://golfcourse.wiki, because LLM scrapers are a constant problem. I added a stupid reCAPTCHA to the sign-up form after literally 10,000 fake users were created by bots, averaging about 50 per day, and I have to say, reCAPTCHA was surprisingly cumbersome to set up.
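For anyone who hasn't built one: the form-honeypot idea is just a field that's hidden with CSS, so a human never fills it in, while naive bots auto-fill everything. A minimal sketch of that (Flask here; the route and field names are made up, not my actual code):

```python
from flask import Flask, request, abort

app = Flask(__name__)

@app.route("/signup", methods=["POST"])
def signup():
    # "website" is a hidden field a real user can't see,
    # so any value in it means a bot filled out the form.
    if request.form.get("website"):
        abort(400)  # or pretend success and quietly drop the account
    # ... real signup logic goes here ...
    return "ok"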
It's awful, and it was costing me non-trivial amounts of money just from the constant pinging at all hours, for thousands of pages that absolutely do not need to be scraped. Which is just insane, because I actively design my robots.txt to direct bots to the pages that are actually worth scraping.
So far so good with the honeypots, but I'll probably create more and clamp down harder on robots.txt, switching it to a whitelist instead of a blacklist. I'm even thinking of throwing a honeypot URL directly into sitemap.xml to bait any bot that isn't following robots.txt.
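The whitelist version is just default-deny: disallow everything, then allow only the paths worth crawling. Most well-behaved crawlers honor Allow rules (they're in RFC 9309), and it conveniently means any trap URL is automatically off-limits. Something like this, with made-up paths:

```
User-agent: *
Disallow: /
Allow: /course/
Allow: /sitemap.xml

Sitemap: https://golfcourse.wiki/sitemap.xml
```

The sitemap trap is then a page that's disallowed by robots.txt but listed in sitemap.xml and linked nowhere else, so the only clients that ever request it are bots ignoring robots.txt. Rough sketch, again Flask, with the trap path and in-memory set as placeholders (you'd want a real blocklist, fail2ban, a WAF rule, etc.):

```python
from flask import Flask, request

app = Flask(__name__)
banned_ips = set()  # placeholder; persist this somewhere real

@app.before_request
def reject_banned():
    # Short-circuit every request from an IP that already hit the trap.
    if request.remote_addr in banned_ips:
        return "banned", 403

@app.route("/bot-trap/")
def bot_trap():
    # Nothing legitimate ever lands here, so ban on sight.
    banned_ips.add(request.remote_addr)
    return "banned", 403
```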
It's really, really ridiculous.