
Could you share more details about "I block many AI crawlers from accessing code and photos"? Are the bots trying to access your Nextcloud instance? I'm also self-hosting a few services, including Nextcloud.


No, not Nextcloud; it's the photos on my website. They are licensed CC-BY-NC-ND-4.0, which genAI doesn't respect in any form.

I added these in nginx.conf:

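    # Goes in the http block. "~*" is a case-insensitive regex match,
    # so each pattern matches anywhere in the User-Agent header.
    map $http_user_agent $blocked_user_agent {
        default 0;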
        "~*AI2Bot" 1;
        "~*AI2Bot-Dolma" 1;
        "~*Amazonbot" 1;
        "~*anthropic-ai" 1;
        "~*anthropic.com" 1;
        "~*Applebot" 1;
        "~*Applebot-Extended" 1;
        "~*AwarioBot" 1;
        "~*AwarioRssBot" 1;
        "~*AwarioSmartBot" 1;
        "~*Bytespider" 1;
        "~*CCBot" 1;
        "~*ChatGPT-User" 1;
        "~*ClaudeBot" 1;
        "~*Claude-Web" 1;
        "~*cohere-ai" 1;
        "~*cohere-training-data-crawler" 1;
        "~*DataForSeoBot" 1;
        "~*Diffbot" 1;
        "~*DuckAssistBot" 1;
        "~*FacebookBot" 1;
        "~*FriendlyCrawler" 1;
        "~*Googlebot-Extended" 1;
        "~*Google-CloudVertexBot" 1;
        "~*Google-Extended" 1;
        "~*GoogleOther" 1;
        "~*GoogleOther-Image" 1;
        "~*GoogleOther-Video" 1;
        "~*GPTBot" 1;
        "~*iaskspider/2.0" 1;
        "~*ICC-Crawler" 1;
        "~*ImagesiftBot" 1;
        "~*img2dataset" 1;
        "~*ISSCyberRiskCrawler" 1;
        "~*Kangaroo Bot" 1;
        "~*Meltwater" 1;
        "~*Meta-ExternalAgent" 1;
        "~*Meta-ExternalFetcher" 1;
        "~*OAI-SearchBot" 1;
        "~*Omgili" 1;
        "~*Omgilibot" 1;
        "~*openai.com" 1;
        "~*PanguBot" 1;
        "~*peer39_crawler" 1;
        "~*PerplexityBot" 1;
        "~*PetalBot" 1;
        "~*Scrapy" 1;
        "~*Seekr" 1;
        "~*SemrushBot" 1;
        "~*SemrushBot-OCOB" 1;
        "~*Sentibot" 1;
        "~*Sidetrade indexer bot" 1;
        "~*Timpibot" 1;
        "~*TurnitinBot" 1;
        "~*VelenPublicWebCrawler" 1;
        "~*webmeup-crawler.com" 1;
        "~*Webzio-Extended" 1;
        "~*YouBot" 1;
    }
and then in each site's config:

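      location / {
        # $blocked_user_agent is 1 when the User-Agent matched the map
        if ($blocked_user_agent) {
            # "ncsa" is a named log_format that must be defined elsewhere
            access_log /var/log/nginx/blockedbot.log ncsa;
            return 401;
        }
        # ... rest of the location block ...
      }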

But it's far from perfect, since it only catches bots that identify themselves honestly in the User-Agent header. https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blo... is probably more thorough, but it was a tad too much for my needs.
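
One piece the snippets leave out: the ncsa name on the access_log line is not a built-in nginx format (only "combined" is predefined), so it has to be declared with log_format in the http block. A minimal sketch, assuming a combined-style layout; the field list here is an assumption, not necessarily the definition actually in use:

    # In the http block, next to the map. The name "ncsa" matches the
    # access_log line; the field layout below is an assumed example.
    log_format ncsa '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent"';

Including $http_user_agent in the format means blockedbot.log records which crawler was turned away.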


Thanks!



