And you just know they'll gladly bill you for egress charges for their own bot traffic, too.
EDIT: Actually, this is an excellent question. By default, traffic from these bots would likely appear to come from "the internet" and thus be subject to egress charges for data transfers. Since all three major cloud providers also have significant interests in AI, wouldn't this amount to a sort of "silent" price increase, or a form of exploitative revenue pumping? There's nothing stopping Google, Microsoft/OpenAI, or Amazon from sending an army of bots against your sites, scraping the data, and then sticking you with the charges for their own bots' traffic. I'd be curious whether anyone has read the T&Cs behind their rate cards closely enough to confirm that's the case, or has proof in their billing metrics.
---
Original post continues below:
One topic of conversation I think is worth having in light of this is why, as general industry practice, we still agree to charge for bandwidth consumed instead of bandwidth available. Bits are cheap in the grand scheme of things, practically free, since nearly all of the associated costs are in the hardware infrastructure and the human labor of setup and maintenance - the actual cost per bit transmitted is ridiculously small, far too small to be practical to bill for on its own.
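To put a rough number on that, here's a quick back-of-envelope sketch. The $1,000/month for a flat-rate 10 Gbps port is a made-up round figure, and $0.09/GB is only in the ballpark of typical published cloud egress rates - the point is the order-of-magnitude gap, not the exact numbers.

```python
# Back-of-envelope sketch: cost per GB of a flat-rate port vs. metered egress.
# PORT_COST_PER_MONTH is a hypothetical round number; METERED_RATE_PER_GB is
# roughly what major clouds list for internet egress.
PORT_GBPS = 10
PORT_COST_PER_MONTH = 1_000.0          # hypothetical flat-rate transit bill
METERED_RATE_PER_GB = 0.09             # ballpark cloud egress list price

seconds_per_month = 30 * 24 * 3600
bytes_per_month = PORT_GBPS * 1e9 / 8 * seconds_per_month
gb_per_month = bytes_per_month / 1e9   # decimal GB

flat_cost_per_gb = PORT_COST_PER_MONTH / gb_per_month
print(f"Capacity available: {gb_per_month:,.0f} GB/month")
print(f"Flat-rate cost:     ${flat_cost_per_gb:.5f}/GB at full utilization")
print(f"Metered markup:     {METERED_RATE_PER_GB / flat_cost_per_gb:,.0f}x")
```

Even at a fraction of full utilization, the gap between what the port costs and what metered egress bills for is a couple of orders of magnitude.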
It seems to me a better solution is to go back to charging for capacity instead of consumption, at least in an effort to reduce consumption charges for hosted projects. In the meantime, I'm 100% behind blocking entire ASNs and IP blocks from accessing websites or services in an effort to reduce abuse. I know a prior post about blocking the entirety of AWS ingress traffic drew a lot of skepticism and flak from the HN community about its utility, but now more than ever it seems highly relevant to those of us managing infrastructure.
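For anyone who wants to try it, here's a minimal sketch of that approach, assuming nftables and targeting AWS specifically. The URL is AWS's published address-range feed; filtering on the "EC2" service tag (rather than everything under "AMAZON") is just my guess about where scraper traffic would originate, so adjust to taste.

```python
# Sketch: build an nftables drop set from AWS's published IP ranges.
# Pipe the output to a file and load it with `nft -f <file>`.
import json
import urllib.request

RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

def fetch_prefixes(url: str = RANGES_URL) -> list[str]:
    """Return the IPv4 CIDR blocks AWS advertises for the EC2 service."""
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    # "EC2" covers most scraper-style workloads; use "AMAZON" for everything.
    return sorted({p["ip_prefix"] for p in data["prefixes"] if p["service"] == "EC2"})

def as_nftables_rules(prefixes: list[str]) -> str:
    """Emit an nftables table with a named set and a drop rule on input."""
    elements = ", ".join(prefixes)
    return (
        "table inet filter {\n"
        "  set aws_v4 { type ipv4_addr; flags interval;\n"
        f"    elements = {{ {elements} }}\n"
        "  }\n"
        "  chain input { type filter hook input priority 0;\n"
        "    ip saddr @aws_v4 drop\n"
        "  }\n"
        "}\n"
    )

if __name__ == "__main__":
    print(as_nftables_rules(fetch_prefixes()))
```

The same pattern works for any provider that publishes its ranges, or for prefixes pulled from an ASN lookup; only the fetch-and-parse step changes.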
Also, as an aside: all the more reason not to deploy SRV records for home-hosted services, since they'd just advertise whatever non-standard ports you're running on. I suspect these bots are only querying the standard HTTP/S ports, so my gut (but NOT data - I purposely don't collect analytics, even at home, so I have NO HARD EVIDENCE FOR THIS CLAIM) says that having nothing directly reachable on 80/443 will greatly limit potential scrapers.
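For illustration, here's roughly what a scraper that did bother with SRV lookups could learn, using the dnspython package. The _https._tcp record name and example.com domain are hypothetical stand-ins; the point is that not publishing the record is exactly what makes this lookup come up empty.

```python
# Sketch: what a scraper that *does* bother with SRV lookups would learn.
# Requires dnspython (pip install dnspython); record name and domain are
# hypothetical examples, not a real deployment.
import dns.resolver

def discover_service_port(domain: str) -> None:
    """Print any non-standard ports advertised via an _https._tcp SRV record."""
    try:
        answers = dns.resolver.resolve(f"_https._tcp.{domain}", "SRV")
    except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
        print(f"{domain}: no SRV record published - port stays unadvertised")
        return
    for rdata in answers:
        print(f"{domain}: service advertised at {rdata.target}:{rdata.port}")

if __name__ == "__main__":
    discover_service_port("example.com")
```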