No, nor can we just do it by IP. The bots are MUCH more sophisticated than that. More often than not, it's a cooperating distributed net of hundreds of bots, coming from multiple AWS, Azure, and GCP addresses. So they can pop up anywhere, and that IP could wind up being a real customer next week. And they're only recognizable as a botnet with sophisticated logic looking at the gestalt of web logs.
We do use a 3rd party service to help with this - but that on its own is imposing a 5- to 6-digit annual expense on our business.
> Our annual revenue from the site would put us on the list of top 100 ecommerce sites
and you're sweating a 5- to 6-digit annual expense?
> all our pricing is custom as negotiated with each customer.
> there's a huge number of products (low millions) and many thousands of distinct catalogs
Surely a business model where every customer has individually negotiated pricing costs a whole lot to implement; further, it gives each customer plenty of incentive to try to learn what other customers are paying for the same products. Given how tiny the cost of fighting bots is by comparison, your complaints in these threads seem pretty ridiculous.
> More often than not, it's a cooperating distributed net of hundreds of bots, coming from multiple AWS, Azure, and GCP addresses.
those are only the low-effort/cheap ones; the more advanced scrapers use residential proxies (people's pwned home routers, or PCs where shady VPN software has turned them into a proxy) to appear to come from legitimate residential last-mile broadband netblocks belonging to Comcast, Verizon, etc.
Google "residential proxies for sale" for the tip of an iceberg of shady grey-market shit.
There's a lot of metadata available for IPs, and that metadata can be used to aggregate clusters of IPs, and that in turn can be datamined for trending activity, which can be used to sift out abusive activity from normal browsing.
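To make that concrete, here's a minimal sketch of the aggregation step, assuming you've already enriched each request with an ASN lookup (from a GeoIP/whois database); the log records and the request-rate threshold are made up for illustration:

```python
from collections import defaultdict

# Hypothetical enriched log records: (ip, asn, requests_in_window).
# In practice the ASN would come from a GeoIP/whois lookup.
LOG = [
    ("13.57.1.10", "AS16509 Amazon", 4200),
    ("13.57.8.22", "AS16509 Amazon", 3900),
    ("34.68.4.5",  "AS396982 Google Cloud", 4100),
    ("73.92.11.8", "AS7922 Comcast", 35),
    ("71.10.2.3",  "AS20115 Charter", 28),
]

def flag_suspicious_asns(log, rate_threshold=1000):
    """Aggregate request volume per ASN cluster and flag clusters whose
    average per-IP request rate far exceeds plausible human browsing."""
    by_asn = defaultdict(list)
    for ip, asn, reqs in log:
        by_asn[asn].append(reqs)
    flagged = {}
    for asn, counts in by_asn.items():
        avg = sum(counts) / len(counts)
        if avg > rate_threshold:
            flagged[asn] = round(avg, 1)
    return flagged

print(flag_suspicious_asns(LOG))
```

A real system would trend these clusters over time rather than use a fixed threshold, but the shape is the same: enrich, aggregate, then sift outliers from baseline browsing.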
If you're dropping 6 figs annually on this and it's still frustrating, I'd be interested in talking with you. I built an abuse prediction system out of this approach for a small company a few years back, it worked well and it'd be cool to revisit the problem.
Yes. And if I could get the perpetrators to raise their hands so I could work out an API for them, it would be the path of least resistance. But they take great pains to be anonymous, although I know from circumstantial evidence that at least a good chunk of it is various competitors (or services acting on behalf of competitors) scraping price data.
IANAL, but I also wonder if, given that I'd be designing something specifically for competitors to query our prices in order to adjust their own prices, this would constitute some form of illegal collusion.
What seems to actually work is to identify the bots and, instead of tipping your hand by blocking them, quietly poison the data. Critically, it needs to be subtle enough that it's not immediately obvious the data is manipulated. It should look like a plausible response, only with some random changes.
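One way to keep the poisoned responses plausible is to make the skew deterministic per (client, product), so repeated scrapes see consistent numbers rather than obvious random noise. A rough sketch of that idea (the function name and the ±8% skew are illustrative, not any real system):

```python
import hashlib

def poisoned_price(real_price: float, client_key: str, sku: str,
                   max_skew: float = 0.08) -> float:
    """For a session flagged as a bot, return a deterministically skewed
    price instead of blocking. The skew is derived from a hash of
    (client, sku), so re-scraping yields stable -- but wrong -- data."""
    h = hashlib.sha256(f"{client_key}:{sku}".encode()).digest()
    # Map the first 4 bytes of the hash to a factor in [-max_skew, +max_skew].
    frac = int.from_bytes(h[:4], "big") / 0xFFFFFFFF
    skew = (frac * 2 - 1) * max_skew
    return round(real_price * (1 + skew), 2)
```

Because the output is stable per client and product, a scraper comparing two runs sees nothing suspicious, whereas a hard block would immediately reveal that detection happened.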
It's in their interest. I've done a lot of scraping, and scraped pages are not easy to build a reliable process on. Why parse a human interface when there's an application interface available?
TLS fingerprinting is one of the ways minority browsers and OS setups get unfairly excluded. I have an intense hatred of Cloudflare for popularising that. Yes, there are ways around it, but I still don't think I should have to fight to use the user-agent I want.
I don't want to say tough cookies, but if OP's characterization isn't hyperbole ("the lion's share of our operational costs are from needing to scale to handle the huge amount of bot traffic"), then it can be a situation where you have to choose between 1) cutting off a huge chunk of bots, upsetting a tiny percentage of users, and improving the service for everyone else, or 2) simply not providing the service at all due to costs.
I don't think it's likely to cause issues if implemented properly. Realistically you can't build a list of "good" TLS fingerprints, because there are too many browser/device combinations; so in my experience most sites just block "bad" ones known to belong to popular request libraries and such.
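That denylist approach is trivial to sketch. The JA3-style hashes below are placeholders, not real fingerprints of any library:

```python
# Minimal sketch of "block known-bad, allow everything else" TLS
# fingerprinting. The hashes are made-up placeholders, not actual
# JA3 values for any real client.
KNOWN_BAD_JA3 = {
    "deadbeefdeadbeefdeadbeefdeadbeef",  # hypothetical: stock HTTP library
    "cafebabecafebabecafebabecafebabe",  # hypothetical: headless client
}

def allow_connection(ja3_hash: str) -> bool:
    """Denylist, not allowlist: unknown fingerprints (minority browsers,
    unusual OS builds) pass through untouched."""
    return ja3_hash not in KNOWN_BAD_JA3
```

The key property is that a fingerprint the operator has never seen before is allowed by default, which is why a properly implemented denylist doesn't exclude minority browsers the way an allowlist would.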