Then I'll just add 3 million bots to the network (or just enough to reach about 50%) and I can win the A/B test against an honest client most of the time.
It's an arms race, but this is mostly a question of rate limiting account creation, assigning a trustworthiness score to different accounts, some network analysis to detect coordinated accounts, and having some trusted accounts (run by the project) that can help double check results. Once an account submits poisoned data, you can detect it after the fact (e.g. user-reported spam) and then block (or, more likely, shadow ban) the malicious account.
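Roughly the kind of bookkeeping I have in mind, as a sketch; every number and threshold here is an illustrative placeholder, not a recommendation:

```python
# Minimal sketch of per-account trust bookkeeping.
from dataclasses import dataclass
import time

@dataclass
class Account:
    created_at: float
    trust: float = 0.1          # new accounts start with near-zero trust
    shadow_banned: bool = False

class TrustLedger:
    SIGNUPS_PER_IP_PER_DAY = 3  # rate-limit account creation
    BAN_THRESHOLD = 0.02

    def __init__(self):
        self.accounts = {}
        self.signups_by_ip = {}

    def register(self, account_id, ip):
        day = int(time.time() // 86400)
        count = self.signups_by_ip.get((ip, day), 0)
        if count >= self.SIGNUPS_PER_IP_PER_DAY:
            return None                    # refuse (or silently sandbox) the signup
        self.signups_by_ip[(ip, day)] = count + 1
        self.accounts[account_id] = Account(created_at=time.time())
        return self.accounts[account_id]

    def record_verified_result(self, account_id):
        acc = self.accounts[account_id]
        acc.trust = min(1.0, acc.trust + 0.01)   # trust accrues slowly

    def record_poisoned_result(self, account_id):
        acc = self.accounts[account_id]
        acc.trust *= 0.1                         # and collapses quickly
        if acc.trust < self.BAN_THRESHOLD:
            acc.shadow_banned = True   # keep assigning work, silently discard output
```

The point of the shadow ban is that the bot keeps burning resources fetching URLs whose results never get used.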
You make it sound easy but companies have been trying to fight this stuff for ages.
You can buy a trustworthy residential IP for low cost, and you can buy them in bulk in the thousands. All of them are real residential IPs from any ISP of your choosing in any country. You can rent Chrome browsers running over those IPs, directed via remote desktop and accessibility protocols (good luck banning that without running afoul of anti-discrimination laws). You can do all that for under $1k a month for something like 1 million clients.
My workplace has been on the receiving end of DDoS attacks directed by such services; the best you can do is ban the specific Chrome versions they use, but that only lasts until they update.
It's an uphill battle that you will lose in the long term if you rely on client trust.
In terms of spam injection (the concern from upthread), I don't think DDoS is relevant. If the core project controls which URLs each client is asked to process, it can just IP-ban any client that returns too many results. DDoS is a concern for other reasons, though.
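Something as dumb as a per-client quota plus an assignment check would already do it; a sketch below, where the quota and the helper structures are purely illustrative:

```python
# Sketch: only accept results for URLs the project assigned, and cap daily
# contributions per IP.
from collections import Counter

DAILY_RESULT_QUOTA = 500
results_today = Counter()
banned_ips = set()
assigned = {}        # client_ip -> set of URLs the project asked it to fetch
accepted = []        # stand-in for real result storage

def accept_result(client_ip, url, payload):
    if client_ip in banned_ips:
        return False                                  # shadow ban: silently drop
    if url not in assigned.get(client_ip, set()):
        return False                                  # unsolicited result, ignore
    results_today[client_ip] += 1
    if results_today[client_ip] > DAILY_RESULT_QUOTA:
        banned_ips.add(client_ip)
        return False
    accepted.append((client_ip, url, payload))
    return True
```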
I think in this specific case the spammer is on poor footing. The spammer wants to inject specific content, ideally many times. If URLs are processed twice and the spammer controls 50% of the clients, then there's a 50% chance that a simple diff exposes each injected spam. The problem is that the spammer needs to do this many times, so their injections become statistically apparent. If the spammer can only inject a small number of messages before being detected, then the cost per injected spam is quite high. Long-running spam campaigns could eventually be detected by content analysis, so the spammer also needs to rotate content.
Obviously you can play with the numbers: the attacker could try to control >>50% of the clients, the project could process each URL more than twice, re-process N% of URLs on trusted hardware, etc. It's not easy by any means, but you can tune the knobs to increase the cost for spammers.
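Back-of-the-envelope, under my own assumptions (each URL fetched by 2 independent clients, attacker controls a fraction p of the pool, a fraction r of URLs re-fetched on trusted hardware):

```python
# Probability that a spam campaign survives undetected.

def p_injection_survives(p, r):
    # One injection goes unnoticed only if the second fetcher is also an
    # attacker client AND the URL isn't audited on trusted hardware.
    return p * (1.0 - r)

def p_campaign_survives(p, r, n):
    # All n injections must survive independently.
    return p_injection_survives(p, r) ** n

for p in (0.5, 0.8):
    for n in (1, 10, 100):
        print(f"attacker share {p:.0%}, 5% audit, {n:>3} injections: "
              f"{p_campaign_survives(p, 0.05, n):.2e}")
```

Even at an 80% attacker share and only a 5% trusted-audit rate, ten injections survive only ~6% of the time and a hundred essentially never do, which is why a sustained campaign gets expensive.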
> but this is mostly a question of rate limiting account creation, assigning a trustworthiness score to different accounts, some network analysis to detect coordinated accounts, and having some trusted accounts (run by the project) that can help double check results.
Then OP has to do things that don't scale: review some pages by hand and identify a subset of results that can be trusted, then compare those trusted downloads against what new accounts return and mark the bots.
Then the botnet will just behave honestly for a year or so before it starts abusing the network. Even better, because now honest new clients get kicked when they disagree with the bot majority. So now the network bleeds users.
Checking which account is honest isn't too hard: you detect that there's a "problematic mismatch" between two clients, so the project runs its own client to check. If one client's result is an exact match, you'd question the other.
There is a challenge for sites that serve different content based on GeoIP, A/B testing, dynamic content, etc., so some human review of the diff may be needed to check for malice. If the diff literally contains spam, human review will clearly catch it and that bot gets distrusted.
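A sketch of that arbitration step; hash equality stands in for "exact match", and real pages would need normalisation (strip timestamps, ads, GeoIP/A-B variation) before diffing:

```python
import hashlib

def fingerprint(body: bytes) -> str:
    return hashlib.sha256(body).hexdigest()

def arbitrate(result_a: bytes, result_b: bytes, trusted_fetch) -> str:
    """Return which client to distrust: 'a', 'b', or 'neither'."""
    if fingerprint(result_a) == fingerprint(result_b):
        return "neither"                 # no mismatch, nothing to do
    reference = trusted_fetch()          # project re-fetches on its own client
    ref = fingerprint(reference)
    if ref == fingerprint(result_a):
        return "b"                       # b disagrees with the trusted copy
    if ref == fingerprint(result_b):
        return "a"
    return "neither"                     # all three differ: likely dynamic
                                         # content, escalate to human review
```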
Then I'll simply use more bots to get 80% of the network; then I can almost always win any disagreement and your "problematic mismatch" never triggers.
Plus I can now force you to run your own crawler anyway, and either slow your progress or cost you a lot of money.