Abuse is an arms race, so I'm not certain it can be entirely "open source". Someone has to man the servers to adapt as the assault develops. There can be tools that are open source, but without a foundation running services behind them, it's just a bunch of dead-end code.
Absolutely. I agree the key is less the code than the rapidly updated data: source networks, browser headers, client behavior, target links, content markers. Spammers may be awful, but some of them aren't stupid. If they discover something isn't working, they'll change up their approach. And it can't be an open service, because then you're just helping the smart ones hide better.
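Concretely, I'm picturing something like this per message, where every field has to be kept current as spammers adapt (all names here are made up for illustration, not from any real service):

```python
# Hypothetical per-message signal record covering the categories above;
# every field name is invented for illustration.
def extract_signals(message, request):
    text = message["text"]
    return {
        # source networks
        "asn": request.get("asn"),                      # autonomous system of the sender
        "ip_on_blocklist": request.get("ip_on_blocklist"),
        # browser headers
        "user_agent": request.get("user_agent"),
        "accept_language": request.get("accept_language"),
        # client behavior
        "ms_since_page_load": request.get("dwell_ms"),  # bots often post instantly
        # target links
        "link_domains": message.get("domains", []),
        "link_count": len(message.get("domains", [])),
        # content markers
        "text_length": len(text),
        "caps_ratio": sum(c.isupper() for c in text) / max(len(text), 1),
    }
```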
The main thing is training a model. Every time you mark something as spam or off topic, it trains a model so that it can identify similar things in the future. You can train your own model, but it will work better with a lot of data.
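A minimal sketch of that loop, assuming scikit-learn and text-only features (the function names are mine; a real system would feed in many more signals):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)  # stateless, no fitting needed
model = SGDClassifier(loss="log_loss")            # logistic regression, supports online updates

def on_user_flag(text, is_spam):
    """Called each time someone marks a comment as spam (or not spam)."""
    X = vectorizer.transform([text])
    model.partial_fit(X, [int(is_spam)], classes=[0, 1])

def spam_probability(text):
    """Score a new comment against everything flagged so far."""
    X = vectorizer.transform([text])
    return model.predict_proba(X)[0, 1]
```

The `partial_fit` call is the "every flag updates the model" part, and the "works better with a lot of data" part is exactly why one site's flags alone aren't enough.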
Open sourcing it would be hard, because you'd have to trust everyone else to classify things correctly, or you'd have to review every input yourself. A spammer could easily sneak in a lot of false positives or false negatives to throw the entire model off.
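A toy illustration of that kind of poisoning, with made-up data and a throwaway naive Bayes classifier:

```python
# An attacker who controls even part of an open training queue can
# teach the model that their own spam is ham.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts  = ["cheap pills click here", "great writeup, thanks!"]
labels = [1, 0]                        # honest labels: 1 = spam

# attacker floods the queue with the same spam labeled as not-spam
texts  += ["cheap pills click here"] * 100
labels += [0] * 100

X = CountVectorizer().fit_transform(texts)
model = MultinomialNB().fit(X, labels)

# the poisoned majority wins: the spam now scores as ham
print(model.predict(X[:1]))            # -> [0]
```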
Why is the spam score always computed as a single global outcome and never as a population-based result? I mean, you’re processing more data, I get it, but it feels like that would make the model much more resilient to tampering, bad actors, and just different social norms.
For example, if one set of users started rating a specific subset of posts as spam, then those users could be bucketed together into a “doesn’t want to see message type A” group while others, who minded other messages, would be bucketed into a “No B-messages” group.
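Very roughly, with toy data and off-the-shelf k-means (a real system would need sparse matrices and a smarter choice of cluster count):

```python
import numpy as np
from sklearn.cluster import KMeans

# rows = users, columns = posts; 1 means "this user flagged this post"
flags = np.array([
    [1, 1, 0, 0],   # users who mind message type A
    [1, 1, 0, 0],
    [0, 0, 1, 1],   # users who mind message type B
    [0, 0, 1, 1],
])

buckets = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(flags)

def hidden_posts_for(user_idx, threshold=0.5):
    """Hide a post from a user if most of their bucket flagged it."""
    peers = flags[buckets == buckets[user_idx]]
    return np.where(peers.mean(axis=0) > threshold)[0]

print(hidden_posts_for(0))  # -> [0 1], the type-A posts
```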
This would need to be applied selectively, as it could easily result in an echo chamber for normal discourse, but I would’ve given my left arm to have that sort of filtering available in the game during my WoW days. Those city spammers were unbearable!
I, of course, would have fallen into the “I don’t care how great a deal your Thunderfury, Blessed Blade of the Windseeker, is, I’m just here to socialize” bucket.
> Open sourcing it would be hard, because you'd have to trust everyone else to classify things correctly, or you'd have to review every input yourself.
Isn't there a way to choose which people/groups you trust (and don't) to classify spam correctly? Don't open-source ad blockers work like this?
Don't they use reasonably reviewable methods, like regular expressions? And, more importantly, reviewable volumes? Also, an ad blocker can always have a debug mode that shows you the rule that removed an element.
With a trained spam model you have none of that. Gigabytes of text go into training, kilobytes of inscrutable numbers come out. And all the debug info you get is how confident the computer sounded when it said no.
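For contrast, here's roughly what that rule-level transparency looks like (rule names and patterns invented for illustration):

```python
import re

# each rule carries its source list, so a match is always attributable
RULES = [
    ("easylist-style rule", re.compile(r"\bfree [a-z]+ pills\b", re.I)),
    ("community list rule", re.compile(r"buy (?:gold|followers) (?:now|cheap)", re.I)),
]

def is_blocked(text, debug=False):
    for source, pattern in RULES:
        match = pattern.search(text)
        if match:
            if debug:  # the "debug mode": show exactly which rule fired
                print(f"blocked by {source}: {pattern.pattern!r} matched {match.group()!r}")
            return True
    return False

is_blocked("Buy gold now, best prices in Orgrimmar!", debug=True)
```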
Disqus has passable spam and toxicity detection (the latter via a third party), but many sites don't bother to make use of it. As a result, a great many Disqus comment sections are absolute cesspools of spam, bigotry, and threats of violence.
Talk about a useful chunk of open-source software.