If you can identify text written with mixed glyphs, just ban it outright. Normal users don't write text like this; the mere binary presence of such "homomorphic" text is probably a better spam signal than whatever your neural net outputs when you run it after normalization.
I think that depends on the users. People copying and pasting bits of text that were in English or another common language (think documentation, code, news articles, tweets, etc.) with a different character set could be problematic.
Also, 𝒮ℴ𝓂ℯ ๐๐ ๐ ๐ marketed as "𝔽𝕠𝕟𝕥𝕤 𝕗𝕠𝕣 𝕤𝕠𝕔𝕚𝕒𝕝 𝕞𝕖𝕕𝕚𝕒" would be ℭ𝔞𝔲𝔤𝔥𝔱 𝔲𝔭 𝔦𝔫 𝔱𝔥𝔦𝔰. (math symbols) A user base with young people getting bounced or shadow banned for trying to express themselves or distinguish themselves from their peers would be like ಠ_ಠ (Kannada letter ttha)
I think targeting the language they're using is a better bet.
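For the styled "math symbol" letters specifically, Unicode compatibility normalization (NFKC) already folds most of them back to plain letters, so a filter can look at the folded text instead of rejecting the post. A minimal Python sketch, where the function names `fold_styled` and `has_styled_letters` are made up for illustration:

```python
import unicodedata


def fold_styled(text: str) -> str:
    """Fold compatibility variants (math alphanumerics, fullwidth forms, ...) via NFKC."""
    return unicodedata.normalize("NFKC", text)


def has_styled_letters(text: str) -> bool:
    """True if any alphabetic character is changed by NFKC normalization."""
    return any(
        ch.isalpha() and unicodedata.normalize("NFKC", ch) != ch
        for ch in text
    )


# "Fonts" written with mathematical double-struck letters, as escapes
styled = "\U0001d53d\U0001d560\U0001d55f\U0001d565\U0001d564"
print(fold_styled(styled))          # Fonts
print(has_styled_letters(styled))   # True
print(has_styled_letters("Fonts"))  # False
```

Note that NFKC does not touch cross-script lookalikes: a Cyrillic "а" inside a Latin word stays Cyrillic, so this only covers the decorative-font case, and it will also flag legitimate fullwidth text.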
> pasting bits of text that was in English or another common language
If they use many (maybe three? four? or more) character sets in the same post, or different character sets in any single word, then that'd be highly suspicious?
Whilst still letting people copy paste from another language
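A rough Python sketch of that heuristic, with loud assumptions: the standard library has no Unicode script property, so this guesses the script from the first word of each character's Unicode name, and the threshold of three scripts per post is just the guess from above. `script_of`, `scripts_in` and `is_suspicious` are hypothetical names.

```python
import unicodedata


def script_of(ch: str):
    """Approximate script: first word of the character name (LATIN, CYRILLIC, ...)."""
    if not ch.isalpha():
        return None  # ignore digits, punctuation, emoji, whitespace
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # character has no name
        return None


def scripts_in(text: str) -> set:
    """Set of approximate scripts used by the letters in text."""
    return {s for s in map(script_of, text) if s is not None}


def is_suspicious(post: str, script_limit: int = 3) -> bool:
    """Flag posts with many scripts overall, or mixed scripts inside one word."""
    if len(scripts_in(post)) >= script_limit:
        return True
    return any(len(scripts_in(word)) > 1 for word in post.split())


print(is_suspicious("p\u0430ypal login"))  # True: Cyrillic "а" inside a Latin word
print(is_suspicious("Hello \u043f\u0440\u0438\u0432\u0435\u0442"))  # False: two scripts, one per word
```

Pasting a whole phrase in another language passes, while a single word that mixes scripts gets flagged. Japanese would need an exception, since it routinely mixes hiragana, katakana and kanji without spaces between them.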
Special case needed for the shoulder shrug with a hiragana letter tsu, I mean katakana tsu
I've noticed much more usage of alternative Unicode ranges for numbers/letters in email subjects lately to make marketing messages stand out, too (in addition to emoji of course), though I wouldn't necessarily mind banning that...
Huh. For any specific purpose? Does it seem like they're avoiding paying for recruiter accounts or something by evading algorithms designed to detect their activity, or is it just for the heck of it?
I know, right? There are so many times when I've wanted to use something like box drawing unicode characters (cp437) to explain a complicated concept on hacker news, but alas I couldn't, due to widespread computer fraud and abuse. How are we going to build a more inclusive internet that serves the interests of ALL people around the world, regardless of native language, if the bad guys are forcing administrators to ban unicode? (╯°□°)╯︵ ┻━┻
They kinda do. Check out the shrug "emoji", table flip, and so forth. Then there's the meme of adding text above and below by abusing Unicode's "super" and "sub" modifications.
You could block it to only ever represent ASCII, but then you've knocked out the ability to expand internationally.