
Yeah, then someone has to create or find that whole table and make.

The initial problem wasn't those symbols but the content itself; the symbols and special characters came into it later.

Later on, as mentioned in my original comment, they would take positive content from other blog posts that had been published and passed moderation, and mix it in with their bad content.

I probably could have used a different method, but at the time I needed something quick, and it worked. It still works with very little tweaking.

We don't have a massive number of threats or abusers anymore, so I can't know the exact effect, but so far it works.

At that time they were coming in at several thousand per minute, and IP blocking, range blocking, user-agent filtering, captchas, and the like didn't work on them.



The good news is that the Unicode consortium has a report on this issue, and the tables already exist for normalization and mapping of confusables to their ASCII lookalikes: https://www.unicode.org/reports/tr39/
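For anyone wanting to try this, here is a rough Python sketch of the idea, not the report's reference implementation. It assumes the confusables data uses the "source ; target ; type # comment" layout with hex code points, and it uses a simplified form of the skeleton comparison (NFD, map each character, NFD again). The three sample entries and the names parse_confusables, skeleton and looks_like are made up for illustration; for real use you would load the full data file published alongside the report.

```python
import unicodedata

# Illustrative entries only, written in the assumed
# "source ; target ; type # comment" layout. Load the real data file instead.
SAMPLE = """\
0430 ; 0061 ; MA # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
043E ; 006F ; MA # CYRILLIC SMALL LETTER O -> LATIN SMALL LETTER O
0441 ; 0063 ; MA # CYRILLIC SMALL LETTER ES -> LATIN SMALL LETTER C
"""


def parse_confusables(text):
    """Build a {confusable: prototype} dict from confusables-style lines."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 2:
            continue  # not a mapping line
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table[source] = target
    return table


def skeleton(s, table):
    """Simplified skeleton: NFD, replace each character, NFD again."""
    s = unicodedata.normalize("NFD", s)
    s = "".join(table.get(ch, ch) for ch in s)
    return unicodedata.normalize("NFD", s)


TABLE = parse_confusables(SAMPLE)


def looks_like(a, b, table=TABLE):
    """True if the two strings collapse to the same skeleton."""
    return skeleton(a, table) == skeleton(b, table)
```

With the sample table, looks_like("c\u043ec\u043ea", "cocoa") is true because the two Cyrillic letters map to Latin "o". Two strings that merely share a skeleton are not necessarily abusive, so this is better treated as a signal for moderation than as a verdict.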


Oh, that's nice.

I guess I can use that the next time I work on the data cleaning for that model.

Thanks.


As I commented above:

I built a Python library for finding strings obfuscated this way. It was critical when moderating our Telegram channel before an ICO. https://github.com/wanderingstan/Confusables E.g. "𝓗℮𝐥1೦" would match "Hello"



