The rule must be very simple: any occurrence of `eval()` should be a BIG RED FLAG. It should be handled like a live bomb, which it is.
Then, any appearance of unprintable characters should also be flagged. There are rather few legitimate uses of some zero-width characters, like ZWJ in emoji composition. Ideally all such characters should be written as \uNNNN escape sequences, not as literal characters.
Simple lint rules would suffice for that, with zero AI involvement.
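For what it's worth, a rule like that could be sketched in a few lines of Python. This is only an illustration, not anyone's actual linter: `find_invisible` is a made-up name, and the choice to flag Unicode categories Cc and Cf plus the variation selector ranges is my assumption about what "unprintable" should cover. It reports each hit with the escape you would write instead of the literal character.

```python
import unicodedata

# Ordinary whitespace controls that are tolerated (assumption for this sketch).
ALLOWED_CONTROLS = {"\t", "\n", "\r"}


def is_variation_selector(ch):
    """True for U+FE00..U+FE0F and U+E0100..U+E01EF."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def find_invisible(text):
    """Return (line, column, escape) for every invisible character found."""
    hits = []
    # Split on "\n" only, so exotic line separators stay visible to the check.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in ALLOWED_CONTROLS:
                continue
            # Cc = control characters, Cf = format characters (ZWJ, ZWSP, BOM).
            invisible = unicodedata.category(ch) in ("Cc", "Cf")
            if invisible or is_variation_selector(ch):
                cp = ord(ch)
                escape = f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}"
                hits.append((line_no, col, escape))
    return hits
```

Run it over each source file and fail the build when the list is non-empty. It will also flag the legitimate ZWJ and variation selector uses inside emoji, so a real rule would need an exemption for string literals or an explicit escape requirement.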
I think there’s debate (which I don’t want to participate in) over whether or not invisible characters have their uses in Unicode. But I hope we can all agree that invisible characters have no business in code, and banishing them is reasonable.
How is it an accessibility issue? HTML allows things like little gif files. I've done this myself when I wrote text that contained Egyptian hieroglyphs. It works just fine!
Then use words. Or tooltips (HTML supports that). I use tooltips on my web pages to support accessibility for screen readers. Unicode should not be attempting to badly reinvent HTML.
In our repos, we run some basic tooling like ruff, and that includes a hard error on any non-ASCII characters. We mostly did this after some un-fun times when byte order marks somehow ended up in a file and made something fail.
I have considered allowing a short list that does not include emojis, joining characters, and so on (basically just currency symbols, accent marks, and everything else you'd find in CP-1252), but never got around to it.
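A rough sketch of what such a short list could look like, assuming the code page meant here is Windows-1252 (Python's `cp1252` codec). `disallowed_chars` is a hypothetical helper, not part of ruff or any other tool. It accepts anything that encodes to that code page and rejects control and format characters on top, since the code page itself contains the invisible soft hyphen (U+00AD).

```python
import unicodedata


def disallowed_chars(text):
    """Return the sorted characters in text that fall outside the allowlist."""
    bad = set()
    for ch in text:
        if ch in "\t\n\r":
            continue  # ordinary whitespace is always fine
        if unicodedata.category(ch) in ("Cc", "Cf"):
            bad.add(ch)  # controls and invisible format characters
            continue
        try:
            ch.encode("cp1252")  # allowed only if the code page can represent it
        except UnicodeEncodeError:
            bad.add(ch)
    return sorted(bad)
```

With this, `café €5` passes while an emoji, a BOM, or a zero-width joiner is reported. Accents are only accepted in precomposed form, because combining marks are not in the code page.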
Automatic escaping sounds nice until you need to grep or diff across repos and get buried in opaque escapes that turn ordinary review into unreadable junk. Once that lands in a repo, even routine deps updates can turn into edge-case mismatch roulette.
Lint zero-width chars, sure. But if the actual sink is runtime string injection, banning `eval` is only half a fix, because the `Function` constructor and friends still get you to the same bad place while the linter congratulates itself.
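To make that concrete, here is a minimal sketch of a check that looks for more than just `eval` in JavaScript source. The pattern list and the names `SINK_PATTERNS` and `find_sinks` are my own assumptions, and it is plain regex matching, not a parser, so it is a triage aid at best.

```python
import re

# Hypothetical pattern list: common ways to run a string as code in JavaScript.
SINK_PATTERNS = {
    "eval call": re.compile(r"\beval\s*\("),
    "Function constructor": re.compile(r"\b(?:new\s+)?Function\s*\("),
    # setTimeout/setInterval only count when the first argument is a string.
    "string timer": re.compile(r"\bset(?:Timeout|Interval)\s*\(\s*[\"'`]"),
}


def find_sinks(source):
    """Return (line_number, label) for each line that matches a sink pattern."""
    hits = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for label, pattern in SINK_PATTERNS.items():
            if pattern.search(line):
                hits.append((line_no, label))
    return hits
```

Anything that reaches these through indirection (string concatenation on a property name, aliasing, and so on) slips straight past a check like this, which is rather the point of the comment above.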
Yeah, it would have been nice to end with "and here's a five-line shell script to check if your project is likely affected". But to their credit, they do have an open-source tool [1]. I'm just not willing to install a big blob of JavaScript to look for vulns in my other big blobs of JavaScript.
The grep approach catches zero-width joiners and BOM characters but misses what GlassWorm uses - variation selectors (U+FE00-FE0F and U+E0100-E01EF). Those don't show up in most regex patterns people reach for, and they're valid Unicode so editors don't flag them either.
ESLint won't catch it because variation selectors are legal characters - they're meant for glyph selection in CJK text and emoji. The issue is that GlassWorm uses thousands of them per line where legitimate use is 1-2. It's a density problem, not a character-class problem.
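The density idea is easy to sketch. This is not the scanner mentioned in this thread, just a minimal illustration: `dense_lines` is a made-up name and the threshold of 8 is an arbitrary assumption you would want to tune against your own code.

```python
def is_variation_selector(ch):
    """True for U+FE00..U+FE0F and U+E0100..U+E01EF."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def dense_lines(text, threshold=8):
    """Return (line_number, count) for lines with more selectors than threshold."""
    flagged = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        count = sum(1 for ch in line if is_variation_selector(ch))
        if count > threshold:  # 8 is a guess, not a measured cutoff
            flagged.append((line_no, count))
    return flagged
```

A line with one emoji presentation selector stays quiet, while a line stuffed with dozens of selectors gets reported along with the count.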
We ran into this while analyzing the waves at work and ended up building a scanner around it - counts variation selector clusters per line, matches the decoder pattern (codePointAt + the specific arithmetic GlassWorm uses) in a narrow window to cut false positives from minified code. Open-sourced it last week: https://github.com/afine-com/glassworm-hunter