Trojan Source: Invisible Vulnerabilities

theamk · on Nov 1, 2021

Some notes:

- Syntax highlighting gets it every time. Seeing the last word of a comment or half of a function name in a different color is a pretty clear indicator of shenanigans.

- emacs respects the bidi controls (and thus is vulnerable); vi does not, and shows things as:

    if access_level != 'none<202e>⁦':

but the homoglyph attack gets them both... unless you set LANG=C.

- "git log" gets them all

    +if access_level != 'none<U+202E><U+2066>': # Check if admin <U+2069><U+2066>' and access_level != 'user

- Some linters flag the techniques used. I guess one can also add lint-disable comments, but that makes the attack more complex and makes the code stand out more.

    Python/early-return.py:6:4: W0101: Unreachable code (unreachable)
    Python/homoglyph-function.py:3:0: C0103: Function name "sayHello" doesn't conform to snake_case naming style (invalid-name)

(The last one is especially powerful: pylint by default checks all function names against strict regexes, which outright reject all the fancy names. http://pylint-messages.wikidot.com/messages:c0103 )

- "less" also shows all the non-printable characters, unicode included. This comes in handy when looking at unknown binary files, but apparently it helps with text files, too!

l0b0 · on Nov 1, 2021

Yeah, my first thought was that the plethora of tools the code has to go through — version control, whichever tools are used for code review, linters, formatters and tests — means the odds of getting anything like this into a big code base must be pretty low. Maybe if someone manages to combine this with impersonating a highly trusted person and that person doesn't find out soon enough. Or of course if there's no QA process in place at all, which is definitely the case at a lot of big companies.

jwilk · on Nov 1, 2021

> "less" also shows all the non-printable characters

Only if you pass -U.

theamk · on Nov 1, 2021

No. What "-U" does is (quoting manpage): "Causes backspaces, tabs and carriage returns to be treated as control characters;"

So even without -U, all non-printable characters except BS, CR, and TAB are shown in hexadecimal notation. BS (backspace) is used to define underlining, typewriter-style; it is technically a hidden character, but since most web-based apps do not support this, it would not make an effective attack.

(the homoglyphs are harder... in default mode, "less" defines printable as "32-126"; in "utf-8" mode it uses unicode character database. And mode depends on system-wide locale, and often is utf-8 in modern systems, so you want "LANG= less file.py" to see homoglyphs.. on the other hand, that homoglyph attack will be detected by pretty much any linter)
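For Python sources, the "any linter will catch homoglyphs" idea can be approximated in a few lines. This is only a rough sketch, not what pylint actually does: it assumes that every identifier in the code base is meant to be plain ASCII, and the function name `non_ascii_identifiers` is made up for illustration.

```python
import io
import tokenize


def non_ascii_identifiers(source: str):
    """Return (line, column, name) for every NAME token with non-ASCII characters.

    Assumption: identifiers in this code base are supposed to be ASCII-only,
    so any other character in a name is treated as suspicious.
    """
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            # tok.start is (1-based line, 0-based column)
            hits.append((tok.start[0], tok.start[1], tok.string))
    return hits


# The "H" below is really U+041D (Cyrillic capital EN), a homoglyph of Latin H.
sample = "def say\u041dello():\n    pass\n"
for line, col, name in non_ascii_identifiers(sample):
    print(f"{line}:{col}: suspicious identifier {name!r}")
```

A check like this only looks at identifiers, so it does nothing about bidi controls hidden in comments or string literals.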

jwilk · on Nov 1, 2021

What the man page says about the -U option is either super confusing or outright wrong.

You do need -U to make bidi formatting characters visible:

  $ less --version | head -n1
  less 551 (GNU regular expressions)

  $ locale charmap
  UTF-8

  $ printf 'a\342\200\216b\n' | less -F
  a‎b

  $ printf 'a\342\200\216b\n' | less -U -F
  a<U+200E>b
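If the behaviour of "less" differs between versions, the same kind of rendering is easy to imitate. A minimal sketch, assuming that "invisible" means Unicode general category Cf (format) or Cc (control), with newline and tab left alone; the helper name `show_invisible` is hypothetical:

```python
import unicodedata


def show_invisible(text: str) -> str:
    """Replace format (Cf) and control (Cc) characters with <U+XXXX> markers.

    Newline and tab are kept as-is so ordinary source layout survives.
    """
    out = []
    for ch in text:
        if ch not in "\n\t" and unicodedata.category(ch) in ("Cf", "Cc"):
            out.append(f"<U+{ord(ch):04X}>")
        else:
            out.append(ch)
    return "".join(out)


print(show_invisible("a\u200eb"))  # prints a<U+200E>b
```

This catches the bidi formatting characters because they are all in category Cf, but it says nothing about homoglyphs, which are ordinary printable letters.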

dwheeler · on Nov 1, 2021

Interesting paper. Note, however, that the general problem is already known and there are a number of pre-existing works that discuss it. This is typically called "underhanded code" or sometimes "maliciously misleading code". I'm surprised that they didn't use the normal term for the problem nor cite the previous work on it - maybe they didn't realize this was a widely-known problem? Previous works on underhanded code didn't discuss Bidi to my knowledge (though other attacks on text like this have exploited Bidi).

Here are a number of other materials about underhanded code:

The Obfuscated V Contest (http://graphics.stanford.edu/~danielh/vote/vote.html) was created by Daniel Horn in 2004 and is the earliest “underhanded” programming contest that I found. It was a contest to create source code that looked like it did one thing, but actually did another.

Underhanded C Contest (http://www.underhanded-c.org/) has run for many years. Per its FAQ, "The Underhanded C Contest is an annual contest to write innocent-looking C code implementing malicious behavior."

My PhD dissertation "Fully Countering Trusting Trust through Diverse Double-Compiling" discusses how to counter the "trusting trust" problem & includes a section about maliciously misleading source code. See: https://dwheeler.com/trusting-trust/

The JavaScript Misdirection Contest announced the winner on September 27, 2015 http://misdirect.ion.land/

My paper "Initial Analysis of Underhanded Source Code", (by David A. Wheeler, April, 2020, IDA document: D-13166), discusses underhanded code and the effectiveness of several potential countermeasures. It also includes a number of citations to other works on underhanded code. See: https://www.ida.org/research-and-publications/publications/a... https://www.ida.org/-/media/feature/publications/i/in/initia...

bitwize · on Nov 1, 2021

In the novel Jurassic Park, it goes into some detail as to how Nedry got the park shut down, even going so far as to include IDE screenshots. Apparently, by specially naming his park shutdown code he managed to disguise a call to it as a constructor for an innocuous class or something, as I recall. Teenage me thought it was devilishly neat.

omgitsabird · on Nov 1, 2021

https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html

lathiat · on Nov 1, 2021

This is a really good write-up of the issue. I found the example simpler than the one the original paper showed.

Veserv · on Nov 1, 2021

Countermeasures to this sort of attack are already bog-standard in high reliability systems. You just verify that the compiler output corresponds to the source code [1]. It is a pain, but necessary for cases with high reliability and safety requirements since relying on your compiler as a sole source of truth is silly due to the possibility of code generation bugs. Incidentally, this also defeats "trusting trust" style attacks.

[1] https://ldra.com/aerospace-defence/capabilities/object-code-...

theamk · on Nov 1, 2021

Is that the right link? One of the three requirements is "there is additional code that is not traceable to source code", but that's not the case here at all. Every emitted assembly line has a corresponding source code line, at least as far as the compiler and verification systems are concerned. Users see something else, but the build/testing system does not know about that.

To follow up, I do wonder what the threat model is for a system like this... because your page talks about "generates additional code that is not directly traceable to Source Code statements". So if the backdoored compiler emits something like:

     # Source line: if (user_class == ADMIN) {
     cmp $user_class, V_ADMIN
     jeq admin_stuff
     cmp $user_password, "please_let_me_in"
     jeq admin_stuff

... then your fancy verification system will see that the block of assembly is directly traceable to a source code statement (because the compiler said so!) and will let it through. Looks like that will be as susceptible to a "trusting trust" attack as any other system... as long as the attacker does not forget to forge the metadata as well as the object code.

Veserv · on Nov 1, 2021

The human verifier in the process will trivially determine that the compiler output does not match the source code. Your reasoning only applies to a process lax enough to just assume the compiler can be trusted. This also ignores the testing requirements that require the tests to exercise the source code and its corresponding output. It would be quite amazing to have a rewrite that not only causes malicious code to be generated, but also causes the malicious code to pass the independently written tests while ensuring the tests fully exercise all branches and conditions in the malicious code.

jwilk · on Nov 1, 2021

Another HN discussion:

https://news.ycombinator.com/item?id=29062982

lom · on Nov 1, 2021

How can I check my own source code for these unicode glyphs with (recursive-)grep?

jwilk · on Nov 1, 2021

For BiDi attacks, assuming a shell that understands $''-strings (such as bash or zsh) and that the locale encoding matches the source encoding, this should work:

    grep -r $'[\u061C\u200E\u200F\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069]' /path/to/source
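If the shell or locale assumptions above do not hold, roughly the same search can be written in Python. This is a sketch that uses the same list of code points as the grep command; it assumes the sources are UTF-8 (undecodable bytes are replaced rather than reported), and `find_bidi` is just an illustrative name.

```python
import os
import re

# Same code points as the grep character class: ALM, LRM, RLM,
# LRE/RLE/PDF/LRO/RLO, and LRI/RLI/FSI/PDI.
BIDI_CONTROLS = re.compile("[\u061C\u200E\u200F\u202A-\u202E\u2066-\u2069]")


def find_bidi(root):
    """Yield (path, line_number, line) for each line containing a bidi control."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Assumption: files are UTF-8; bad bytes become U+FFFD.
                with open(path, encoding="utf-8", errors="replace") as f:
                    for lineno, line in enumerate(f, 1):
                        if BIDI_CONTROLS.search(line):
                            yield path, lineno, line.rstrip("\n")
            except OSError:
                # Unreadable files (permissions, broken symlinks) are skipped.
                continue
```

Calling `list(find_bidi("/path/to/source"))` gives every hit; printing each line with `repr()` makes the hidden characters show up as `\u202e`-style escapes.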

z29LiTp5qUC30n · on Nov 1, 2021

Just use sin

https://github.com/oriansj/stage0/blob/master/High_level_pro...

It'll dehex any non-humanly used ASCII characters

sbierwagen · on Nov 1, 2021

If your source code is ASCII-only, you can just strip out any multi-byte character.
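A sketch of that idea, assuming the file really is supposed to be pure ASCII, so every byte with the high bit set is unwanted. Both function names are made up; reporting before stripping is probably safer, since silently dropping bytes also hides the fact that someone put them there.

```python
def non_ascii_positions(data: bytes):
    """Return (line, column, byte) for every byte >= 0x80 (1-based line, 0-based column)."""
    hits = []
    for lineno, line in enumerate(data.split(b"\n"), 1):
        for col, byte in enumerate(line):
            if byte >= 0x80:
                hits.append((lineno, col, byte))
    return hits


def strip_non_ascii(data: bytes) -> bytes:
    """Drop every byte outside the 7-bit ASCII range."""
    return bytes(b for b in data if b < 0x80)


raw = "if access_level != 'none\u202e':".encode("utf-8")
print(non_ascii_positions(raw))  # the three UTF-8 bytes of U+202E
print(strip_non_ascii(raw))      # b"if access_level != 'none':"
```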

pabs3 · on Nov 1, 2021

Are there any linters that detect these sorts of issues?