According to the FSF, there is a separation between data and code.
Which is, of course, a complete denial of reality. Code is data, and data is code. That duality is the crucial reason general-purpose computers are so powerful. The only ones who profit from trying to draw a distinction are, as usual, the lawyers and the corporations behind them, who seek to restrict rather than empower.
Especially in this era when decompilers are close to "perfect" (and can sometimes even be better than reading the original source code!), and with the rise of AI, IMHO the whole idea of "source code" being somehow more special than the executable binary is quickly losing relevance.
I have definitely read some teammates' code that felt like it would be more readable after a compiler-decompiler round-trip. I never actually tried it, but I doubt the result would be less readable than that seemingly intentionally obfuscated garbage.
Can't wait for the JetBrains "deabstract" plugin that compiles the code, decompiles it, reconstructs an indirection-free AST, and then generates cleaner code from that AST via AI. De-Tech-Bro-My-Code. Pull the plug on the all-the-patterns-in-one-project devs and get cleaner code today.
There is a lot of decompiler research which isn't public.
A sibling comment mentions Hex-Rays and Ghidra. Those are only now slowly approaching the capabilities of what I've used.
The fact that the majority of code tends not to be intentionally obfuscated, and is compiler-generated and thus easily pattern-matched, also makes it quite straightforward. Of course, the fact that decompilers are often used on code that is obfuscated (e.g. malware, DRM) skews a lot of people's perceptions.
Just to be completely clear, the conditions I have been using Ghidra/Hex-Rays/BN under were not that bad. I wasn't analyzing malware or heavily DRM'd software. Even with symbols and full debug info, many of those gripes still apply. (Hex-Rays is able to do a lot more with debug info: it can usually get a lot of the vtable indirections typed correctly, including, with a bit of effort, the offset this pointers from multiple inheritance.)
I'd love to see this non-public decompiler research but I have some skepticism, as a lot of the information that is lost would require domain-specific reconstruction to get back to anywhere near full fidelity. I do not deny that you have seen impressive results that I have not, but I really do wonder if the results are as generalizable as you're making it sound. That sounds like quite a breakthrough that I don't think Ghidra or IDA are slowly approaching.
But since it's non-public, I suppose I'll just have to take you at your word. I'll be looking forward to it some day.
> Especially in this era when decompilers are close to "perfect" (and can sometimes even be better than reading the original source code!)
As someone who is knee-deep in a few hobby reverse engineering projects, I certainly wish this was the case :)
Hex-Rays and Ghidra both do a very commendable job, but when it comes to compiled languages, the output is almost never better than reading the original source code. Even the easier parts of reversing C++ binaries still aren't fully automated; nothing that I'm aware of is going to automatically pull out your vtables and start inferring class hierarchies.
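To make the vtable point concrete, here's a minimal sketch (invented class and function names, hand-written rather than real tool output) of the gap between a C++ virtual call in source and the raw indirection a decompiler has to present:

    // Minimal sketch, not real decompiler output: names and layout are invented.
    #include <cstdio>

    struct Shape {
        virtual double area() const = 0;
    };

    struct Circle : Shape {
        double r;
        explicit Circle(double radius) : r(radius) {}
        double area() const override { return 3.14159265358979 * r * r; }
    };

    // Roughly the view a decompiler has of `s->area()`: an indirect call through
    // an unnamed vtable slot, with `this` as an opaque pointer and no hierarchy.
    // The slot index and calling convention are ABI guesses the analyst makes.
    double area_as_a_decompiler_sees_it(void* s) {
        using Slot0Fn = double (*)(const void*);
        void** vtable = *reinterpret_cast<void***>(s);   // hidden vptr at offset 0
        return reinterpret_cast<Slot0Fn>(vtable[0])(s);  // slot 0 happens to be area()
    }

    int main() {
        Circle c{2.0};
        std::printf("source view:     %f\n", c.area());
        std::printf("decompiler view: %f\n", area_as_a_decompiler_sees_it(&c));
    }

Recovering that `vtable[0]` is Circle::area, let alone that Circle derives from Shape, is exactly the part that isn't automated.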
Variable names are lost in executable code. When it comes to getting them back, most of the tools support working backwards from "known" API calls to infer decent function names, but only Binary Ninja offers a novel approach to providing variable names: their LLM service, Sidekick, offers suggestions to improve the analysis, including naming variables. Of course, it isn't very impressive if you just drop into a random function in a random binary with no annotations and no debug information.
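As a rough illustration (hand-written, not actual Ghidra or Sidekick output), this is the kind of information gap those tools are trying to claw back:

    #include <cstddef>
    #include <cstdint>

    // What the author wrote: the names carry most of the meaning.
    size_t count_nonzero(const uint8_t* buf, size_t len) {
        size_t count = 0;
        for (size_t i = 0; i < len; ++i)
            if (buf[i] != 0) ++count;
        return count;
    }

    // Roughly what survives compilation: identical logic, placeholder names in
    // the style decompilers generate. Guessing "count_nonzero" back from this is
    // the part an LLM assistant is being asked to do.
    uint64_t FUN_00401000(uint8_t* param_1, uint64_t param_2) {
        uint64_t uVar1 = 0;
        for (uint64_t uVar2 = 0; uVar2 < param_2; ++uVar2)
            if (param_1[uVar2] != 0) ++uVar1;
        return uVar1;
    }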
Most of the "framework" stuff that compiles down through some form of metaprogramming is nearly nonsense and requires you to know the inner workings of the frameworks you're touching. In my case I spend a lot of time on Win32 binaries, so the tricky things I see are often a result of libraries like MFC/ATL/WTL/etc. And I'll grant you that in some cases the original source code wouldn't exactly be the most scrutable thing in the world, but I'd still really rather have the MFC message handler mapping in its original form :) COM becomes a complete mess since it's all vtable-indirected and there's just no good way for a decompiler to know which vtable(s) are involved or, to some degree, the function signatures of the vtable slots, so you have to determine this by hand.
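For the COM point, a hedged sketch (a made-up interface, not real COM headers) of why the indirection is so opaque:

    #include <cstdint>

    // Invented COM-style interface for illustration; the layout mimics the usual
    // lpVtbl convention but none of this comes from real Windows headers.
    struct IWidget;
    struct IWidgetVtbl {
        int32_t  (*QueryInterface)(IWidget*, const void* iid, void** out);
        uint32_t (*AddRef)(IWidget*);
        uint32_t (*Release)(IWidget*);
        int32_t  (*DoThing)(IWidget*, int32_t arg);
    };
    struct IWidget { const IWidgetVtbl* lpVtbl; };

    // How the source reads: the slot has a name and a signature.
    int32_t use_widget(IWidget* w) {
        return w->lpVtbl->DoThing(w, 42);
    }

    // Roughly how the same call surfaces once types are stripped: "call whatever
    // sits at offset 0x18 of whatever *param_1 points to", with the interface,
    // the slot name, and the signature all left for the analyst to reconstruct.
    int32_t use_widget_as_decompiled(void* param_1) {
        using UnknownFn = int32_t (*)(void*, int32_t);
        void** vtbl = *reinterpret_cast<void***>(param_1);
        return reinterpret_cast<UnknownFn>(vtbl[3])(param_1, 42);  // slot 3 == offset 0x18 on 64-bit
    }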
Vectorized code is also a nightmare. Even if the code was originally written using intrinsics, you are probably better off sticking to the graph view in the disassembly. Hex-Rays did improve this somewhat but last I checked it still struggled to actually get all the way through.
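For a sense of scale, even a loop this small (a generic SSE example, not taken from any particular binary) decompiles into a wall of packed-register operations:

    #include <immintrin.h>
    #include <cstddef>

    // Four floats at a time plus a scalar tail. In a decompiler this tends to
    // come back as per-lane pseudo-ops or raw XMM shuffling that is easier to
    // follow in the disassembly graph view than in pseudo-C.
    void add_arrays_sse(float* dst, const float* a, const float* b, size_t n) {
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
        }
        for (; i < n; ++i)
            dst[i] = a[i] + b[i];
    }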
The truth is that the main benefit of the decompiler view in IDA/Ghidra/etc. is actually the control-flow reconstruction, which for me makes the output vastly easier to read than even the best graph view implementation. And this, too, is not perfect. Switch statements that compile down to jump tables tend to be reconstructed correctly, but many switch statements compile down to a binary tree of conditionals instead; this is the case a lot of the time for Win32 WndProc functions, presumably because the WM_* values are almost always too sparse to be efficient as a jump table. So I'd much rather have the original source code, even for that.
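A hedged sketch of that WndProc point (stubbed-out handlers; message values as in WinUser.h): the sparse case values push compilers toward a comparison tree, and that tree is what comes back out.

    #include <cstdint>

    constexpr uint32_t WM_DESTROY = 0x0002;
    constexpr uint32_t WM_PAINT   = 0x000F;
    constexpr uint32_t WM_CLOSE   = 0x0010;
    constexpr uint32_t WM_COMMAND = 0x0111;

    // Stub handlers so the sketch stands alone.
    int64_t on_destroy() { return 0; }
    int64_t on_paint()   { return 0; }
    int64_t on_close()   { return 0; }
    int64_t on_command() { return 0; }
    int64_t default_proc(uint32_t) { return 0; }

    // How the source reads: one flat switch over the message ID.
    int64_t wnd_proc(uint32_t msg) {
        switch (msg) {
            case WM_DESTROY: return on_destroy();
            case WM_PAINT:   return on_paint();
            case WM_CLOSE:   return on_close();
            case WM_COMMAND: return on_command();
            default:         return default_proc(msg);
        }
    }

    // The shape a decompiler typically hands back when the compiler chose a
    // comparison tree over a jump table: nested ifs ordered by value, with the
    // original switch structure gone.
    int64_t wnd_proc_as_decompiled(uint32_t param_1) {
        if (param_1 < 0x10) {
            if (param_1 == 2)   return on_destroy();
            if (param_1 == 0xF) return on_paint();
        } else {
            if (param_1 == 0x10)  return on_close();
            if (param_1 == 0x111) return on_command();
        }
        return default_proc(param_1);
    }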
Of course it depends a bit on the target. C code on ELF platforms probably yields better results if I had to guess, due to the global offset table and the lack of indirection in C code. Objective-C is probably even better. And I know for a fact that Java and C# "decompiling" is basically full fidelity, since the bytecode is much closer to the source code. But in practice, I would say we're a number of major breakthroughs away from this statement in general not being a massive hyperbole.
(I'm not complaining either. Hex-Rays/Ghidra/BN/etc. are all amazing tools that I'm happy to have at my disposal. It's just... man. I wish. I really wish.)