I've been thinking about how to build a benchmark for this stuff for a while, and I don't have a good idea other than LLM-as-judge (which quickly gets messy). I guess there's a reason why current neural decompilation attempts are all evaluated on "seemingly meaningless" benchmarks like "can it recompile without syntax errors" or "functional equivalence of the recompiled output" etc.
Besides functional equivalence, a significant part of the value in neural decompilation is the symbols it recovers (function names, variable names, struct definitions including member names). So, if the LLM predicted "FindFirstFitContainer" for a function originally called "find_pool", is this correct? Wrong? 26.333% correct?
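
To make the partial-credit problem concrete, here's a rough sketch of the most obvious automatic metric: split both names into word tokens and compute token-level F1. The helper names (`split_identifier`, `token_f1`) are made up for illustration and don't come from any existing benchmark.

```python
import re

def split_identifier(name: str) -> list[str]:
    """Split a snake_case or CamelCase identifier into lowercase word tokens."""
    tokens = []
    for part in re.split(r"[_\W]+", name):
        # Acronym runs (HTTP), capitalized or lowercase words (Find, pool), digits
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [t.lower() for t in tokens]

def token_f1(predicted: str, original: str) -> float:
    """Token-level F1 between two identifiers (hypothetical metric, exact match only)."""
    pred = set(split_identifier(predicted))
    orig = set(split_identifier(original))
    overlap = len(pred & orig)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(orig)
    return 2 * precision * recall / (precision + recall)

# Only "find" matches: precision 1/4, recall 1/2, so F1 is 1/3.
print(round(token_f1("FindFirstFitContainer", "find_pool"), 3))  # 0.333
```

So this metric says "33% correct", which is about as arbitrary as my made-up number. It gives no credit for "container" being a plausible stand-in for "pool", and it can't tell whether "first fit" is useful extra information or a hallucination. Fixing that means some kind of semantic similarity, which leads straight back to embeddings or LLM-as-judge.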