Hacker News

I've been thinking about how to build a benchmark for this stuff for a while, and don't have a good idea other than LLM-as-judge (which quickly gets messy). I guess there's a reason why current neural decompilation attempts are all evaluated on "seemingly meaningless" benchmarks like "can it recompile without syntax errors", "functional equivalence of the recompiled binary", etc.


Hmm, specifically when it comes to reverse engineering, you have the best benchmark ever - you can check the original code, no?


that requires LLM as judge


no it doesn't, you just diff against the real source code. probably something more fuzzy/continuous than actual diff, but still
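To make "fuzzy/continuous" concrete, here's a minimal sketch of one way to score decompiler output against the real source without an LLM judge: difflib's similarity ratio over whitespace-normalized token streams. (This is just one crude stand-in; a real benchmark would probably normalize formatting and identifiers first.)

```python
import difflib

def fuzzy_source_similarity(decompiled: str, original: str) -> float:
    """Continuous 0..1 similarity between two source texts.

    Tokenizes on whitespace so pure formatting differences don't count,
    then uses difflib's longest-matching-subsequence ratio.
    """
    a = decompiled.split()
    b = original.split()
    return difflib.SequenceMatcher(a=a, b=b).ratio()

# Identical code modulo whitespace scores 1.0; unrelated code scores near 0.
print(fuzzy_source_similarity(
    "int add(int a, int b) { return a + b; }",
    "int add(int a,\n        int b) {\n    return a + b;\n}"))  # -> 1.0
```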


Besides functional equivalence, a significant part of the value in neural decompilation is the symbols it recovers (function names, variable names, struct definitions including member names). So, if the LLM predicted "FindFirstFitContainer" for a function originally called "find_pool", is this correct? Wrong? 26.333% correct?
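One non-LLM way to put a number on that question is token overlap between identifiers: split camelCase/snake_case names into words and take the Jaccard similarity. A hedged sketch (the splitting regex and the scoring choice are my assumptions, not anything the benchmarks above actually use):

```python
import re

def name_tokens(identifier: str) -> set[str]:
    """Split a camelCase / snake_case identifier into lowercase word tokens."""
    spaced = re.sub(r'([a-z0-9])([A-Z])', r'\1 \2', identifier)
    return {t.lower() for t in re.split(r'[_\s]+', spaced) if t}

def name_similarity(predicted: str, original: str) -> float:
    """Jaccard overlap of word tokens between two identifiers."""
    a, b = name_tokens(predicted), name_tokens(original)
    return len(a & b) / len(a | b) if a | b else 1.0

# {find, first, fit, container} vs {find, pool}: one shared token out of five.
print(name_similarity("FindFirstFitContainer", "find_pool"))  # -> 0.2
```

So under this particular metric the answer would be "20% correct", which shows both the appeal (it's automatic and continuous) and the weakness (it can't tell that "pool" and "container" are semantically close).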


Proving that two arbitrary pieces of code are equivalent is very hard in general (undecidable, in fact)




