joxeankoret/diaphora

Research ideas

joxeankoret opened this issue · 5 comments

Feel free to put feature requests that require some research here (take a look at the list below to see what I mean):

  • Research Neural Machine Translation to match assembly and pseudo-code across different architectures?
  • Find new ways to filter rows to prevent Cartesian products (ETOOBROAD).
  • Test a simple decision tree classifier to verify matches using an existing dataset (to prevent problems like the one with the Pigaios dataset).
  • Test adding an option to interactively train matches for specific projects.
  • Use Program Dependence Graphs (PDG) for matching?

Hope this makes sense as a feature.
Something I have noticed in pretty much every tool or plugin meant for comparing binaries and importing names is that, for some reason, they all fail to take into consideration the location of a matched function relative to other functions.
This is especially noticeable with ARM ELF binaries, where every single tool has made a mess of the entire database, and adjusting the diffing settings does not change the result.
I am unsure whether this fits as a comment here, but I would like an option to weigh the ordering of functions in the database heavily, as well as an option to not break up a class's location in the database when the naming scheme uses proper C++ mangling, which mine does.
An example of the mangling in my databases: "_ZN5Class8FunctionEP8Variablebfilv", which becomes "Class::Function(Variable*, bool, float, int, long, void)".
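To illustrate the request, here is a minimal sketch of how the class part of a mangled name could be recovered so that functions of the same class can be kept together. It only handles the simple nested-name form `_ZN<len><name>...E` shown in the example above; templates, constructors, substitutions and other Itanium ABI features are deliberately not covered. The helper names `nested_name_parts` and `class_of` are hypothetical and are not part of Diaphora or IDA.

```python
from typing import List, Optional


def nested_name_parts(mangled: str) -> Optional[List[str]]:
    """Split a simple Itanium nested name (_ZN<len><id>...E) into its parts.

    Returns None for anything that is not a plain sequence of
    length-prefixed identifiers (hypothetical helper, simplified on purpose).
    """
    prefix = "_ZN"
    if not mangled.startswith(prefix):
        return None
    pos = len(prefix)
    parts: List[str] = []
    while pos < len(mangled) and mangled[pos] != "E":
        digits_start = pos
        while pos < len(mangled) and mangled[pos].isdigit():
            pos += 1
        if pos == digits_start:
            return None  # not a <length><identifier> component
        length = int(mangled[digits_start:pos])
        name = mangled[pos:pos + length]
        if len(name) != length:
            return None  # truncated name
        parts.append(name)
        pos += length
    if pos >= len(mangled):
        return None  # closing 'E' is missing
    return parts


def class_of(mangled: str) -> Optional[str]:
    """Return the qualifying scope (everything before the function name)."""
    parts = nested_name_parts(mangled)
    if not parts or len(parts) < 2:
        return None
    return "::".join(parts[:-1])


print(class_of("_ZN5Class8FunctionEP8Variablebfilv"))
```

A matcher could use the returned scope as an extra key, for instance by preferring candidates whose scope equals that of the neighbouring matched functions.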

About the relative position of functions, in the currently in-development version of Diaphora I'm using the concept of compilation units to try to work around this problem. In case you are curious, compilation unit boundaries are guessed using the old version of CodeCut's LFA (Local Function Affinity) algorithm, and IDA Magic Strings is also used to extract (when available) the source code file names from debugging strings. Basically, if CodeCut says "there is a compilation unit from this address to this address", and then IDAMagicStrings says "these functions belong to this source file", Diaphora will take the minimum address assigned by either IDA Magic Strings or CodeCut, as well as the maximum address found by either of these two methods, and create a single compilation unit for them. Compilation units are then, later on, used in Diaphora heuristics in various ways to favour matching functions in the same compilation unit instead of matching in random different areas. However, we don't always have enough meaningful information to do this properly.
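The merging step described above can be sketched roughly as follows. This is not Diaphora's actual code: the `CompilationUnit` type, the function names and the example addresses are all made up for illustration, and the sketch assumes each method reports plain integer addresses.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class CompilationUnit:
    start: int
    end: int
    source_file: Optional[str] = None


def merge_compilation_unit(
    lfa_range: Tuple[int, int],
    source_file: Optional[str],
    source_function_addrs: Iterable[int],
) -> CompilationUnit:
    """Combine one LFA-style range with the functions attributed to a source file.

    The unit spans from the lowest to the highest address reported by
    either method (hypothetical helper).
    """
    lfa_start, lfa_end = lfa_range
    addrs = list(source_function_addrs)
    start = min([lfa_start, *addrs])
    end = max([lfa_end, *addrs])
    return CompilationUnit(start, end, source_file)


def unit_for(address: int, units: List[CompilationUnit]) -> Optional[CompilationUnit]:
    """Return the first unit containing the address, if any."""
    for unit in units:
        if unit.start <= address <= unit.end:
            return unit
    return None


# Made-up addresses: the range says 0x1000..0x2000, while the debugging
# strings attribute functions at 0x0F00 and 0x1800 to the same file.
unit = merge_compilation_unit((0x1000, 0x2000), "v3_addr.c", [0x0F00, 0x1800])
print(hex(unit.start), hex(unit.end))
```

A heuristic could then call `unit_for` on both sides of a candidate match and prefer pairs whose units correspond, which is the "favour the same compilation unit" idea in the paragraph above.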

About mangled and unmangled names, Diaphora uses both and, if I remember correctly, it handles both cases properly, but I will put this review task on my to-do list and verify it at some point. The development of Diaphora 3.0 is still going on and it will take me quite some time yet to finish it, as I do it in my spare time (which is fine and fun).

CodeCut's LFA was new to me, but IDA Magic Strings is something I am familiar with. I have used it quite a lot to get an idea of which type libraries to create, or whether I should create a FLIRT signature using FLAIR.

As for IDA Magic Strings determining what functions belong to what source code, is that separate from how it generates Candidate Function names?

The candidates it suggests have been quite hit and miss, especially since on several occasions it presented candidates based on a single .rodata reference. In the example I am looking at now, it suggests "sinherit" (the string in .rodata is "%*sinherit"), giving only a one-word lower-case candidate, while the function also calls a function named "CRYPTO_free" and has one other .rodata xref named "CryptoX509v3V3". This led me to assume the proper name for the class could be CryptoX509v3 and the function SubjectInheritance.
(EDIT: I found the source of the function I looked at by searching for "%*sinherit" on Google; the first result was the directly downloadable source file (ASIdentifierChoice_inherit).)

In the other issue I participated in, about Diaphora freezing during certain heuristics, I mentioned a plugin for creating signatures that I have had good success with. It was not the first one I tried, but it was the only one that produced proper wildcard signatures that stayed consistent not only between databases with a recent version change but also across builds several years apart. In one case the developer used the old ndk-build and did not strip symbols in one version, then switched to the current standard for compiling native C shared libraries on Android and stripped symbols in a later version, and the pattern still matched.
The plugin code mentions adjustments for ARM, which seems to be one reason that "SigMaker" works.
My favorite feature, though, is its ability to generate patterns for xrefs. It makes keeping track of .bss variables a lot easier.
Link to the one I mentioned: IDA-Pro-SigMaker. It works from IDA Pro 7.7 if built using the IDA SDK for 7.7.
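For readers unfamiliar with wildcard signatures, here is a minimal sketch of how such a pattern is matched against raw bytes. It is not taken from SigMaker: the space-separated hex format with `?` for wildcard bytes is assumed here because it is a common convention, and the function names and sample bytes are invented for illustration.

```python
from typing import List, Optional


def parse_signature(sig: str) -> List[Optional[int]]:
    """Turn 'E8 ? ? ? ? 90' into [0xE8, None, None, None, None, 0x90]."""
    return [None if tok in ("?", "??") else int(tok, 16) for tok in sig.split()]


def find_signature(data: bytes, sig: str) -> List[int]:
    """Return every offset in data where the signature matches.

    Wildcard positions (None) match any byte, which lets the pattern
    tolerate bytes that differ between builds, such as encoded offsets.
    """
    pattern = parse_signature(sig)
    if not pattern:
        return []
    hits = []
    for off in range(len(data) - len(pattern) + 1):
        if all(p is None or data[off + i] == p for i, p in enumerate(pattern)):
            hits.append(off)
    return hits


# Invented sample bytes: two occurrences with different wildcarded bytes.
blob = bytes.fromhex("00 e8 11 22 33 44 90 e8 aa bb cc dd 90")
print(find_signature(blob, "E8 ? ? ? ? 90"))
```

A signature is only useful for naming when it matches exactly once in the target database, so a real workflow would typically check that `find_signature` returns a single offset.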

IDA Magic Strings in Diaphora is only used to get potential source file names, not function names. And, yes, it's normal that it fails a lot: it's using debugging strings after all.

  • I realized one thing that might be very useful, especially since IDA Pro users run varying versions of IDA Pro, which can severely impact whether they are able to use Diaphora. I recently had a situation where I finally found a repository containing the perfect IDA Pro plugin for the project I am working on.

  • There was one caveat: it was created targeting IDA Pro 8.2, while I sadly only have access to the old 7.7 (technically I am retired due to chronic illness, so affording a new license won't happen unless I win some kind of lottery). But after working on it for some hours, backporting the code from C++20, which was not really working well with the 7.7 IDA SDK, I had rewritten part of the code and got it to work flawlessly. It was my first time delving into C++ coding, no less.

  • My idea, if it is possible, is to offer the IDA SDK as an option: users could fetch the parts of Diaphora's Python code that are heaviest to run and, using their own IDA SDK, build them as dynamic libraries that let Diaphora call the IDA API and the Hex-Rays decompiler API through those libraries, while Diaphora keeps working the same way at its core. This would open Diaphora up to more users with older versions or lower-end systems.

  • The IDA SDK build templates could be set up as branches or as a linked repo. The community would have to pitch in to look over the code, and users would bring their own IDA SDK and Hex-Rays C++ header. Guidelines would define which parts can be built as libraries and what is required to use them. This could also be extended to any expansion of Diaphora to other platforms and RE software, with each one maintained as a joint community effort.

  • My argument for this approach is mainly that the difference in performance, in how rarely IDA becomes unresponsive, and in overall speed is huge. I hope you will at least consider it. Although I do not have any IDA Pro version other than 7.7, I do have the IDA SDK and Hex-Rays SDK for 7.0, 7.2, 7.3, 7.5, 7.6, 7.7, 8.0 and 8.1, which I stumbled upon in an acquaintance's repository one day. So I can very much try to help get that part going in the first place.