Sequence alignment and binary diffing
KOLANICH opened this issue · 2 comments
My method of black-box reverse engineering of file formats almost always involves generating similar files (i.e. with incrementally added records) and binary diffing them. I don't know any good tools for binary diffing (there are tools for binary patching, but those are different beasts, and there are tools for "binary" diffing where "binary" means an executable format (PE/ELF/Mach-O) filled with machine code of a known arch, so CFGs are matched and diffed).
So I usually do the following:
- `hd` all the files
- split the known stuff with whitespace
- then diff
- then look at the boundaries and try to identify more stuff to be split
- then figure out patterns and encode them in Kaitai
An example of such a preliminary analysis is https://github.com/kaitai-io/kaitai_struct_formats/pull/532/files (LTO was initially analysed the same way).
Unfortunately hexdump misses diffs because the data is fitted into a grid, and the grid is diffed too and interferes.
So we need a binary diffing tool.
Diffing is just a sequence alignment problem, and there exist lots of implementations of sequence alignment (e.g. the Python standard library contains one), so that part is not a problem.
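For illustration, the standard library's `difflib.SequenceMatcher` already aligns arbitrary byte sequences out of the box (a minimal sketch; the sample bytes are made up, not from a real format):

```python
import difflib

# Two hypothetical dumps of the same format; the second has the
# middle record replaced by a shorter one.
a = bytes.fromhex("00010203aabbccdd04050607")
b = bytes.fromhex("00010203eeff04050607")

sm = difflib.SequenceMatcher(None, a, b, autojunk=False)
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    print(tag, a[i1:i2].hex(), "->", b[j1:j2].hex())
```

This reports the shared prefix and suffix as `equal` and the differing middle as `replace`, without any grid artifacts.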
What is needed is
- the mechanism to influence sequence alignment by specifying predefined boundaries
- the mechanism to store this kind of info
- the gui for it
- interoperability with Kaitai
For the "sequence alignment" do you think something like the Header Framer would help? Instead of breaking on patterns, it could probably be modified to break on parsed KSY sections.
I was supposed to add diffing to hobbits a long time ago. I will try to think of ways to make it happen nicely.
> do you think something like the Header Framer would help
Probably, but only as a workaround.
From the diffing plugin I suppose the following workflow should work (though it is just a concept, completely untested; I have never used such an algo, because I have never used sequence alignment libraries with a tunable objective for reverse engineering, only primitive tools like `diff` which are already available in distros):
- A user specifies the constraints, this way incorporating expert knowledge. One of the most important constraints is block size and alignment (granularity): if a block is 4 bytes, then matches can only be a multiple of 4 bytes in size and must be aligned; if a "match" is not aligned, it is not a match. This is needed because some formats have a clear structure where they are composed of words of n bytes. E.g. both the LTO format and the git commit graph format look like they are made of `u4`s (in KS notation; `uint32_t` in C++), so everything that is unaligned is likely a false match. Also, the larger the block, the more efficient the alignment is. This kind of constraint is specific to a frame, so it is available to other plugins too. E.g. the hex editor can draw a visible grid and allow selection of bytes only in blocks of the needed size and alignment (trying to select/hover any byte in a block should act as if the whole block was clicked/hovered).
- A user specifies the areas within the file which are "atomic" - that cannot be separated by operations cutting the binaries (probably would require a "Cutter" abstract base class for plugins). In the example of the git format these are the hashes of commits. It should be possible to create "atoms" via other plugins, such as Kaitai, and via bulk processing. If constraints are violated, it should do something.
- A user selects multiple frames (probably would require a special subsystem within the core, because the selected frames would acquire additional metadata linking them to each other and to the operation; I guess we need a separate issue for such a subsystem).
- A user initiates the operation. The frames get linked to each other and to the operation.
- Then the sequence alignment is done with an objective function heavily penalising intersection of atom boundaries (exponentially in the size of an atom, because the probability of such an atom occurring randomly decreases exponentially, assuming that all the symbols are uniformly distributed?) and also somehow penalising reordering of atoms.
- Common blocks identified by sequence alignment in all the selected frames are added as new atoms, with metadata relating them to the operation. The process is repeated until convergence (not sure if there will be more than 1 iteration, but by adding atoms we change the objective function, so I am not sure that the next iteration must give the same result), i.e. until no new atoms are produced.
- Widgets for the blocks are added. Likely highlights, but we need the metadata and some additional GUI to make it clear they originate from the diff plugin.
- Then a user tries to do something with them. E.g. they can split some "atoms" into multiple "atoms" if they see enough evidence for that.
- Then a user tries to encode the obtained knowledge into Kaitai Struct and YARA (probably we need a YARA plugin too?). It should be possible to easily copy basic stuff like the sizes and offsets of highlights with a right click. Extraction into their own subframes should also be possible. And maybe autogeneration of YARA templates from multiple similar highlighted areas?
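The block-size/alignment constraint from the first bullet can be sketched by aligning fixed-size chunks instead of individual bytes, so matches can only start and end on block boundaries (a concept sketch on top of `difflib`; `aligned_diff` and its signature are my invention, not an existing hobbits API):

```python
import difflib

def aligned_diff(a: bytes, b: bytes, block: int = 4):
    """Diff two buffers at a fixed granularity: matches may only cover
    whole, aligned blocks of `block` bytes, never parts of them."""
    chunks_a = [a[i:i + block] for i in range(0, len(a), block)]
    chunks_b = [b[i:i + block] for i in range(0, len(b), block)]
    sm = difflib.SequenceMatcher(None, chunks_a, chunks_b, autojunk=False)
    # Rescale chunk indices back to byte offsets (clamped in case the
    # last block is partial).
    return [(tag,
             i1 * block, min(i2 * block, len(a)),
             j1 * block, min(j2 * block, len(b)))
            for tag, i1, i2, j1, j2 in sm.get_opcodes()]
```

With `block=4`, an unaligned pseudo-match straddling two words simply cannot be reported, which is exactly the "if a 'match' is not aligned, it is not a match" rule.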
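The exponential atom penalty could be computed in the log domain: under the uniform-symbols assumption a specific k-byte atom occurs by chance with probability 256^-k, so the cost of an alignment that cuts through it grows linearly in k on a log scale (purely illustrative; `cut_cost` and the atom representation are assumptions, not an existing API):

```python
import math

def cut_cost(pos: int, atoms, alphabet: int = 256) -> float:
    """Log-domain penalty for cutting a buffer at byte offset `pos`.

    `atoms` is an iterable of (start, end) half-open spans that must not
    be split. A specific k-byte atom occurs randomly with probability
    alphabet**-k, so the penalty k * log(alphabet) is exponential in k
    before taking the log.
    """
    for start, end in atoms:
        if start < pos < end:  # the cut lands strictly inside the atom
            return (end - start) * math.log(alphabet)
    return 0.0  # cuts on atom boundaries, or outside atoms, are free
```

An aligner would then add this cost to its objective for every cut a candidate alignment implies, making it strongly prefer alignments that keep atoms (e.g. commit hashes) intact.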
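As for autogenerating YARA templates from highlights: the common byte runs could be emitted as hex strings in a rule skeleton (a rough sketch; the function name and output shape are made up, and a real generator would also need wildcards for the varying bytes):

```python
def yara_template(rule_name: str, atoms) -> str:
    """Render common byte runs (`atoms`, an iterable of bytes) as a
    minimal YARA rule skeleton with one hex string per atom."""
    strings = "\n".join(
        f'        $a{i} = {{ {atom.hex(" ")} }}'
        for i, atom in enumerate(atoms)
    )
    return (
        f"rule {rule_name} {{\n"
        f"    strings:\n{strings}\n"
        f"    condition:\n        all of them\n}}\n"
    )

print(yara_template("lto_header", [bytes.fromhex("00010203")]))
```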