survey: binary search string results + api + usecases
m4b opened this issue · 3 comments
Prolegomena
Hullo!
So I've begun adding search functionality, and I already find it very useful. In particular, this is a usecase I find myself having:
- I want to search a binary for a string; this string could be a symbol I'm looking for, or an actual string in the binary, or it could be an import, or it could be referenced by a relocation.
- I want the search results to display these matches in a semantically meaningful manner
There are a number of issues at hand here. It's in the beginning stages, so I'd like to ask for everyone's (anyone's) input about:
- what their common usecase for "grepping" a binary is
- how they'd like to see this information displayed
- how they'd like to present the search to the program
- what they'd expect to be output
Again, there's a lot going on here, so I'll open up with a particular example which addresses the uses I usually have, but I'd really like to know what other people want!
Grepping for a static string
I'm debugging/analyzing a binary. I want to see if "hello world" is somewhere in the binary. So I run:
bingrep merp -s "hello"
I want to know a few things:
- at what offset(s) the string occurred in the binary, if any
- how this offset would be normalized into the virtual address space of the section/program header/whatever
- where, semantically, we could interpret this offset and/or vm address w.r.t. what we know about the binary's sections/segments/etc.
It might look something like this:
Matches for "hello":
0x724
├──PT_LOAD(2) ∈ 0x724
├──.rodata(16) ∈ 0x724
Idx Name Type Flags Offset Addr Size Link Entsize Align
16 .rodata SHT_PROGBITS ALLOC 0x720 0x720 0x12 0x0 0x4
0x1707
├──.strtab(28) ∈ 0x9f
Idx Name Type Flags Offset Addr Size Link Entsize Align
28 .strtab SHT_STRTAB 0x1668 0x0 0x20a 0x0 0x1
This is trying to say that "hello" was found at offset 0x724 in the binary; that it normalizes to 0x724 in the PT_LOAD program header (for ELF) and in the .rodata section; and here is that section header.
Similarly, it was also found in the .strtab section, where it normalizes to offset 0x9f relative to the section's start at 0x1668 (0x1668 + 0x9f = 0x1707).
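The normalization described above could be sketched roughly like this. It is only an illustration under my own assumptions: Region, Normalized and normalize are hypothetical names, not bingrep's actual types, and a real implementation would read the ranges from the parsed program and section headers.

```rust
/// A file-backed range of the binary, e.g. a section or program header.
/// Hypothetical type for illustration; not bingrep's real data model.
#[derive(Debug, Clone, PartialEq)]
pub struct Region {
    pub name: String,
    /// Where the region starts in the file.
    pub file_offset: u64,
    /// Where the region is mapped, if it is mapped at all.
    pub vm_addr: Option<u64>,
    /// Size of the region in bytes.
    pub size: u64,
}

/// Where a raw file offset lands inside one region.
#[derive(Debug, Clone, PartialEq)]
pub struct Normalized {
    pub region: String,
    /// Offset relative to the start of the region.
    pub relative: u64,
    /// Virtual address of the match, for mapped regions only.
    pub vm_addr: Option<u64>,
}

/// Return every region that contains `offset`, with the offset normalized
/// against that region.
pub fn normalize(offset: u64, regions: &[Region]) -> Vec<Normalized> {
    regions
        .iter()
        .filter(|r| offset >= r.file_offset && offset - r.file_offset < r.size)
        .map(|r| {
            let relative = offset - r.file_offset;
            Normalized {
                region: r.name.clone(),
                relative,
                vm_addr: r.vm_addr.map(|base| base + relative),
            }
        })
        .collect()
}
```

With the two section headers from the example, offset 0x724 lands in .rodata at vm address 0x724, and offset 0x1707 lands in .strtab at relative offset 0x9f with no vm address.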
Grepping for a symbol
Similarly, suppose we simply want to know whether puts is called by our binary and, if so, what the details of the symbol are and, if possible, where it is called.
Perhaps using the same api, we search for:
bingrep binary -s puts
and this returns to us a couple of hits, which are semantically quite different:
Dyn Syms(8):
Addr Bind Type Symbol Size Section Other
0 GLOBAL FUNC puts 0x0 0x0
Plt Relocations:
201018 X86_64_JUMP_SLOT puts
Goal
What I'd like in both of these cases, if possible, is a unified api for querying the contents of a binary for a search string, and very importantly:
- an efficient, terse, but understandable presentation of this information
I don't want it to be busy; I want to use similar color coding techniques to highlight the information I need; and I want the output to be semantically relevant, e.g., the search string is matched against symbol names in the symbol table, etc.
Ideally, this is presented finally to the user as some kind of tabular structure, or a summary of a group of tabular structures, each tailored to the semantic content the string matched against, perhaps in different categories, like:
raw string:
- [ offset, vmaddr, phdr ]
- [ offset, vmaddr, shdr ]
symbol:
- dynamic entry
- symtab entry
- debugging entry
- locations
etc., for any number of different kinds of matches and categories.
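One way to picture the categories above is as a small data model that the printer groups before rendering one table per category. This is just a sketch: every name in it (Match, Container, SymbolKind, group_by_category) is made up for illustration and nothing here exists in bingrep.

```rust
use std::collections::BTreeMap;

/// Which header a raw string match was normalized against (index into the
/// program or section header table). Hypothetical type.
#[derive(Debug, Clone, PartialEq)]
pub enum Container {
    ProgramHeader(usize),
    SectionHeader(usize),
}

/// Which table a symbol match came from. Hypothetical type.
#[derive(Debug, Clone, PartialEq)]
pub enum SymbolKind {
    Dynamic,
    Symtab,
    Debug,
    /// A place where the symbol is referenced, e.g. a relocation address.
    Location(u64),
}

/// A single search hit, tagged by the semantic category it belongs to.
#[derive(Debug, Clone, PartialEq)]
pub enum Match {
    RawString { offset: u64, vm_addr: Option<u64>, container: Container },
    Symbol { name: String, kind: SymbolKind },
}

impl Match {
    /// Category label used when grouping output tables.
    pub fn category(&self) -> &'static str {
        match self {
            Match::RawString { .. } => "raw string",
            Match::Symbol { .. } => "symbol",
        }
    }
}

/// Group matches by category so each group can be rendered as its own table.
pub fn group_by_category(matches: &[Match]) -> BTreeMap<&'static str, Vec<&Match>> {
    let mut groups: BTreeMap<&'static str, Vec<&Match>> = BTreeMap::new();
    for m in matches {
        groups.entry(m.category()).or_default().push(m);
    }
    groups
}
```

Keeping the category on the match itself means new kinds of matches only need a new variant and a new table renderer.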
Implementation Details
I'm not a big text search aficionado, so if anyone wants to help with the actual search string api, e.g., regexes, case insensitivity, etc., as well as efficiency concerns, that would be great. I'm all ears, or in the case of PRs, very grateful!
Conclusion
If you have a usecase, or an idea of how to present this information usefully, I'm interested in your feedback.
The master branch right now contains a very, very prototypical implementation invoked via:
bingrep <binary> -s "your string"
It is case sensitive, but also accidentally works with prefixes.
It currently dumps the regular print, then scans the binary and pushes all matches, then normalizes each match against the program and section headers. I've started experimenting with other "semantic" output, and there's definitely a lot of potential, hence this issue :)
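For reference, the "scan the binary and push all matches" step can be as simple as the following. This is a naive sketch and not necessarily what master does; find_all is a name I made up. Because it is a plain substring scan, searching for "hello" also hits inside "hello world", which is presumably the prefix behaviour mentioned above.

```rust
/// Return the file offset of every occurrence of `needle` in `haystack`.
/// Naive O(n * m) scan; overlapping occurrences are all reported.
pub fn find_all(haystack: &[u8], needle: &[u8]) -> Vec<usize> {
    // `windows(0)` panics, so an empty needle is treated as "no matches".
    if needle.is_empty() {
        return Vec::new();
    }
    haystack
        .windows(needle.len())
        .enumerate()
        .filter(|(_, window)| *window == needle)
        .map(|(offset, _)| offset)
        .collect()
}
```

Each returned offset is then what gets normalized against the program and section headers.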
Output is like:
Matches for "hello":
0x724
├──PT_LOAD(2) ∈ 0x724
├──.rodata(16) ∈ 0x724
0x1707
├──.strtab(28) ∈ 0x9f
I think showing the offset of the string by section is a nice feature, and it sets bingrep apart from strings | grep commands.
I'd also like to see searching executable code via hex strings (not any disassembly, that's probably beyond the scope of this project) with wildcards, so that bingrep <binary> -s "AA ?? CC" matches AA 11 CC, AA BB CC, etc.
Just a thought for what could make bingrep even more useful. Oh, and counting matches, versus just finding offsets, would be nice too!
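To make the hex wildcard idea concrete, here is a rough sketch of how such a pattern could be parsed and matched. The pattern syntax (two hex digits or ?? per byte, separated by whitespace) is my reading of the example above, and parse_hex_pattern and find_pattern are hypothetical names, not existing bingrep functions.

```rust
/// Parse a pattern like "AA ?? CC" into one entry per byte:
/// `Some(byte)` for a literal, `None` for a `??` wildcard.
pub fn parse_hex_pattern(pattern: &str) -> Result<Vec<Option<u8>>, String> {
    pattern
        .split_whitespace()
        .map(|tok| match tok {
            "??" => Ok(None),
            _ if tok.len() == 2 && tok.bytes().all(|b| b.is_ascii_hexdigit()) => {
                u8::from_str_radix(tok, 16)
                    .map(Some)
                    .map_err(|e| e.to_string())
            }
            _ => Err(format!("expected two hex digits or ??, got {:?}", tok)),
        })
        .collect()
}

/// Return the offset of every position where `pattern` matches `haystack`.
/// The number of matches is simply the length of the returned vector.
pub fn find_pattern(haystack: &[u8], pattern: &[Option<u8>]) -> Vec<usize> {
    // `windows(0)` panics, so an empty pattern matches nothing.
    if pattern.is_empty() {
        return Vec::new();
    }
    haystack
        .windows(pattern.len())
        .enumerate()
        .filter(|(_, window)| {
            window
                .iter()
                .zip(pattern)
                .all(|(byte, want)| want.map_or(true, |w| w == *byte))
        })
        .map(|(offset, _)| offset)
        .collect()
}
```

Counting matches falls out for free, since it is just the number of offsets returned.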
It would also be nice if the search option displayed the context around a match.
This would be useful for finding spe
(e.g. the full displayable string containing the search string):
Searching for "chr" would show 'strrchr' and 'strchr' as matches:
$ bingrep -s 'chr' /bin/ls
Matches for "chr":
0x1045
├──PT_LOAD(2) ∈ 0x401045
├──.dynstr(6) ∈ 0x401045
0x12fa
├──PT_LOAD(2) ∈ 0x4012fa
├──.dynstr(6) ∈ 0x4012fa
# First match is strrchr:
$ hexdump -C -s $((0x1045 - 25)) -n 50 /bin/ls
0000102c 72 74 6f 77 63 00 73 74 72 6e 63 6d 70 00 6f 70 |rtowc.strncmp.op|
0000103c 74 69 6e 64 00 73 74 72 72 63 68 72 00 66 66 6c |tind.strrchr.ffl|
0000104c 75 73 68 5f 75 6e 6c 6f 63 6b 65 64 00 64 63 67 |ush_unlocked.dcg|
0000105c 65 74 |et|
0000105e
# Second match is strchr:
$ hexdump -C -s $((0x12fa - 25)) -n 50 /bin/ls
000012e1 00 5f 5f 66 70 65 6e 64 69 6e 67 00 6c 6f 63 61 |.__fpending.loca|
000012f1 6c 74 69 6d 65 00 73 74 72 63 68 72 00 69 73 77 |ltime.strchr.isw|
00001301 63 6e 74 72 6c 00 6d 6b 74 69 6d 65 00 70 72 6f |cntrl.mktime.pro|
00001311 67 72 |gr|
00001313
$ bingrep /bin/ls | grep chr
0 GLOBAL FUNC strchr 0x0 0x0
0 GLOBAL FUNC strrchr 0x0 0x0
61a130 X86_64_JUMP_SLOT strchr
61a150 X86_64_JUMP_SLOT strrchr
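The context lookup done by hand with hexdump above could be automated roughly as follows: starting from a match, walk outwards until a non-printable byte (such as the NUL terminator) is hit on each side. This is a sketch under my own assumptions; surrounding_string is a hypothetical helper, and "printable" here just means ASCII graphic characters plus space.

```rust
/// Printable for the purpose of this sketch: visible ASCII or a space.
fn is_printable(b: u8) -> bool {
    b.is_ascii_graphic() || b == b' '
}

/// Given a match of `len` bytes at `offset`, return the start offset and the
/// text of the full printable string that contains it. Returns `None` if the
/// match is out of bounds, empty, or not itself printable.
pub fn surrounding_string(data: &[u8], offset: usize, len: usize) -> Option<(usize, String)> {
    let match_end = offset.checked_add(len)?;
    if len == 0 || match_end > data.len() {
        return None;
    }
    if !data[offset..match_end].iter().all(|&b| is_printable(b)) {
        return None;
    }
    // Walk left to just after the nearest non-printable byte.
    let start = data[..offset]
        .iter()
        .rposition(|&b| !is_printable(b))
        .map_or(0, |i| i + 1);
    // Walk right to the nearest non-printable byte.
    let end = data[match_end..]
        .iter()
        .position(|&b| !is_printable(b))
        .map_or(data.len(), |i| match_end + i);
    Some((start, String::from_utf8_lossy(&data[start..end]).into_owned()))
}
```

Applied to the .dynstr bytes in the first hexdump, a "chr" match would come back as the whole entry strrchr together with the offset where that entry begins.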
When grepping for a word in a long string, it would be useful to get the full line of text.
$ strings /bin/ls | grep sort
sort_files
sort_type != sort_version
--sort
Sort entries alphabetically if none of -cftuvSUX nor --sort is specified.
-c with -lt: sort by, and show, ctime (time of last
with -l: show ctime and sort by name
otherwise: sort by ctime, newest first
-f do not sort, enable -aU, disable -ls --color
augment with a --sort option, but any
use of --sort=none (-U) disables grouping
-r, --reverse reverse order while sorting
-S sort by file size
--sort=WORD sort by WORD instead of name: none -U,
or status -c; use specified time as sort key
if --sort=time
-t sort by modification time, newest first
-u with -lt: sort by, and show, access time
with -l: show access time and sort by name
otherwise: sort by access time
-U do not sort; list entries in directory order
-v natural sort of (version) numbers within text
-X sort alphabetically by entry extension
$ bingrep -s 'sort' /bin/ls
Matches for "sort":
0x12c95
├──PT_LOAD(2) ∈ 0x412c95
├──.rodata(15) ∈ 0x412c95
0x1373f
├──PT_LOAD(2) ∈ 0x41373f
├──.rodata(15) ∈ 0x41373f
0x1374c
├──PT_LOAD(2) ∈ 0x41374c
├──.rodata(15) ∈ 0x41374c
0x1387e
├──PT_LOAD(2) ∈ 0x41387e
├──.rodata(15) ∈ 0x41387e
0x13e2c
├──PT_LOAD(2) ∈ 0x413e2c
├──.rodata(15) ∈ 0x413e2c
0x140ed
├──PT_LOAD(2) ∈ 0x4140ed
├──.rodata(15) ∈ 0x4140ed
0x14193
├──PT_LOAD(2) ∈ 0x414193
├──.rodata(15) ∈ 0x414193
0x141ca
├──PT_LOAD(2) ∈ 0x4141ca
├──.rodata(15) ∈ 0x4141ca
0x143bc
├──PT_LOAD(2) ∈ 0x4143bc
├──.rodata(15) ∈ 0x4143bc
0x1460d
├──PT_LOAD(2) ∈ 0x41460d
├──.rodata(15) ∈ 0x41460d
0x1464a
├──PT_LOAD(2) ∈ 0x41464a
├──.rodata(15) ∈ 0x41464a
0x14f89
├──PT_LOAD(2) ∈ 0x414f89
├──.rodata(15) ∈ 0x414f89
0x1503d
├──PT_LOAD(2) ∈ 0x41503d
├──.rodata(15) ∈ 0x41503d
0x15057
├──PT_LOAD(2) ∈ 0x415057
├──.rodata(15) ∈ 0x415057
0x1506c
├──PT_LOAD(2) ∈ 0x41506c
├──.rodata(15) ∈ 0x41506c
0x151b6
├──PT_LOAD(2) ∈ 0x4151b6
├──.rodata(15) ∈ 0x4151b6
0x151e1
├──PT_LOAD(2) ∈ 0x4151e1
├──.rodata(15) ∈ 0x4151e1
0x1540d
├──PT_LOAD(2) ∈ 0x41540d
├──.rodata(15) ∈ 0x41540d
0x154a7
├──PT_LOAD(2) ∈ 0x4154a7
├──.rodata(15) ∈ 0x4154a7
0x15503
├──PT_LOAD(2) ∈ 0x415503
├──.rodata(15) ∈ 0x415503
0x1553a
├──PT_LOAD(2) ∈ 0x41553a
├──.rodata(15) ∈ 0x41553a
0x15572
├──PT_LOAD(2) ∈ 0x415572
├──.rodata(15) ∈ 0x415572
0x155bd
├──PT_LOAD(2) ∈ 0x4155bd
├──.rodata(15) ∈ 0x4155bd
0x15698
├──PT_LOAD(2) ∈ 0x415698
├──.rodata(15) ∈ 0x415698
I had a usecase for the context-displaying search option recently.
A tool I was trying to use had hard-coded paths (relative to the HOME dir), but the string in question also appeared a lot in normal messages, text and symbol names.
As I wanted to change this hard-coded path to a different one (so that two versions of the program would not conflict with each other), I had to use trial and error to find the correct string to replace (strings program | grep word does not give the offsets needed to easily do a specific replacement).
By using the regex crate, it should be possible to do case insensitive matching and searching for hex strings:
https://doc.rust-lang.org/regex/regex/index.html
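For the case insensitive part alone, a plain standard-library scan would already do, without pulling in the regex crate. The sketch below is that simpler alternative, not a regex-based implementation; it only folds ASCII case, and find_all_ignore_ascii_case is a made-up name.

```rust
/// Return the offset of every occurrence of `needle` in `haystack`,
/// ignoring ASCII case (non-ASCII bytes must match exactly).
pub fn find_all_ignore_ascii_case(haystack: &[u8], needle: &[u8]) -> Vec<usize> {
    // `windows(0)` panics, so an empty needle is treated as "no matches".
    if needle.is_empty() {
        return Vec::new();
    }
    haystack
        .windows(needle.len())
        .enumerate()
        .filter(|(_, window)| window.eq_ignore_ascii_case(needle))
        .map(|(offset, _)| offset)
        .collect()
}
```

Anything beyond this, such as real patterns or alternation, is where the regex crate would earn its keep.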