survey: binary search string results + api + usecases

Question

m4b opened this issue 7 years ago · 3 comments

Prolegomena

Hullo!

So I've begun adding search functionality, and I already find it very useful. In particular, this is a usecase I find myself having:

I want to search a binary for a string; this string could be a symbol I'm looking for, or an actual string in the binary, or it could be an import, or it could be referenced by a relocation.
I want the search results to display these matches in a semantically meaningful manner

There are a number of issues at hand here. It's in the beginning stages, so I'd like to ask for everyone's (anyone's) input about:

what their common usecase for "grepping" a binary is
how they'd like to see this information displayed
how they'd like to present the search to the program
what they'd expect to be output

Again, there's a lot going on here, so I'll open up with a particular example which addresses the uses I usually have, but I'd really like to know what other people want!

Grepping for a static string

I'm debugging/analyzing a binary. I want to see if "hello world" is somewhere in the binary. So I run:

bingrep merp -s "hello"

I want to know a few things:

at what offset(s) the string occurred in the binary, if any
how this offset would be normalized into the virtual address space of the section/program header/whatever
where, semantically, could we interpret this offset and/or vm address w.r.t. what we know about the binary's sections/segments/etc.?

It might look something like this:

Matches for "hello":
  0x724
  ├──PT_LOAD(2) ∈ 0x724
  ├──.rodata(16) ∈ 0x724
  Idx   Name              Type   Flags    Offset   Addr     Size    Link   Entsize   Align  
  16    .rodata   SHT_PROGBITS   ALLOC    0x720    0x720    0x12           0x0       0x4    
  0x1707
  ├──.strtab(28) ∈ 0x9f
  Idx   Name            Type   Flags   Offset    Addr   Size     Link   Entsize   Align  
  28    .strtab   SHT_STRTAB           0x1668    0x0    0x20a           0x0       0x1

Which is trying to say that hello was found at offset 0x724 in the binary; it normalizes to 0x724 in the PT_LOAD program header (for ELF) and in the .rodata section, whose section header is printed right below it.

Similarly, it was also found at offset 0x1707, inside the .strtab section, which normalizes to an offset of 0x9f from the section start at 0x1668.
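
The normalization described above can be sketched in a few lines of Rust. This is only an illustration: `Section`, `Hit`, and `normalize` are hypothetical names, not bingrep's actual types, and the `alloc` field is an assumed stand-in for the ALLOC flag shown in the table.

```rust
/// Hypothetical, minimal stand-in for a section header (not bingrep's type).
#[derive(Debug, Clone, PartialEq)]
pub struct Section {
    pub name: &'static str,
    pub offset: u64, // file offset where the section starts
    pub addr: u64,   // virtual address of the section start
    pub size: u64,   // size in bytes
    pub alloc: bool, // assumed to mirror the ALLOC flag: mapped at runtime or not
}

/// Where a file offset lands inside one section.
#[derive(Debug, PartialEq)]
pub struct Hit<'a> {
    pub section: &'a Section,
    pub relative: u64,       // offset from the start of the section
    pub vmaddr: Option<u64>, // None when the section is not mapped
}

/// Return every section that contains `file_offset`, with the offset
/// expressed relative to that section and, if mapped, as a vm address.
pub fn normalize<'a>(sections: &'a [Section], file_offset: u64) -> Vec<Hit<'a>> {
    sections
        .iter()
        .filter(|s| file_offset >= s.offset && file_offset - s.offset < s.size)
        .map(|s| {
            let relative = file_offset - s.offset;
            Hit {
                section: s,
                relative,
                vmaddr: if s.alloc { Some(s.addr + relative) } else { None },
            }
        })
        .collect()
}
```

With the two section headers from the example, offset 0x724 would land in .rodata at vm address 0x724, and offset 0x1707 would land in .strtab at relative offset 0x9f with no vm address.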

Grepping for a symbol

Similarly, suppose we're looking simply for whether puts is called by our binary, and if so, what are the details of the symbol, and if possible, where is it called.

Perhaps using the same api, we search for:

bingrep binary -s puts

and this returns to us a couple of hits, which are semantically quite different:

Dyn Syms(8):
               Addr   Bind       Type        Symbol                        Size   Section     Other  
                 0    GLOBAL     FUNC        puts                          0x0                0x0    
Plt Relocations:
          201018 X86_64_JUMP_SLOT puts

Goal

What I'd like in both of these cases, if possible, is a unified api for querying the contents of a binary for a search string, and very importantly:

an efficient, terse, but understandable presentation of this information

I don't want it to be busy; I want with similar color coding techniques to highlight the information I need; and I want the output to be semantically relevant, e.g., the search string is used against symbol names in the symbol table, etc.

Ideally, this is presented finally to the user as some kind of tabular structure, or a summary of a group of tabular structures, each tailored to the semantic content the string matched against, perhaps in different categories, like:

raw string:
 - [ offset, vmaddr, phdr ]
 - [ offset, vmaddr, shdr ]

symbol:
 - dynamic entry
 - symtab entry
 - debugging entry
 - locations

etc., for any various number of different kinds of matches, and categories.
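
One way to model these categories is a single result enum that the printer can group on. The sketch below is an assumption about shape only; `Match`, `Container`, `SymbolSource`, and `group_by_category` are made-up names and nothing here reflects bingrep's real internals.

```rust
use std::collections::BTreeMap;

/// What a raw-string hit was normalized against (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub enum Container {
    ProgramHeader { index: usize },
    Section { index: usize, name: String },
}

/// Which table a symbol hit came from (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub enum SymbolSource {
    Dynamic,
    Symtab,
    Debug,
}

/// One search result, tagged by the semantic category it matched in.
#[derive(Debug, Clone, PartialEq)]
pub enum Match {
    RawString { offset: u64, vmaddr: Option<u64>, container: Container },
    Symbol { name: String, source: SymbolSource, locations: Vec<u64> },
}

impl Match {
    /// Category label used as the heading of each output table.
    pub fn category(&self) -> &'static str {
        match self {
            Match::RawString { .. } => "raw string",
            Match::Symbol { .. } => "symbol",
        }
    }
}

/// Group results by category so each group can be printed as its own table.
pub fn group_by_category(matches: &[Match]) -> BTreeMap<&'static str, Vec<&Match>> {
    let mut groups: BTreeMap<&'static str, Vec<&Match>> = BTreeMap::new();
    for m in matches {
        groups.entry(m.category()).or_default().push(m);
    }
    groups
}
```

New kinds of matches (relocations, imports, and so on) would then be new variants, and the compiler flags every place that has to learn how to print them.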

Implementation Details

I'm not a big text search aficionado, so if anyone wants to help with the actual search string api, e.g., regexes, case insensitivity, etc., as well as efficiency concerns, that would be great. I'm all ears, and in the case of PRs, very grateful!

Conclusion

If you have a usecase, or an idea of how to present this information usefully, I'm interested in your feedback.

The master branch right now contains a very, very prototypical implementation invoked via:

bingrep <binary> -s "your string"

It is case sensitive, but also accidentally works with prefixes.

It currently dumps the regular print, then scans the binary and pushes all matches, then normalizes each match offset against the program and section headers. I've started experimenting with other "semantic" output, and there's definitely a lot of potential, hence this issue :)

Output is like:

Matches for "hello":
  0x724
  ├──PT_LOAD(2) ∈ 0x724
  ├──.rodata(16) ∈ 0x724
  0x1707
  ├──.strtab(28) ∈ 0x9f
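
For reference, the scan step described above (walk the binary, push the offset of every match) could be sketched as a naive byte-window search. `find_all` is a hypothetical name and this is not the code on master, just a minimal case-sensitive version of the idea.

```rust
/// Return the file offset of every occurrence of `needle` in `haystack`.
/// Case sensitive, and overlapping occurrences are all reported.
pub fn find_all(haystack: &[u8], needle: &[u8]) -> Vec<usize> {
    // `windows(0)` panics, so an empty query simply yields no matches.
    if needle.is_empty() {
        return Vec::new();
    }
    haystack
        .windows(needle.len())
        .enumerate()
        .filter(|(_, window)| *window == needle)
        .map(|(offset, _)| offset)
        .collect()
}
```

A plain substring scan like this matches anywhere in the file, so a query that is only the prefix of a longer string (hello inside "hello world") still produces a hit.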

Answer 1 · 2017-07-17T21:20:42.000Z

I think showing the offset of the string by section is a nice feature and sets it apart from strings | grep commands.

I'd also like to see searching executable code via hex strings (not any disassembly, that's probably beyond the scope of this project) with wildcards, so that bingrep <binary> -s "AA ?? CC" matches AA 11 CC, AA BB CC, etc.

Just a thought for what could make bingrep even more useful. Oh, and counting matches, rather than just finding offsets, would be nice too!
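
A rough sketch of how such a wildcard hex search might work, assuming the "AA ?? CC" syntax suggested above (two hex digits per byte, ?? for any byte, separated by whitespace). `PatByte`, `parse_hex_pattern`, and `find_pattern` are hypothetical names; none of this exists in bingrep today.

```rust
/// One position in a hex pattern: a fixed byte or a wildcard.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PatByte {
    Exact(u8),
    Any,
}

/// Parse a pattern such as "AA ?? CC" into a list of pattern bytes.
pub fn parse_hex_pattern(pattern: &str) -> Result<Vec<PatByte>, String> {
    pattern
        .split_whitespace()
        .map(|tok| {
            if tok == "??" {
                Ok(PatByte::Any)
            } else if tok.len() == 2 && tok.bytes().all(|b| b.is_ascii_hexdigit()) {
                u8::from_str_radix(tok, 16)
                    .map(PatByte::Exact)
                    .map_err(|e| format!("bad hex byte {:?}: {}", tok, e))
            } else {
                Err(format!("expected two hex digits or ??, got {:?}", tok))
            }
        })
        .collect()
}

/// Return the offset of every window of `haystack` that fits `pattern`.
pub fn find_pattern(haystack: &[u8], pattern: &[PatByte]) -> Vec<usize> {
    if pattern.is_empty() {
        return Vec::new();
    }
    haystack
        .windows(pattern.len())
        .enumerate()
        .filter(|(_, window)| {
            window.iter().zip(pattern).all(|(byte, pat)| match pat {
                PatByte::Exact(expected) => byte == expected,
                PatByte::Any => true,
            })
        })
        .map(|(offset, _)| offset)
        .collect()
}
```

Counting matches then falls out for free as the length of the returned vector.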

Answer 2 · 2018-03-15T14:03:54.000Z

What would be nice is if the search option also displayed the context around a match.
This would be useful for finding spe
(e.g. full displayable string for the search string:

Searching for "chr" would show 'strrchr' and 'strchr' as match:

$ bingrep -s 'chr' /bin/ls

Matches for "chr":
  0x1045
  ├──PT_LOAD(2) ∈ 0x401045
  ├──.dynstr(6) ∈ 0x401045
  0x12fa
  ├──PT_LOAD(2) ∈ 0x4012fa
  ├──.dynstr(6) ∈ 0x4012fa

# First match is strrchr:
$ hexdump -C -s $((0x1045 - 25)) -n 50 /bin/ls
0000102c  72 74 6f 77 63 00 73 74  72 6e 63 6d 70 00 6f 70  |rtowc.strncmp.op|
0000103c  74 69 6e 64 00 73 74 72  72 63 68 72 00 66 66 6c  |tind.strrchr.ffl|
0000104c  75 73 68 5f 75 6e 6c 6f  63 6b 65 64 00 64 63 67  |ush_unlocked.dcg|
0000105c  65 74                                             |et|
0000105e

# Second match is strchr:
$ hexdump -C -s $((0x12fa - 25)) -n 50 /bin/ls
000012e1  00 5f 5f 66 70 65 6e 64  69 6e 67 00 6c 6f 63 61  |.__fpending.loca|
000012f1  6c 74 69 6d 65 00 73 74  72 63 68 72 00 69 73 77  |ltime.strchr.isw|
00001301  63 6e 74 72 6c 00 6d 6b  74 69 6d 65 00 70 72 6f  |cntrl.mktime.pro|
00001311  67 72                                             |gr|
00001313

$ bingrep /bin/ls | grep chr 
                 0    GLOBAL     FUNC        strchr                          0x0                0x0    
                 0    GLOBAL     FUNC        strrchr                         0x0                0x0    
          61a130 X86_64_JUMP_SLOT strchr
          61a150 X86_64_JUMP_SLOT strrchr
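
The lookup done by hand with hexdump above could be automated by growing each match outwards until a non-printable byte (such as the NUL terminator) is hit on either side. A minimal sketch, with `enclosing_string` as a made-up name and "printable" assumed to mean ASCII graphic characters plus space:

```rust
/// Assumed definition of "displayable": ASCII graphic characters and space.
fn is_printable(b: u8) -> bool {
    b.is_ascii_graphic() || b == b' '
}

/// Given a match at `offset` of length `len`, return the start offset and
/// text of the full printable string surrounding it, if there is one.
pub fn enclosing_string(data: &[u8], offset: usize, len: usize) -> Option<(usize, &str)> {
    let match_end = offset.checked_add(len)?;
    if len == 0 || match_end > data.len() {
        return None;
    }
    if !data[offset..match_end].iter().all(|&b| is_printable(b)) {
        return None;
    }
    // Walk left to just after the nearest non-printable byte.
    let start = data[..offset]
        .iter()
        .rposition(|&b| !is_printable(b))
        .map_or(0, |i| i + 1);
    // Walk right to the nearest non-printable byte (or end of data).
    let end = data[match_end..]
        .iter()
        .position(|&b| !is_printable(b))
        .map_or(data.len(), |i| match_end + i);
    std::str::from_utf8(&data[start..end]).ok().map(|s| (start, s))
}
```

On the bytes from the first hexdump, a match for chr inside `optind\0strrchr\0fflush_unlocked\0` would expand to the whole strrchr entry.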

In case of grepping for a word in a long string, it would be useful to get the full line of text.

$ strings /bin/ls | grep sort
sort_files
sort_type != sort_version
--sort
Sort entries alphabetically if none of -cftuvSUX nor --sort is specified.
  -c                         with -lt: sort by, and show, ctime (time of last
                               with -l: show ctime and sort by name
                               otherwise: sort by ctime, newest first
  -f                         do not sort, enable -aU, disable -ls --color
                               augment with a --sort option, but any
                               use of --sort=none (-U) disables grouping
  -r, --reverse              reverse order while sorting
  -S                         sort by file size
      --sort=WORD            sort by WORD instead of name: none -U,
                             or status -c; use specified time as sort key
                             if --sort=time
  -t                         sort by modification time, newest first
  -u                         with -lt: sort by, and show, access time
                               with -l: show access time and sort by name
                               otherwise: sort by access time
  -U                         do not sort; list entries in directory order
  -v                         natural sort of (version) numbers within text
  -X                         sort alphabetically by entry extension
$ bingrep -s 'sort' /bin/ls

Matches for "sort":
  0x12c95
  ├──PT_LOAD(2) ∈ 0x412c95
  ├──.rodata(15) ∈ 0x412c95
  0x1373f
  ├──PT_LOAD(2) ∈ 0x41373f
  ├──.rodata(15) ∈ 0x41373f
  0x1374c
  ├──PT_LOAD(2) ∈ 0x41374c
  ├──.rodata(15) ∈ 0x41374c
  0x1387e
  ├──PT_LOAD(2) ∈ 0x41387e
  ├──.rodata(15) ∈ 0x41387e
  0x13e2c
  ├──PT_LOAD(2) ∈ 0x413e2c
  ├──.rodata(15) ∈ 0x413e2c
  0x140ed
  ├──PT_LOAD(2) ∈ 0x4140ed
  ├──.rodata(15) ∈ 0x4140ed
  0x14193
  ├──PT_LOAD(2) ∈ 0x414193
  ├──.rodata(15) ∈ 0x414193
  0x141ca
  ├──PT_LOAD(2) ∈ 0x4141ca
  ├──.rodata(15) ∈ 0x4141ca
  0x143bc
  ├──PT_LOAD(2) ∈ 0x4143bc
  ├──.rodata(15) ∈ 0x4143bc
  0x1460d
  ├──PT_LOAD(2) ∈ 0x41460d
  ├──.rodata(15) ∈ 0x41460d
  0x1464a
  ├──PT_LOAD(2) ∈ 0x41464a
  ├──.rodata(15) ∈ 0x41464a
  0x14f89
  ├──PT_LOAD(2) ∈ 0x414f89
  ├──.rodata(15) ∈ 0x414f89
  0x1503d
  ├──PT_LOAD(2) ∈ 0x41503d
  ├──.rodata(15) ∈ 0x41503d
  0x15057
  ├──PT_LOAD(2) ∈ 0x415057
  ├──.rodata(15) ∈ 0x415057
  0x1506c
  ├──PT_LOAD(2) ∈ 0x41506c
  ├──.rodata(15) ∈ 0x41506c
  0x151b6
  ├──PT_LOAD(2) ∈ 0x4151b6
  ├──.rodata(15) ∈ 0x4151b6
  0x151e1
  ├──PT_LOAD(2) ∈ 0x4151e1
  ├──.rodata(15) ∈ 0x4151e1
  0x1540d
  ├──PT_LOAD(2) ∈ 0x41540d
  ├──.rodata(15) ∈ 0x41540d
  0x154a7
  ├──PT_LOAD(2) ∈ 0x4154a7
  ├──.rodata(15) ∈ 0x4154a7
  0x15503
  ├──PT_LOAD(2) ∈ 0x415503
  ├──.rodata(15) ∈ 0x415503
  0x1553a
  ├──PT_LOAD(2) ∈ 0x41553a
  ├──.rodata(15) ∈ 0x41553a
  0x15572
  ├──PT_LOAD(2) ∈ 0x415572
  ├──.rodata(15) ∈ 0x415572
  0x155bd
  ├──PT_LOAD(2) ∈ 0x4155bd
  ├──.rodata(15) ∈ 0x4155bd
  0x15698
  ├──PT_LOAD(2) ∈ 0x415698
  ├──.rodata(15) ∈ 0x415698

I had a usecase for the context displaying search option recently.
A tool I was trying to use used hard coded paths (relative to the HOME dir), but this string also appeared a lot in normal messages and text and symbol names.
As I wanted to change this hard coded path to a different path (to keep two versions of the program from conflicting with each other), I had to use trial and error to find the correct string to replace (strings program | grep word does not give offsets, so it does not help with a specific replacement).

By using the regex crate, it should be possible to do case insensitive matching and searching for hex strings:
https://doc.rust-lang.org/regex/regex/index.html
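
The regex crate would be the fuller route (its byte-oriented regexes and the (?i) flag cover this). As a dependency-free illustration of just the case-insensitive part, here is a std-only sketch; it swaps in a plain ASCII comparison instead of a regex, and `find_all_ignore_ascii_case` is a hypothetical name.

```rust
/// Return the offset of every occurrence of `needle` in `haystack`,
/// ignoring ASCII case. Non-ASCII bytes must match exactly.
pub fn find_all_ignore_ascii_case(haystack: &[u8], needle: &[u8]) -> Vec<usize> {
    // `windows(0)` panics, so an empty query simply yields no matches.
    if needle.is_empty() {
        return Vec::new();
    }
    haystack
        .windows(needle.len())
        .enumerate()
        .filter(|(_, window)| window.eq_ignore_ascii_case(needle))
        .map(|(offset, _)| offset)
        .collect()
}
```

Searching for sort this way would also report Sort and SORT, each with its own offset.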

Answer 3 · 2018-04-14T18:03:43.000Z

Hi @ghuls, sorry for the delay! I like your idea of using the regex crate for better searching. Unfortunately I won't have the time to implement this, but if you felt inclined to submit a PR implementing this functionality I would be very likely to merge it.