results: define output format/schema
to store and exchange results we'll need a new output schema, likely json
the UI will render this data (or parts of it as they become available, although this should be quick)
again, likely an array of objects (combining all other keys from the databases?) should work here
@mr-tz I have previously worked on adding a new format for parsing output JSON back into capa [PR_#1396]. Can I look into this?
Sounds great, please take a look and let's discuss if you have any questions or a design draft.
(combining all other keys from the databases?) should work here
Could you please shed some light on this one?
QS uses a bunch of embedded databases to provide context about strings. Things like prevalence, library, version, etc. So all the information from each database should be merged into records about each recovered string.
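To make the merging idea concrete, here is a minimal sketch, not the actual QS code. The `merge_records` helper, the lookup callables, and the sample database contents are all hypothetical and only illustrate how per-database hits could be folded into one record per string.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stand-in for an embedded database: a callable that maps a
# string to whatever context it knows about it (empty dict if unknown).
Lookup = Callable[[str], Dict[str, Any]]


def merge_records(strings: List[str], lookups: Dict[str, Lookup]) -> List[Dict[str, Any]]:
    """Build one record per string, keyed by the database that contributed."""
    records = []
    for s in strings:
        record: Dict[str, Any] = {"string": s}
        for name, lookup in lookups.items():
            hit = lookup(s)
            if hit:  # only store databases that actually know the string
                record[name] = hit
        records.append(record)
    return records


# Made-up sample data, for illustration only.
WINAPI = {"VirtualQuery": {"tags": ["#winapi"]}}
PREVALENCE = {"VirtualQuery": {"global_count": 1000}}

lookups = {
    "winapi": lambda s: WINAPI.get(s, {}),
    "prevalence": lambda s: PREVALENCE.get(s, {}),
}

records = merge_records(["VirtualQuery", "unique"], lookups)
```

A string that no database knows about simply ends up as a record with only its `string` key.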
@williballenthin @mr-tz for further discussion and inputs, I have created a new PR #972 :)
Hi @ooprathamm, pulling the discussion to this issue.
On a higher design level we'll have to see how we want to deal with structure vs. tagged strings vs. other functionality. Ideally, we can decouple the storage and logic a bit. The current POC implementation is quite elegant but IMO combines multiple features, which potentially complicates further work. On the other hand, we may keep the extraction logic and just change the resulting document.
In my head I currently have something like (based on some of your work, here, thanks!):
{
  "strings": {
    "static_strings": [
      {
        "string": {
          "encoding": "ascii",
          "slice": {
            "range": {
              "length": 40,
              "offset": 77
            }
          },
          "string": "!This program cannot be run in DOS mode."
        },
        "structure": "pe.header",
        "tags": [
          "#common"
        ]
      },
      {
        "string": {
          "encoding": "ascii",
          "slice": {
            "range": {
              "length": 12,
              "offset": 11644
            }
          },
          "string": "VirtualQuery"
        },
        "structure": "import table",
        "tags": [
          "#winapi",
          "#common"
        ]
      }
    ]
  }
}
And/or we add a meta section storing the optional layout (PE, ELF) of a file.
This may require further discussion and be a larger effort but I'd be curious to hear your thoughts.
Thanks for re-sparking this discussion @mr-tz.
I think things like location, length, encoding, and content of the string are part of the definition of the (static) string and should be at the top level, or under .string exactly as @mr-tz proposes.
Other information, like: structure, tags, and prevalence are more like "context" - things we assess about the string beyond its definition. I suspect each database/algorithm can provide its own context and we haven't explored all of them yet. So maybe all this context gets grouped together in an extensible way.
File layout seems orthogonal to (static) strings and probably should be stored separately from the strings. A presentation layer could stitch together all the data and make it look pretty.
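One way to picture the definition/context split is the rough sketch below, using only the standard library. The `StaticString` class and its `context` field are hypothetical names, not part of any existing code. The definition fields are fixed, while context is an open dictionary that each database or algorithm could add to.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class StaticString:
    # Definition of the string: what it is and where it lives in the file.
    string: str
    offset: int
    length: int
    encoding: str
    # Context: assessments about the string. Kept as an open mapping so new
    # databases/algorithms can add keys without changing the schema.
    context: Dict[str, Any] = field(default_factory=dict)


s = StaticString(string="VirtualQuery", offset=11644, length=12, encoding="ascii")
s.context["structure"] = "import table"
s.context["tags"] = ["#winapi", "#common"]

# Serialize to JSON for storage or exchange.
doc = json.dumps(asdict(s), sort_keys=True)
```

File layout is deliberately left out of this class, so a presentation layer would combine it with the strings later.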
Thanks for the review @mr-tz @williballenthin
I agree the current POC restricts further work. Thanks for providing a view on the desired output structure.
I appreciate your detailed explanation. I agree that decoupling the storage and logic could give us a basis for adding advanced features without the overcomplication seen in FLOSS.
Given the points you've raised, I'm eager to incorporate your suggestions into the pull request.
location, length, encoding, and content of the string are part of the definition of the (static) string: top level or under .string
structure, tags, and prevalence are more like "context" - grouped together in an extensible way.
File layout should be stored separately from the strings.
An alternative representation could then look like this:
{
  "strings": {
    "static_strings": [
      {
        "id": 1,
        "encoding": "ascii",
        "offset": 77,
        "length": 40,
        "string": "!This program cannot be run in DOS mode."
      },
      {
        "id": 1337,
        "encoding": "ascii",
        "offset": 11644,
        "length": 12,
        "string": "VirtualQuery"
      },
      {
        "id": 9999,
        "encoding": "ascii",
        "offset": 123456,
        "length": 6,
        "string": "unique"
      }
    ],
    "context": {
      "1": {
        "structure": "pe.header",
        "tags": [
          "#common"
        ]
      },
      "1337": {
        "structure": "import table",
        "tags": [
          "#winapi",
          "#common"
        ]
      }
      # no 9999 entry
    }
  },
  "file_layout": {
    ...
  }
}
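For the presentation layer, a rough sketch of stitching strings and context back together by id could look like the following. The `stitch` function is a hypothetical helper, and `file_layout` is left out because its shape is still open. One detail to keep in mind: JSON object keys are always strings, so integer ids used as context keys come back as "1" and "1337" after a round trip.

```python
import json
from typing import Any, Dict, List

document = {
    "strings": {
        "static_strings": [
            {"id": 1, "encoding": "ascii", "offset": 77, "length": 40,
             "string": "!This program cannot be run in DOS mode."},
            {"id": 1337, "encoding": "ascii", "offset": 11644, "length": 12,
             "string": "VirtualQuery"},
            {"id": 9999, "encoding": "ascii", "offset": 123456, "length": 6,
             "string": "unique"},
        ],
        # Integer keys here; json.dumps turns them into string keys.
        "context": {
            1: {"structure": "pe.header", "tags": ["#common"]},
            1337: {"structure": "import table", "tags": ["#winapi", "#common"]},
        },
    },
}


def stitch(doc: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Attach each string's context (or an empty dict) to a copy of its record."""
    strings = doc["strings"]
    context = strings["context"]
    rows = []
    for s in strings["static_strings"]:
        row = dict(s)
        # Look up by the string form of the id, as stored in JSON.
        row["context"] = context.get(str(s["id"]), {})
        rows.append(row)
    return rows


# Round-trip through JSON to mimic reading a stored results file.
loaded = json.loads(json.dumps(document))
rows = stitch(loaded)
```

Strings without a context entry, like id 9999, get an empty context instead of failing the lookup.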