Rdump multi-timestamp functionality
Zawadidone opened this issue · 10 comments
I would like to speed up the processing of `rmulti-timestamp` by allowing it to read JSON files or JSON input from standard input. As a result, the following functionality should be added (sketched below):

- Use `target-query [...] -j` to create JSON files or output JSON (already supported)
- Use `rmulti-timestamp` to read JSON from standard input or from a file (currently not supported) and output record format (already supported) or JSON format (currently not supported)
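A minimal sketch of that flow, assuming a hypothetical target path; the `rmulti-timestamp` invocations that consume JSON are the proposed, not-yet-supported part:

```shell
# Produce JSON records for a single plugin (already supported).
target-query -f mft -j /targets/server.E01 > mft.json

# Proposed: rmulti-timestamp reads JSON from a file or standard input
# and emits new records for every datetime field, as records or as JSON.
rmulti-timestamp mft.json > mft-expanded.json
cat mft.json | rmulti-timestamp > mft-expanded.json
```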
I have a few questions to better understand how this could be added to the project:

- Should this implementation be the same as how `rdump` works with reading and outputting?
- What kind of functionality is new and cannot be re-used from the current code?
The idea behind this issue is that executing `rmulti-timestamp` on records output by `target-query` is slow (more than 1 hour), mainly when using the `mft` plugin. This is not surprising, given the amount of records that a single process (`rmulti-timestamp`) has to process.
But with the following method I think the usage of Dissect can be made more scalable, regardless of how many records a Dissect Target plugin outputs (see the sketch after this list):

- Use `xargs` to execute `target-query -j` and output records in JSON format for every plugin on a target (already supported)
- Use `xargs` to execute `split` on the JSON files, in the amount of lines that `rmulti-timestamp` can process in a short amount of time (outside of the Dissect project)
- Use `xargs` to execute `rmulti-timestamp` to read the JSON files, add new records for `datetime` fields and output records in JSON format (new functionality)
- Ingest the JSON files into Elasticsearch using Logstash (only applicable in my use case)
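A rough sketch of that pipeline; the file names, chunk size and parallelism are hypothetical, and the final step assumes the proposed JSON support in `rmulti-timestamp`:

```shell
# 1. Run every plugin listed in functions.txt against the target,
#    writing one JSON file per plugin (already supported).
cat functions.txt | xargs -P 4 -I {} sh -c \
  'target-query -f {} -j /targets/server.E01 > out/{}.json'

# 2. Split each JSON file into chunks that rmulti-timestamp can
#    process in a short amount of time.
ls out/*.json | xargs -I {} split -l 100000 {} {}.part-

# 3. Proposed: process the chunks in parallel, adding new records
#    for every datetime field.
ls out/*.part-* | xargs -P 8 -I {} sh -c \
  'rmulti-timestamp < {} > {}.multi.json'
```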
Or can the multi-timestamp feature be added to `rdump`?
Yes, I think adding native multi-timestamp support to `rdump` is the best way, instead of duplicating code. I'll be reviewing your PR.

As for speeding up and scaling, my thoughts on it:

- Storing it as split files is the way to go, so you can process them with multiple processes or machines (like you described).
- I do prefer storing them in their original `record` format instead of JSON, as you keep the typing information and can reprocess them at a later stage. Of course, your use case might vary, as you mentioned.
- You could theoretically also split `record` files into multiple files (currently not supported; see the sketch below).
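For illustration, a sketch of what an `rdump`-based flow might look like once such support exists; the `--multi-timestamp` flag shown here is hypothetical at the time of writing:

```shell
# Store records in their original record format to keep type information.
target-query -f mft /targets/server.E01 > mft.records

# Hypothetical: rdump expands each record into one record per datetime field
# and writes the result as records again, so it can be reprocessed later.
rdump --multi-timestamp mft.records -w mft-expanded.records
```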
You mentioned slowness of `mft` and `rmulti-timestamp`; this can also be partly attributed to the fact that `rmulti-timestamp` is not very efficient, because it keeps recreating `RecordDescriptors` on the fly. This is an expensive operation, as it does a small hash calculation to generate the descriptor identifier. Caching descriptors could be a way to avoid this, but let's save that for another PR.

Also feel free to make the necessary adjustments to the `mft` plugin; we've not gotten around to it yet.
Okay, I haven't dived into the NTFS implementation, but splitting the plugin into two functions, so that the same amount of records can be produced in half the time, speeds it up. If this is preferred I can create a PR for it.
For example (a possible invocation is sketched below):

- function `mft.std`, record: `filesystem/ntfs/mft/std`
- function `mft.filename`, record: `filesystem/ntfs/mft/filename`
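Assuming the split as proposed, the two halves could then run side by side against the same target (function names as proposed above, target path hypothetical):

```shell
# Each function produces its own record stream, so both can run in parallel.
target-query -f mft.std -j /targets/server.E01 > mft-std.json &
target-query -f mft.filename -j /targets/server.E01 > mft-filename.json &
wait
```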
@yunzheng Thanks for the quick review and merge
@Zawadidone No problem, thanks for your contribution!

I have no idea what the preferences are for the MFT plugin, I'll let @Schamper answer that :)
@Zawadidone I thought you meant to create individual records from every timestamp, instead of a single record with 4 timestamps. However, splitting it up like that is also interesting. Perhaps a `--std-only` flag?
@Schamper Both options would increase performance. A flag requires a hardcoded argument when using that function, which is not convenient when running all functions in an automated manner.

I don't know what the preferred method is (see a few examples below). The first example increases performance because of the 4 separate functions, compared to the second example.
Split up timestamps (4 functions)

```
ntfs:
  mft:
    creation_time - Return the creation_time MFT records of all NTFS filesystems. (output: records)
    last_modification_time - [...]
    last_change_time - [...]
    last_access_time - [...]
```

Split up records (2 functions)

```
ntfs:
  mft:
    std - Return the std MFT records [...]
    filename - Return the filename MFT records [...]
```

Split up timestamps and records (8 functions)

```
ntfs:
  mft:
    std - Return the std MFT records [...]
      creation_time - Return the creation_time MFT records of all NTFS filesystems. (output: records)
      last_modification_time - [...]
      last_change_time - [...]
      last_access_time - [...]
    filename - Return the filename MFT records [...]
      creation_time - Return the creation_time MFT records of all NTFS filesystems. (output: records)
      last_modification_time - [...]
      last_change_time - [...]
      last_access_time - [...]
```
I agree, so a flag should only be used to deviate from the default, and the default should be changed to something sane.

I think splitting it up like that is perhaps a bit too much, not to mention that it would complicate the `mft` plugin quite a bit. I think in general you'd want all timestamps, so I believe that should be the default. So perhaps something like:
```shell
target-query -f mft             # individual records for every timestamp
target-query -f mft --grouped   # or --mft-grouped, to group the timestamps into one record
target-query -f mft --std-only  # or --mft-std-only, standard information records only
```
In general it shouldn't matter if you give extra arguments, since they are only consumed if a plugin you're executing actually recognizes them.
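So, taking the hypothetical flags from the examples above, passing a plugin-specific flag alongside unrelated functions should be harmless:

```shell
# --mft-grouped would only be consumed by the mft plugin;
# other functions simply ignore it.
target-query -f mft,usnjrnl --mft-grouped /targets/server.E01
```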