fox-it/flow.record

Rdump multi-timestamp functionality

Zawadidone opened this issue · 10 comments

I would like to speed up the processing done by rmulti-timestamp by allowing it to read JSON files or JSON input from standard input.

As a result the following functionality should be added:

  • Use target-query [...] -j to create JSON files or output JSON (already supported)
  • Use rmulti-timestamp to read JSON from standard input or from a file (currently not supported) and output record format (already supported) or JSON format (currently not supported); a rough sketch of the reading/writing side follows after this list
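For illustration, a minimal sketch of what the reading and writing side could look like, assuming flow.record's RecordReader/RecordWriter with its adapter URI scheme (the jsonfile:// prefix selecting the JSON adapter; both file names here are placeholders):

```python
from flow.record import RecordReader, RecordWriter

# "jsonfile://" selects the JSON adapter; a plain path uses the default
# record stream adapter. Both paths are hypothetical names.
reader = RecordReader("jsonfile://records.json")
writer = RecordWriter("records.out")

for record in reader:
    writer.write(record)

writer.close()
```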

I have a few questions to better understand how this could be added to the project:

  • Should this implementation work the same way rdump handles reading and writing?
  • What kind of functionality is new and cannot be reused from the current code?

The idea behind this issue is that executing rmulti-timestamp on records output by target-query is slow (more than 1 hour), mainly when using the mft plugin. That is not surprising, given the number of records that a single process (rmulti-timestamp) has to handle.

But with the following method I think that the usage of Dissect can be made more scalable, regardless of how many records a Dissect Target plugin outputs:

  1. Use xargs to execute target-query -j and output records in JSON format for every plugin on a target (already supported)
  2. Use xargs to execute split on the JSON files, in chunks with a line count that rmulti-timestamp can process in a short amount of time (outside of the Dissect project; a stand-in sketch follows after this list)
  3. Use xargs to execute rmulti-timestamp to read the JSON files, add new records for datetime fields and output records in JSON format (new functionality)
  4. Ingest the JSON files using Logstash to Elasticsearch (only applicable in my use case)
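Step 2 happens outside Dissect; a plain-Python stand-in for split(1), with hypothetical file names, could look like this:

```python
import itertools

def split_jsonlines(path: str, chunk_size: int = 100_000) -> None:
    """Split a JSON-lines file into chunks of at most chunk_size lines."""
    with open(path, "r", encoding="utf-8") as fh:
        chunks = iter(lambda: list(itertools.islice(fh, chunk_size)), [])
        for i, chunk in enumerate(chunks):
            with open(f"{path}.{i:04d}", "w", encoding="utf-8") as out:
                out.writelines(chunk)
```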

Or can the multi-timestamp feature be added to rdump?

Yes, I think adding native multi-timestamp support to rdump is the best way, instead of duplicating code. I'll be reviewing your PR.

As for speeding up and scaling, my thoughts on it:

  • Storing it as split files is the way to go, so you can process them with multiple processes or machines (like you described).
  • I do prefer storing them in their original record format instead of JSON, as you keep the typing information and can reprocess them at a later stage. Of course, your use case might vary, as you mentioned.
    • You could theoretically also split record files into multiple files (currently not supported; see the sketch below)
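As a sketch of that idea (not existing functionality), records could be fanned out round-robin over several output files while keeping the native record format and its type information; file names are hypothetical:

```python
from flow.record import RecordReader, RecordWriter

def split_records(path: str, num_outputs: int = 4) -> None:
    writers = [RecordWriter(f"{path}.split{i}") for i in range(num_outputs)]
    for n, record in enumerate(RecordReader(path)):
        writers[n % num_outputs].write(record)  # round-robin over the outputs
    for writer in writers:
        writer.close()
```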

You mentioned the slowness of mft and rmulti-timestamp; this can also be partly attributed to the fact that rmulti-timestamp is not very efficient, because it keeps recreating RecordDescriptors on the fly. This is an expensive operation, as it does a small hash calculation to generate the descriptor identifier. Caching descriptors could be a way to avoid this, but let's save that for another PR.
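To illustrate the caching idea (just a sketch, not a planned implementation): memoizing descriptor creation means the identifier hash is computed once per unique (name, fields) pair instead of once per record.

```python
from functools import lru_cache
from flow.record import RecordDescriptor

@lru_cache(maxsize=None)
def cached_descriptor(name: str, fields: tuple) -> RecordDescriptor:
    # fields must be hashable, e.g. (("datetime", "ts"), ("string", "path"))
    return RecordDescriptor(name, list(fields))
```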

Also feel free to make the necessary adjustments to the mft plugin - we've not gotten around to it yet.

Okay, I haven't dived into the NTFS implementation, but splitting the plugin into two functions, so that the same number of records can be produced in half the time, speeds it up. If this is preferred I can create a PR for it.

For example:

  • function mft.std (record: filesystem/ntfs/mft/std)
  • function mft.filename (record: filesystem/ntfs/mft/filename)
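A rough sketch of that split, assuming dissect.target's namespaced plugin API (the descriptor fields here are illustrative, not the real MFT fields):

```python
from dissect.target.plugin import Plugin, export
from flow.record import RecordDescriptor

MftStdRecord = RecordDescriptor("filesystem/ntfs/mft/std", [
    ("datetime", "ts"),
    ("string", "path"),
])

MftFilenameRecord = RecordDescriptor("filesystem/ntfs/mft/filename", [
    ("datetime", "ts"),
    ("string", "path"),
])

class MftPlugin(Plugin):
    __namespace__ = "mft"

    @export(record=MftStdRecord)
    def std(self):
        # Yield records built from $STANDARD_INFORMATION attributes.
        ...

    @export(record=MftFilenameRecord)
    def filename(self):
        # Yield records built from $FILE_NAME attributes.
        ...
```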

@yunzheng Thanks for the quick review and merge

@Zawadidone no problem, thanks for your contribution!

I have no idea what the preferences are for the MFT plugin, I'll let @Schamper answer that :)

@Zawadidone I thought you meant creating individual records for every timestamp, instead of a single record with 4 timestamps. However, splitting it up like that is also interesting; perhaps a --std-only flag?

@Schamper Both options would increase performance. A flag requires a hardcoded argument when using that function, which is not convenient when running all functions in an automated manner.

I don't know what the preferred method is (see a few examples below). The first example increases performance because of the 4 separate functions, compared to the second example.

Split up timestamps (4 functions)

ntfs:
  mft:
    creation_time - Return the creation_time MFT records of all NTFS filesystems. (output: records)
    last_modification_time - [...]
    last_change_time - [...]
    last_access_time - [...]

Split up records (2 functions)

ntfs:
  mft:
    std - Return the std MFT records [...]
    filename - Return the filename MFT records [...]

Split up timestamps and records (8 functions)

ntfs:
  mft:
    std - Return the std MFT records [...]
      creation_time - Return the creation_time MFT records of all NTFS filesystems. (output: records)
      last_modification_time - [...]
      last_change_time - [...]
      last_access_time - [...]
    filename - Return the filename MFT records [...]
      creation_time - Return the creation_time MFT records of all NTFS filesystems. (output: records)
      last_modification_time - [...]
      last_change_time - [...]
      last_access_time - [...]

I agree, so a flag should only be used to deviate from the default, and the default should be changed to something sane.

I think splitting it up like that is perhaps a bit too much, not to mention that it would complicate the mft plugin quite a bit. I think in general you'd want all timestamps, so I believe that should be the default. So perhaps something like:

target-query -f mft  # individual records for every timestamp
target-query -f mft --grouped  # or --mft-grouped, to group the timestamps into one record
target-query -f mft --std-only  # or --mft-std-only, standard information records only

In general it shouldn't matter if you give extra arguments, since they are only consumed if a plugin you're executing actually recognizes them.
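For example (a sketch assuming dissect.target's @arg decorator; the flag names follow the examples above and the record descriptor is omitted for brevity):

```python
from dissect.target.plugin import Plugin, arg, export

class MftPlugin(Plugin):
    __namespace__ = "mft"

    @arg("--grouped", action="store_true", help="Group all timestamps into one record")
    @arg("--std-only", action="store_true", help="Emit standard information records only")
    @export(output="record")
    def mft(self, grouped: bool = False, std_only: bool = False):
        # target-query only passes these values to the plugin that declares
        # them; other plugins never see the extra arguments.
        ...
```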