node-firehose-s3-processor

Tool to process S3 dumps of JSON records made with AWS Kinesis Firehose

If you use AWS Kinesis Firehose (https://aws.amazon.com/kinesis/firehose/) to stream JSON records into an S3 bucket, the bucket fills with gzipped files in directories named after the date and hour of creation (YYYY/MM/DD/HH):

2016/04/01/00/aaa-2-2016-04-01-00-11-51-9c03f701-5a59-4b3e-9c49-b34c1c84b68c.gz
2016/04/01/04/bbb-2-2016-04-01-04-57-10-d70d8ee6-7369-446e-8047-b3341f532be8.gz
2016/04/01/05/ccc-2-2016-04-01-05-12-11-b2902d59-5193-4e6a-ac3d-ff971dca5910.gz
...

where each file contains a sequence of JSON records written back to back, with no delimiter between them:

{-1st json record-}{-2nd json record-}...{-Nth json record-}
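
Because the records are concatenated without a separator, a file cannot be passed to JSON.parse in one go. The following is a minimal sketch of one way to split such a file, not necessarily how this tool does it. The function names splitJsonRecords and parseFirehoseFile are hypothetical, and the sketch assumes every top-level record is a JSON object.

```javascript
const zlib = require('zlib');

// Split a string of back-to-back JSON objects into parsed records.
// Tracks brace depth and skips string literals, so braces inside
// string values do not confuse the scanner.
function splitJsonRecords(text) {
  const records = [];
  let depth = 0;
  let start = -1;
  let inString = false;
  let escaped = false;

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === '\\') escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') {
      inString = true;
    } else if (ch === '{') {
      if (depth === 0) start = i; // a new top-level record begins here
      depth++;
    } else if (ch === '}' && depth > 0) {
      depth--;
      if (depth === 0) {
        records.push(JSON.parse(text.slice(start, i + 1)));
        start = -1;
      }
    }
  }
  return records;
}

// Hypothetical helper: gunzip a downloaded file body, then split it.
function parseFirehoseFile(gzBuffer) {
  return splitJsonRecords(zlib.gunzipSync(gzBuffer).toString('utf8'));
}
```

For example, parseFirehoseFile(body) would take the Buffer of one downloaded .gz object and return an array of record objects.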

You can use this tool to export selected fields from all JSON records that satisfy given conditions to a CSV file. For example, if each JSON record has the following format:

{
  "username": "David",
  "event": {
    "time": "2016/12/10 23:30",
    "name": "some event",
    ...
  },
  ...
}

then you can export a CSV file containing the username and event time of each record whose event name equals "login". Run:

node index.js -f 2016-04-01 -t 2016-04-01 -F username,event.time -C event.name=login some-bucket

to get:

username;event.time
David;2016/12/10 23:30
John;2016/12/10 23:45
...
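
The field and condition handling in this example can be sketched roughly as follows. This is an illustration under stated assumptions rather than the tool's actual code: the helper names (getField, parseConditions, matches, toCsv) are hypothetical, multiple conditions are assumed to be combined with AND, values are compared as strings, and values containing ";" are not escaped.

```javascript
// Resolve a dotted path such as "event.time" against a nested object.
function getField(record, path) {
  return path.split('.').reduce(
    (value, key) => (value != null ? value[key] : undefined),
    record
  );
}

// Parse "event.name=login,gl=uk" into [["event.name", "login"], ["gl", "uk"]].
function parseConditions(spec) {
  if (!spec) return [];
  return spec.split(',').map((pair) => {
    const idx = pair.indexOf('=');
    return [pair.slice(0, idx), pair.slice(idx + 1)];
  });
}

// Assumption: a record matches only if every condition holds.
function matches(record, conditions) {
  return conditions.every(([path, expected]) => {
    const value = getField(record, path);
    return value != null && String(value) === expected;
  });
}

// Build semicolon-separated output like the sample above.
function toCsv(records, fields, conditions) {
  const rows = records
    .filter((record) => matches(record, conditions))
    .map((record) =>
      fields
        .map((field) => {
          const value = getField(record, field);
          return value == null ? '' : String(value);
        })
        .join(';')
    );
  return [fields.join(';'), ...rows].join('\n');
}
```

Calling toCsv(records, ['username', 'event.time'], parseConditions('event.name=login')) mirrors the -F and -C options of the command above.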

Installation

Install npm dependencies:

npm install

Usage

  Usage: index [options] <bucket>

  Options:

    -h, --help                 output usage information
    -f, --date-from [string]   From date [2016-04-01]
    -t, --date-to [string]     To date [2016-04-02]
    -F, --fields [string]      Fields to be extracted [name,date,search.query]
    -C, --conditions [string]  Conditions [name=John,gl=uk]
    -p, --pool [integer]       Number of parallel files processed [20]
    -o, --out [string]         Output file name [output.csv]

The options --date-from, --date-to, and --fields are required.
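
The date options select a range of the date-named directories described at the top of this document. As a rough sketch of how such a range could be expanded into S3 key prefixes, the hypothetical function hourlyPrefixes below lists one YYYY/MM/DD/HH/ prefix per hour. It assumes that both dates are inclusive and that directory names are in UTC; neither is confirmed by this README.

```javascript
// Expand an inclusive date range (YYYY-MM-DD strings) into hourly
// S3 key prefixes of the form "YYYY/MM/DD/HH/".
// Assumption: directory names follow UTC.
function hourlyPrefixes(dateFrom, dateTo) {
  const pad = (n) => String(n).padStart(2, '0');
  const hourMs = 60 * 60 * 1000;
  const first = Date.parse(dateFrom + 'T00:00:00Z');
  const last = Date.parse(dateTo + 'T23:00:00Z'); // last hour of the end date
  const prefixes = [];

  for (let t = first; t <= last; t += hourMs) {
    const d = new Date(t);
    prefixes.push(
      d.getUTCFullYear() + '/' +
      pad(d.getUTCMonth() + 1) + '/' +
      pad(d.getUTCDate()) + '/' +
      pad(d.getUTCHours()) + '/'
    );
  }
  return prefixes;
}
```

Each prefix could then be used to list the matching objects in the bucket, with up to --pool files processed in parallel.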