Whatapalaver/python_parser

Python Parser

Run the Parser

With a single file output for use as histogram input: python parser/parse.py -j data_all/ -o dumps/
For multi-file output: python parser/parse.py -j data_all/ -o dumps/ -m

Run the Analysis Plotter

Histograms triggered with -c flag:

Run all available fields with defaults set: python analysis/hist.py -c
Specific field histogram with max y-axis = 750 and bucket size of 20: python analysis/hist.py -f physicalDescription -m 750 -b 20 -c
Generate a full-scale inset subplot for the title field: python analysis/hist.py -f title -m 250 -b 10 -i dumps/output.csv -s -c
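The histogram flags used above could be wired up with argparse roughly as follows. This is a minimal sketch, not the actual contents of analysis/hist.py: the flag letters are taken from the commands above, but the dest names, help texts, and defaults (including dumps/output.csv as the default input) are assumptions made for illustration.

```python
import argparse


def build_arg_parser():
    """Build a parser mirroring the histogram flags shown in this README."""
    parser = argparse.ArgumentParser(description="Plot field histograms")
    # Flag letters come from the README; dest names and defaults are assumed.
    parser.add_argument("-f", dest="field", default=None,
                        help="single field to plot (default: all fields)")
    parser.add_argument("-m", dest="max_y", type=int, default=None,
                        help="maximum value of the y-axis")
    parser.add_argument("-b", dest="bucket_size", type=int, default=None,
                        help="histogram bucket size")
    parser.add_argument("-i", dest="input_file", default="dumps/output.csv",
                        help="CSV file produced by the parser")
    parser.add_argument("-s", dest="subplot", action="store_true",
                        help="add an inset subplot")
    parser.add_argument("-c", dest="histogram", action="store_true",
                        help="draw histograms")
    return parser


# Parse the same arguments as one of the example commands above.
args = build_arg_parser().parse_args(
    ["-f", "physicalDescription", "-m", "750", "-b", "20", "-c"]
)
print(args.field, args.max_y, args.bucket_size, args.histogram)
```

Leaving -m and -b as None when they are not given lets the plotting code fall back to its own defaults.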

Bar chart triggered with -x flag:

Generate a chart showing field population: python analysis/hist.py -x
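Assuming "field population" means the number of records in which a field has a non-empty value, the counts behind such a chart could be computed like this. The function name field_population and the sample rows are hypothetical; only the field names title and physicalDescription appear elsewhere in this README.

```python
import csv
import io


def field_population(csv_file):
    """Count the rows in which each column holds a non-empty value."""
    reader = csv.DictReader(csv_file)
    counts = {name: 0 for name in (reader.fieldnames or [])}
    for row in reader:
        for name, value in row.items():
            # Skip overflow cells (name is None) and missing cells (value is None).
            if name is not None and value is not None and value.strip():
                counts[name] += 1
    return counts


# Illustrative data only: two records, one with an empty physicalDescription.
sample = io.StringIO("title,physicalDescription\nVase,\nBowl,Glazed earthenware\n")
print(field_population(sample))  # {'title': 2, 'physicalDescription': 1}
```

The resulting dictionary can be passed straight to a bar chart, with keys as labels and values as bar heights.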

TODO

Parser

Refactor
Improve handling of extracted objects and the CSV writing loop. Use extract_objects(args) again

Parser Input

Currently deals with json_input only; amend to accept API calls as well as files
Uses os.walk to process files in a directory; amend to allow single-file processing
Want to amend output option to single or multiple files
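One possible way to support both a directory and a single file is to hide the choice behind a small generator. This is a sketch under assumptions: iter_json_paths is a hypothetical helper, not a function in parser/parse.py, and it assumes input files end in .json.

```python
import os


def iter_json_paths(path):
    """Yield JSON file paths from a single file or, recursively, a directory."""
    if os.path.isfile(path):
        # A single file was given: process just that file.
        yield path
        return
    # Otherwise walk the directory tree, as the parser currently does.
    for root, _dirs, files in os.walk(path):
        for name in sorted(files):
            if name.lower().endswith(".json"):
                yield os.path.join(root, name)
```

The rest of the parser can then loop over iter_json_paths(args_path) without caring which kind of input it was given.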

Parser Data Processing

Currently only works with the flat data; need to prepare the nested data extraction
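A common approach to nested data is to flatten each record into dotted keys so it still fits one CSV row. The sketch below is one way to do that and is not taken from the parser; the flatten helper, the "." separator, and the sample record are all assumptions for illustration.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts and lists into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the key prefix along.
            flat.update(flatten(value, full_key, sep))
        elif isinstance(value, list):
            # List items are keyed by their position in the list.
            for index, item in enumerate(value):
                item_key = f"{full_key}{sep}{index}"
                if isinstance(item, dict):
                    flat.update(flatten(item, item_key, sep))
                else:
                    flat[item_key] = item
        else:
            flat[full_key] = value
    return flat


# Illustrative record only; the real field structure may differ.
record = {"title": "Vase", "dimensions": {"height": 10, "unit": "cm"},
          "materials": ["clay", "glaze"]}
print(flatten(record))
```

For the sample record this yields the keys title, dimensions.height, dimensions.unit, materials.0 and materials.1.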

Parser Output

Make sure the process overwrites output file on the initial run
Need to sort out header generation for the single-file parser; it is currently hard-coded
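Both output items could be handled together: derive the header from the records themselves and open the output in write mode so an earlier file is replaced. This is a sketch assuming the records are already flat dictionaries held in a list; write_single_csv is a hypothetical name, not the parser's current function.

```python
import csv


def write_single_csv(records, out_path):
    """Write a list of flat dict records to one CSV, deriving the header from the data."""
    # Collect every key in first-seen order instead of hard-coding the header.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    # Mode "w" truncates any existing file, so each run starts with a clean output.
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(records)
    return fieldnames
```

Records that lack a field get an empty cell (restval=""), which keeps every row the same width for the histogram step.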

Histogram generator

Refactor

Hist Input

implement argparse
specify input file
allow override of bin size
allow override of max axis
allow to process single field type
Pandas to deal with multiple files. Currently a section of the parser outputs individual named processed files; it is commented out as I haven't yet worked out how to get pandas to process multiple inputs
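One way pandas can handle multiple inputs is to read each file into its own DataFrame and concatenate them. The sketch below assumes the multi-file output is a set of CSV files in one directory; load_dumps and the "*.csv" pattern are hypothetical and not part of the current code.

```python
import glob
import os

import pandas as pd


def load_dumps(directory, pattern="*.csv"):
    """Read every matching CSV in a directory into one combined DataFrame."""
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern} in {directory}")
    frames = [pd.read_csv(path) for path in paths]
    # Columns missing from some files are filled with NaN in the combined frame.
    return pd.concat(frames, ignore_index=True, sort=False)
```

The combined frame can then be used wherever the single dumps/output.csv is read today, for example load_dumps("dumps/").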