# June 11, 2014
Author: Sidney L. Bryson, Ph.D.
This code will generalize the PCMD parsing application developed by D. Supratim.  We will iteratively add a general ability to extract any field(s) from PCMD and generate statistics.

The base code differs from other PCMD analyzers in that it performs the statistical analysis using PySpark and Hadoop in a Lambda architecture.  Hence we expect that, when ported to a highly parallel environment, it will outperform the current implementations with respect to processing speed.
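
As a rough illustration of the field-extraction idea (a sketch only, not the contents of map_real_pcmd.py or map_synthetic_pcmd.py), the PySpark snippet below pulls a single field out of every record. It assumes, hypothetically, that each PCMD record is one delimiter-separated line; the delimiter, field position, and input path are placeholders.

    from pyspark import SparkContext

    FIELD_DELIMITER = ";"   # assumption: records are delimiter-separated lines
    FIELD_INDEX = 7         # assumption: position of the field of interest

    def extract_field(line, index=FIELD_INDEX, delim=FIELD_DELIMITER):
        """Return the value at position `index` of one PCMD record, or None."""
        parts = line.split(delim)
        return parts[index] if len(parts) > index else None

    if __name__ == "__main__":
        sc = SparkContext(appName="pcmd_field_extraction_sketch")
        # Illustrative path; the real HDFS locations are set up in steps 1-2 below.
        records = sc.textFile("hdfs://localhost:8020/SYN_PCMD_DATA/")
        values = records.map(extract_field).filter(lambda v: v is not None)
        print(values.take(10))
        sc.stop()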
# 8/12/2014: READ FIRST
The current code in this folder is still under development.
1. Use the awk (gawk) script to see a sample parsing (see step 7 below).
2. Currently you can see a basic Spark parsing of PCMD in IPython (see step 9 below).
3. Once IPython is running, execute pyspark_test_rev3 to see the output (see step 10 below).

Next steps: now that I can select any field and value pair from PCMD, I will construct some basic code that generates statistics on a single chosen field.
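
Below is a rough sketch of what those single-field statistics could look like (it is not the planned implementation): count how often each value of a chosen field occurs across the data set. The field position, delimiter, and input path are assumptions.

    from pyspark import SparkContext

    def field_value(line, index=7, delim=";"):
        # Assumption: one delimiter-separated PCMD record per line.
        parts = line.split(delim)
        return parts[index] if len(parts) > index else None

    if __name__ == "__main__":
        sc = SparkContext(appName="pcmd_single_field_stats_sketch")
        records = sc.textFile("hdfs://localhost:8020/SYN_PCMD_DATA/")
        value_counts = (records.map(field_value)
                               .filter(lambda v: v is not None)
                               .map(lambda v: (v, 1))
                               .reduceByKey(lambda a, b: a + b))
        for value, count in value_counts.collect():
            print("%s\t%d" % (value, count))
        sc.stop()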

++++++++++++++++++++

Files:
generate_windowed_cell_statistics.py
map_real_pcmd.py
map_synthetic_pcmd.py

config.ini

Instructions (Original)
To try the code, use the following steps:
(Note: steps 1 and 2 need only be performed once in a machine's lifetime.)
1. hadoop fs -mkdir /SYN_PCMD_DATA
   hadoop fs -mkdir /REAL_PCMD_DATA


2. hadoop fs -copyFromLocal /vagrant/DATA/SYNTHETIC_PCMD_LIKE_DATA/*  /SYN_PCMD_DATA/.
   hadoop fs -copyFromLocal /vagrant/DATA/Pruned_Real_PCMD_Data/*  /REAL_PCMD_DATA/.



3. ~/start-spark-cluster.sh


4. To run using synthetic data: pyspark generate_cell_statistics.py $SPARK_MASTER 0 "hdfs://localhost:8020/SYN_PCMD_DATA/pcmd_2014-03-07-15-31-40"

5. To run using real data: pyspark gen_cell.py 2014-03-26.17_12 2
   Note: this version of the code takes the date and the time window as input (a hypothetical sketch of the argument handling appears after step 10).

6. View the output in file "output_cell_statistics.txt"

7. For debugging, run the following awk script against the one-line PCMD file:
   gawk -f quick sample

8. To perform step 4 from within IPython, you need to do the following:

9. IPYTHON_OPTS="notebook --pylab inline" IPYTHON=1 pyspark
#http://www.mccarroll.net/blog/pyspark2/index.html
10. pyspark_test_rev3 is the most current IPython file.
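
The sketch below shows, hypothetically, how the date and window arguments from step 5 might be read; gen_cell.py's actual argument handling may differ, and the window units are an assumption.

    import sys

    def parse_args(argv):
        """Expected usage: pyspark gen_cell.py <date, e.g. 2014-03-26.17_12> <window>."""
        if len(argv) != 3:
            sys.exit("usage: pyspark gen_cell.py <date> <window>")
        date_stamp = argv[1]        # e.g. "2014-03-26.17_12"
        window_size = int(argv[2])  # e.g. 2 (units assumed)
        return date_stamp, window_size

    if __name__ == "__main__":
        date_stamp, window_size = parse_args(sys.argv)
        # The date stamp could select the HDFS input files, and the window size
        # could bucket records into fixed-length time intervals for statistics.
        print(date_stamp, window_size)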