201912-51-Predicting-US-Crop-Performance

EECS-E6893 final project code repo (project id: 201912-51)

Live Demo

Repo Structure

src/  
 |-- usheatmap/  
 |-- mllib/  
 |-- website/  
 |-- .tmp/

usheatmap: has all of the data preprocessing source code
mllib: has all of the Spark MLlib source code (notebook)
website: has all of the visualization source code
.tmp: has sample data needed by mod_test.py to exercise the usheatmap data preprocessing source code (used unless the data is downloaded from the FTP servers with the supplied scripts)

Running the Spark MLlib notebook on Dataproc

Use the following command to start a Dataproc cluster, then load the Jupyter notebook located in src/mllib/:

gcloud beta dataproc clusters create final2 \
    --optional-components=ANACONDA,JUPYTER \
    --image-version=preview \
    --enable-component-gateway \
    --metadata 'PIP_PACKAGES=requests_oauthlib google-cloud-bigquery tweepy' \
    --metadata gcs-connector-version=1.9.16 \
    --metadata bigquery-connector-version=0.13.16 \
    --project lofty-hearth-256117 \
    --bucket bigdatabucketcolumbia \
    --initialization-actions=gs://dataproc-initialization-actions/python/pip-install.sh,gs://dataproc-initialization-actions/connectors/connectors.sh \
    --single-node
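
Once the cluster is up, open JupyterLab from the cluster's Component Gateway links in the Cloud Console and upload the notebook from src/mllib/.

For orientation only, a minimal Spark MLlib regression workflow looks like the sketch below; the input path and the feature/label column names are hypothetical placeholders, not the project's actual schema:

# mllib_sketch.py (illustrative only; paths and columns are placeholders)
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("crop-performance").getOrCreate()

    # Load a preprocessed table from GCS (placeholder path)
    df = spark.read.csv("gs://bigdatabucketcolumbia/preprocessed.csv",
                        header=True, inferSchema=True)

    # Assemble hypothetical predictor columns into a single feature vector
    assembler = VectorAssembler(inputCols=["vh_index", "precip", "temp"],
                                outputCol="features")

    # Fit a linear model predicting a hypothetical yield column
    model = LinearRegression(featuresCol="features", labelCol="yield").fit(
        assembler.transform(df))
    print(model.coefficients)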

Running the data preprocessing code

To run the data preprocessing code, some setup is required. You must set up a Google account with the Google BigQuery API enabled. You must then generate a Google auth key (in JSON format) and point to it in mod_test.py. You must also create a BigQuery dataset and set it accordingly in mod_test.py.
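
The exact variable names in mod_test.py may differ, but the setup it asks for amounts to something like the sketch below (the key path and dataset name are placeholders):

# bq_setup sketch (key path and dataset name are placeholders)
    import os
    from google.cloud import bigquery

    # Point the client at the generated service-account key (JSON format)
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/key.json"

    client = bigquery.Client()

    # Create the dataset that the parsed tables will be written into
    dataset = bigquery.Dataset(f"{client.project}.crop_dataset")
    client.create_dataset(dataset, exists_ok=True)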

There are example parsed files stored in the src/.tmp/ directory. These files can be run through locally using the mod_test.py script by setting the following parameters to False:

# mod_test.py
    GET_VH = False     # skip downloading the raw VH data from the FTP server
    PARSE_VH = False   # skip parsing; use the pre-parsed sample in src/.tmp/
    #...
    GET_C5 = False     # likewise for the C5 data
    PARSE_C5 = False

The mod_test.py script is designed to exercise each of the data preprocessing modules independently. Be sure to disable BigQuery storage if it is not being used, or set the appropriate parameters in mod_test.py.
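
For reference, the gating pattern amounts to something like the sketch below; the helper functions and the STORE_BQ flag are illustrative stand-ins, not names taken from the script:

# gating sketch (helper names and STORE_BQ are hypothetical)
    GET_VH = False
    PARSE_VH = False
    STORE_BQ = False   # keep off when BigQuery storage is not being used

    def download_vh():
        print("fetching raw VH data from the FTP server...")

    def parse_vh():
        print("parsing the raw VH files...")

    def load_sample(path):
        print(f"loading pre-parsed sample data from {path}...")

    def store_in_bigquery():
        print("uploading parsed tables to BigQuery...")

    if GET_VH:
        download_vh()
    if PARSE_VH:
        parse_vh()
    else:
        load_sample("src/.tmp/")
    if STORE_BQ:
        store_in_bigquery()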

More Information

http://www.ee.columbia.edu/~cylin/course/bigdata/projects/