ECS-171-Course-Project

Introduction

The code's entry point is main.py. When run, it creates a DataSet_Builder object, which performs the preprocessing on the data and can build a DataSet object, which in turn runs the machine learning methods. Results are saved to a JSON file for later viewing.
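For orientation, here is a minimal sketch of that flow. The constructor and method signatures, and the results attribute, are assumptions for illustration; the real ones live in datasetbuilder.py and dataset.py.

    import json
    import sys

    from datasetbuilder import DataSet_Builder

    def main():
        # The two JSON files passed on the command line each hold a list of
        # parameter dictionaries (see "Running the Code" below).
        with open(sys.argv[1]) as f:
            dataset_params = json.load(f)
        with open(sys.argv[2]) as f:
            model_params = json.load(f)

        results = []
        for dparams in dataset_params:
            builder = DataSet_Builder(**dparams)   # hypothetical signature
            dataset = builder.build_dataset()      # returns a DataSet
            for mparams in model_params:
                dataset.run_ANN(**mparams)         # or run_OLS() / run_TSNN()
            results.append(dataset.results)        # hypothetical attribute

        with open("results.json", "w") as f:
            json.dump(results, f)

    if __name__ == "__main__":
        main()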

Running the Code

Run the code using

python3 main.py <dataset-level_params>.json <grid_search_params>.json

  • For example, one could run:

    python3 main.py singleparam.json singleannparam.json

  • To get the dataset-level param results, we used:

    python3 main.py params1.json singleannparam.json

  • To get the grid search results, use:

    python3 main.py params.json ann_params.json

    Note that, to break up the results, we actually ran several separate files named ann_params_*.json

These JSON files are constructed using parambuild.py (more notes below) and contain a list of dictionaries; each dictionary holds the parameter values used to construct the DataSet and run the models.
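As a hedged illustration of how such a file can be generated, the pattern looks roughly like the following; the parameter names and candidate values here are made up, not the project's actual ones.

    import json

    # Illustrative candidate values; the real parameter names and ranges
    # are defined in parambuild.py.
    param_list = []
    for lr in [0.001, 0.01]:
        for hidden_units in [16, 32, 64]:
            param_list.append({"lr": lr, "hidden_units": hidden_units})

    with open("ann_params_example.json", "w") as f:
        json.dump(param_list, f, indent=2)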

Relevant Files

Here is a list of the currently relevant files and folders.

    .
    +-- dataset.py
    +-- datasetbuilder.py
    +-- main.py
    +-- *.json
    +-- gridsearch.ps1
    +-- scrape.py
    +-- parambuild.py
    +-- OLS.py
    +-- ANN.py
    +-- TSNN.py
    +-- _data
    |   +-- site*.pkl
    |   +-- (data.pkl)
    |   +-- info.txt
    |   +-- merged.csv
    |   +-- sitedict.py
    +-- _results
    |   +-- _dataset-level_params
    |   +-- _grid_searches

We do not cover a full list of attributes here; more information about each code file can be found in the file itself.

  • dataset.py: contains the DataSet class, whose methods include (among others):
    • impute_inputs(): takes a future date and estimates the "X" input matrix values for that date by averaging values from that day and surrounding days in previous years (see the preprocessing sketch after this list)
    • run_OLS(): runs the OLS functions and stores the results
    • run_ANN(): runs the ANN functions and stores the results
    • run_TSNN(): runs the TSNN functions and stores the results
  • datasetbuilder.py: contains the DataSet_Builder class, whose methods include (see the preprocessing sketch after this list):
    • clean_df(): drops rows with NaN or -99.9 values
    • format_date(): converts dates to a cylindrical representation
    • use_rect_radius(): reduces the number of sites by keeping only those within a rectangular radius
    • use_pca(): uses PCA to reduce the number of features
    • remove_outliers(): uses IsolationForest to remove outliers
    • scale_data(): min-max scales the data into the range 0 to 1
    • build_dataset(): builds a DataSet object
  • main.py: the main file to run; uses a DataSet_Builder object to build a DataSet object, runs the ML methods, and saves the results
  • *.json: several files used as inputs to main.py
  • gridsearch.ps1: the final code was run on Windows, so this is a Windows PowerShell script that simply runs main.py with different command-line arguments
  • scrape.py: a standalone script that scrapes the website for data
    • pulls each site from SITEDICT
    • covers each year from 1980 to 2019
  • parambuild.py: a standalone script that uses lists and for loops to build a list of dictionaries of parameter combinations, both for dataset-level parameters and for the neural net parameters used by the grid search. The JSON files it saves can be used as command-line arguments for main.py (see the example above).
  • OLS.py: contains functions that compute the OLS fit and make a prediction
  • ANN.py: contains functions that create an FFNN model, train it, and make a prediction (see the ANN sketch after this list)
  • TSNN.py: contains functions that train a time series recurrent neural net and make predictions for the next four weeks
  • _data folder
    • site*.pkl: these are individual pickles for each site (all years)
    • data.pkl: overall data pickle
      • only appears after constructing DataSet object
      • is over 100 MB, so it's in the gitignore
    • info.txt: site info, including number, name, latitude, and longitude
    • merged.csv: a CSV file of all the data
      • currently ordered by date, then site
      • note that the data is rather sparse
      • if you make changes, delete data.pkl (it will be regenerated)
    • sitedict.py: uses info.txt to build a dictionary SITEDICT of sites
  • _results folder
    • _dataset-level_params folder: contains graphs and a JSON file of results for various combinations of dataset-level parameters, used to determine the optimal dataset-level parameters. The folder also contains an INFO.txt file with more information.
    • _grid_searches folder: contains multiple subfolders with graphs and result JSON files that make up the grid search. The folder also contains an INFO.txt file with more information.
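To make the preprocessing and imputation descriptions above concrete, here is a minimal, self-contained sketch. Column names such as "date", the 365.25-day period, and the ±window averaging are assumptions for illustration; the real implementations live in datasetbuilder.py and dataset.py.

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import IsolationForest

    def format_date(df):
        # One reading of the "cylindrical" representation: encode the day of
        # year cyclically with sin/cos so Dec 31 and Jan 1 end up close.
        doy = df["date"].dt.dayofyear
        df["date_sin"] = np.sin(2 * np.pi * doy / 365.25)
        df["date_cos"] = np.cos(2 * np.pi * doy / 365.25)
        return df

    def remove_outliers(df, cols):
        # Keep only the rows IsolationForest labels as inliers (+1).
        mask = IsolationForest(random_state=0).fit_predict(df[cols]) == 1
        return df[mask]

    def scale_data(df, cols):
        # Min-max scale each column into [0, 1].
        mins, maxs = df[cols].min(), df[cols].max()
        df[cols] = (df[cols] - mins) / (maxs - mins)
        return df

    def impute_inputs(df, future_date, window=3):
        # Estimate the "X" inputs for a future date by averaging readings
        # from the same day of year (within +/- window days) in past years.
        doy = pd.Timestamp(future_date).dayofyear
        near = (df["date"].dt.dayofyear - doy).abs() <= window
        return df.loc[near].mean(numeric_only=True)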
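The ANN sketch below uses scikit-learn's MLPRegressor as a stand-in; ANN.py may well use a different framework, and the hyperparameters shown are illustrative grid-search candidates rather than the project's actual values.

    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPRegressor

    def run_ann_example(X, y, hidden_units=32, lr=0.001):
        # Hold out 20% of the data, fit a one-hidden-layer FFNN, and
        # report R^2 on the held-out set.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=0)
        model = MLPRegressor(hidden_layer_sizes=(hidden_units,),
                             learning_rate_init=lr,
                             max_iter=1000,
                             random_state=0)
        model.fit(X_train, y_train)
        return model.score(X_test, y_test)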