linear_regression_spark: A Python repository from sarmstr5

This program performs linear regression, using GD/SGD to optimize its parameters. The best model is found with 10-fold cross-validation and is then evaluated on the "test" set. The parameters that are searched over are:
- GD step size (alpha)
- SGD batch size
- regularization calculation or type
- regularization coefficient
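
As a rough illustration of the search described above, here is a minimal sketch in plain Python. It is not the code in sgd_lr.py and it does not use Spark. All function and parameter names (sgd_fit, cross_validate, grid_search, reg_type, reg_coef, epochs) are hypothetical, and the L1/L2 penalties are only assumed examples of what a "regularization type" could be.

```python
import itertools
import random


def predict(w, b, x):
    """Linear model: dot(w, x) + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def sgd_fit(X, y, alpha=0.1, batch_size=5, reg_type="l2", reg_coef=0.0,
            epochs=300, seed=0):
    """Mini-batch SGD on half the mean squared error, with an optional penalty.

    Setting batch_size=len(X) gives plain (full-batch) gradient descent.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    order = list(range(n))
    for _ in range(epochs):
        rng.shuffle(order)  # new sample order every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            grad_w, grad_b = [0.0] * d, 0.0
            for i in batch:
                err = predict(w, b, X[i]) - y[i]
                grad_b += err
                for j in range(d):
                    grad_w[j] += err * X[i][j]
            m = len(batch)
            for j in range(d):
                if reg_type == "l2":
                    penalty = reg_coef * w[j]
                elif reg_type == "l1":
                    penalty = reg_coef * ((w[j] > 0) - (w[j] < 0))  # sign(w)
                else:
                    penalty = 0.0
                w[j] -= alpha * (grad_w[j] / m + penalty)
            b -= alpha * grad_b / m  # the bias is left unregularized
    return w, b


def rmse(w, b, X, y):
    """Root mean squared error of the model on (X, y)."""
    sq = sum((predict(w, b, xi) - yi) ** 2 for xi, yi in zip(X, y))
    return (sq / len(y)) ** 0.5


def cross_validate(X, y, k=10, **params):
    """Average validation RMSE over k folds (fold f holds rows f, f+k, ...)."""
    n = len(X)
    scores = []
    for fold in range(k):
        val = list(range(fold, n, k))
        val_set = set(val)
        train = [i for i in range(n) if i not in val_set]
        w, b = sgd_fit([X[i] for i in train], [y[i] for i in train], **params)
        scores.append(rmse(w, b, [X[i] for i in val], [y[i] for i in val]))
    return sum(scores) / k


def grid_search(X, y, grid, k=10):
    """Try every combination in `grid`; return (best_params, best_cv_rmse)."""
    best_params, best_score = None, float("inf")
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cross_validate(X, y, k=k, **params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

A call such as grid_search(X, y, {"alpha": [0.01, 0.1], "batch_size": [5, 50], "reg_type": ["l1", "l2"], "reg_coef": [0.0, 0.01]}) would return the combination with the lowest average validation RMSE; that model would then be refit and scored once on the held-out test set.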

Steps to run program

Download files from kaggle at https://www.kaggle.com/c/nyc-taxi-trip-duration/data
Put files in data folder
Run process_input/preprocessing_add_row_ids
- Adds row ids to file
Change training file to 1458645.csv (number of rows in file)
Update GD/SGD parameters in sgd_lr.py
Run . single_sgd_run.sh or call python sgd_lr.py
To run sgd_lr.py against multiple training sizes, use run_sgd.sh
- The list fns must contain file names of files in the data folder (minus the .csv extension)
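
For reference, the row-id preprocessing step in the list above can be pictured with the following minimal sketch. It is not the repository's preprocessing_add_row_ids script; the function names, the row_id column name, and the choice to start IDs at 0 are all assumptions made for illustration.

```python
import csv


def with_row_ids(rows, id_column="row_id", start=0):
    """Yield each CSV row with an ID prepended; the first row is the header."""
    rows = iter(rows)
    header = next(rows, None)
    if header is None:
        return  # empty input: nothing to yield
    yield [id_column, *header]
    for row_id, row in enumerate(rows, start=start):
        yield [row_id, *row]


def add_row_ids(src_path, dst_path, id_column="row_id"):
    """Copy a CSV file, adding a leading row-ID column (hypothetical helper)."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        csv.writer(dst).writerows(with_row_ids(csv.reader(src), id_column))
```

Because with_row_ids is a generator, the file is streamed row by row rather than loaded into memory, which matters for a training file with more than a million rows.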

Folder structure:
- Pseudo_code - holds project pseudo code
- report - contains powerpoint with conclusions and graphs
- results - training cross validation and test results
- src - python code and scripts to run linear regression
- src/not_used/ - miscellaneous code that was not used for project
- src/process_input - python code to process training data
- test_input - small training sets to test code # not included for submission
