/linear_regression_spark

Run Linear Regression using Spark and Python on the Kaggle New York City Taxi Trip Dataset


This program performs linear regression, using gradient descent (GD) or stochastic gradient descent (SGD) to optimize its parameters. The best model is selected with 10-fold cross-validation and is then evaluated on the "test" set. The hyperparameters searched are:

  • GD step size (alpha)
  • SGD batch size
  • regularization calculation (type)
  • regularization coefficient
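
For readers unfamiliar with the optimizer, the following is a minimal single-machine sketch of mini-batch SGD for linear regression with an optional L1 or L2 penalty. It is not the code in sgd_lr.py and does not use Spark. The function name `sgd_linear_regression` and its parameter names are hypothetical, chosen only to mirror the four hyperparameters described above. Leaving `batch_size` unset gives plain full-batch GD.

```python
import random


def predict(weights, bias, row):
    """Linear model prediction for one feature row."""
    return bias + sum(w * x for w, x in zip(weights, row))


def sgd_linear_regression(X, y, alpha=0.01, batch_size=None,
                          reg_type=None, reg_coef=0.0, epochs=200, seed=0):
    """Fit weights and bias by mini-batch SGD on mean squared error.

    Hypothetical helper, not the project's implementation.
    batch_size=None uses every row per step (full-batch GD).
    reg_type may be None, "l1", or "l2"; the bias is not regularized.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    batch_size = batch_size or n
    indices = list(range(n))

    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, n, batch_size):
            batch = indices[start:start + batch_size]
            grad_w, grad_b = [0.0] * d, 0.0
            # Accumulate the squared-error gradient over the batch.
            for i in batch:
                err = predict(weights, bias, X[i]) - y[i]
                grad_b += err
                for j in range(d):
                    grad_w[j] += err * X[i][j]
            m = len(batch)
            for j in range(d):
                if reg_type == "l2":
                    penalty = reg_coef * weights[j]
                elif reg_type == "l1":
                    # Subgradient of |w|: sign(w), taken as 0 at w == 0.
                    penalty = reg_coef * ((weights[j] > 0) - (weights[j] < 0))
                else:
                    penalty = 0.0
                weights[j] -= alpha * (grad_w[j] / m + penalty)
            bias -= alpha * grad_b / m
    return weights, bias


if __name__ == "__main__":
    # Toy data on the line y = 2x + 1, for illustration only.
    X = [[i / 10] for i in range(11)]
    y = [2 * row[0] + 1 for row in X]
    print(sgd_linear_regression(X, y, alpha=0.1, batch_size=4, epochs=2000))
```

A hyperparameter search would call a function like this once per combination of step size, batch size, regularization type, and coefficient, scoring each combination on the held-out fold of the cross-validation split.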

Steps to run program

  • Download the data files from Kaggle at https://www.kaggle.com/c/nyc-taxi-trip-duration/data
  • Put files in data folder
  • Run process_input/preprocessing_add_row_ids
    • Adds row ids to file
  • Rename the training file to 1458645.csv (the number of rows in the file)
  • Update GD/SGD parameters in sgd_lr.py
  • Run . single_sgd_run.sh or call python sgd_lr.py
  • To run sgd_lr.py against multiple training-set sizes, use run_sgd.sh
    • The list fns must contain the names of files in the data folder (without the .csv extension)
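
The row-id step in the list above might look roughly like the sketch below. This is an assumption about what preprocessing_add_row_ids does, not a copy of it: the function names, the `row_id` column name, and the choice to start ids at 0 are all hypothetical, and the sketch assumes the input CSV has a header row.

```python
import csv


def add_row_ids(rows, id_column="row_id"):
    """Yield rows with a sequential id prepended (hypothetical helper).

    The first row is treated as the header and receives the column name;
    data rows are numbered from 0.
    """
    rows = iter(rows)
    header = next(rows, None)
    if header is None:
        return  # empty input: nothing to emit
    yield [id_column] + list(header)
    for row_id, row in enumerate(rows):
        yield [row_id] + list(row)


def add_row_ids_to_file(src_path, dst_path):
    """Read a CSV from src_path and write it with row ids to dst_path."""
    with open(src_path, newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        csv.writer(dst).writerows(add_row_ids(csv.reader(src)))
```

Streaming the rows through a generator keeps memory use flat, which matters for a training file with more than a million rows.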

Folder structure:

  • Pseudo_code - holds project pseudocode
  • report - contains a PowerPoint with conclusions and graphs
  • results - training, cross-validation, and test results
  • src - Python code and scripts to run linear regression
  • src/not_used/ - miscellaneous code that was not used for the project
  • src/process_input - Python code to process training data
  • test_input - small training sets to test code (not included for submission)