- /Data/ processed RData files of datasets
- /Data/Raw Data Files/ raw csv files of datasets
- /Run/ RData files containing run results
- /Logs/ Log files for pipeline runs
- /R/ R code + Rnw-files to create reports
- /R/Packages/ R package directory (default)
To set up the pipeline on your machine:
- Add a unique computer name (comp_name) and the path to your home directory (home_dir) to the file LocalVariables.sh
- Run SetupPipeline.sh in Bash to install the required R packages (note that this requires that R is able to access the internet).
- Add a raw file to Data/Raw Data Files/. The data file should be named {data_name}.csv, where {data_name} is a unique name for the dataset. The file must adhere to some rough guidelines, namely:
  - column names must be in the top row
  - no duplicate columns
  - no duplicate column names
  - outcome variable must be in column 1
  - outcome variable must consist only of 0s and 1s
  - all entries must be numeric
- Run R/DataProcessingScript.R to create the {data_name}_processed.RData file
- Run Test/DataFileTest.R to check that the dataset passes the tests
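
Before running the processing script, it can help to check a raw file against the guidelines above. The following is a minimal sketch, not part of the pipeline: check_csv is a hypothetical helper, and it assumes a plain comma-separated file with no quoted fields. It does not test for duplicate columns (identical contents under different names), and it does not replace Test/DataFileTest.R.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the pipeline): rough pre-check of a raw CSV.
# Assumption: plain comma-separated values, no quoted fields.
check_csv() {
    awk -F',' '
        BEGIN { bad = 0 }
        { sub(/\r$/, "") }                       # tolerate Windows line endings
        NR == 1 {                                # top row: column names
            ncol = NF
            numeric_names = 0
            for (i = 1; i <= NF; i++) {
                if (seen[$i]++) { print "duplicate column name: " $i; bad = 1 }
                if ($i ~ /^[-+]?[0-9.]+$/) numeric_names++
            }
            if (numeric_names == NF) {
                print "top row looks like data, not column names"; bad = 1
            }
            next
        }
        NF != ncol { print "line " NR ": expected " ncol " fields, found " NF; bad = 1 }
        # Outcome variable (column 1) must be 0 or 1.
        $1 != 0 && $1 != 1 { print "line " NR ": outcome must be 0 or 1, found " $1; bad = 1 }
        {
            # Every entry must be numeric.
            for (i = 1; i <= NF; i++)
                if ($i !~ /^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][-+]?[0-9]+)?$/) {
                    print "line " NR ", column " i ": non-numeric entry: " $i
                    bad = 1
                }
        }
        END { if (!bad) print "OK"; exit bad }
    ' "$1"
}
```

Calling check_csv "Data/Raw Data Files/{data_name}.csv" prints OK when no problem is found; otherwise it lists each problem and returns a non-zero exit status.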
Once the data file is in the proper format, users can run the pipeline to produce classification models.
Each "call" to the pipeline runs multiple classification methods for a single dataset (data_name) and a single weight on the positive examples (w_pos).
Methods that are currently supported include:
- CART (cart)
- C5.0 Tree (c50_tree)
- C5.0 Rule (c50_rule)
- L1-Penalized Logistic Regression (lars_lasso)
- L2-Penalized Logistic Regression (lars_ridge)
- L1+L2-Penalized Logistic Regression (lars_elastic_net)
- Random Forests (randomforest)
- Stochastic Gradient Boosting (sgb)
- SVM Linear Kernel (svm_linear)
- SVM RBF (svm_rbf)
To run the pipeline:
- Edit RunPipeline.sh with the data_name, fold_id, range of weights, methods and free parameters for each method.
- Execute RunPipeline.sh in Bash
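
To illustrate how a range of weights maps to pipeline calls, here is a rough sketch in Bash. The actual layout of RunPipeline.sh is not shown in this README, so everything here is an assumption: the values of data_name and fold_id are placeholders, the variables weights and methods are hypothetical names, and the echo stands in for whatever command actually launches R. It only shows that each value of w_pos produces one call covering all selected methods.

```shell
#!/usr/bin/env bash
# Illustrative only: variable layout and values are assumptions,
# not the real contents of RunPipeline.sh.
data_name="mydata"                                # placeholder dataset name
fold_id="fold_1"                                  # placeholder fold identifier
weights="0.5 1.0 2.0"                             # hypothetical range of w_pos values
methods="cart,c50_tree,lars_lasso,randomforest"   # method names from the list above

list_pipeline_calls() {
    # One call per weight; each call runs every selected method.
    for w_pos in $weights; do
        echo "data_name=${data_name} fold_id=${fold_id} w_pos=${w_pos} methods=${methods}"
    done
}

list_pipeline_calls
```

With three weights, the sketch lists three calls, one per value of w_pos.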
To Do
- Report Creation Script
- Fix ReadMe.txt with full instructions
- Add Case Weight Support to DataFileTest
- Check for Packages in StartUp.R
- Check Results Table in ProcessResults.R