This repo was initiated from a copy of the benchmark and sample code in Python for the Cause Effect Pairs Challenge, a machine learning challenge hosted by Kaggle and organized by ChaLearn.
Executing this requires Python 2.7 along with the following packages:
- pandas (tested with version 0.10.1)
- sklearn (tested with version 0.13)
- numpy (tested with version 1.6.2)
- scipy (tested with version 0.10.)
- ml_metrics
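To see which of these packages are present before running anything, a small check like the following may help. It is only a sketch: the helper name installed_versions is made up for illustration and is not part of this repo, and it reports whatever version string each package exposes rather than enforcing the tested versions.

```python
from __future__ import print_function

import importlib

# Import names of the packages listed above.
PACKAGES = ["pandas", "sklearn", "numpy", "scipy", "ml_metrics"]


def installed_versions(names):
    """Map each import name to its version string, or None if it cannot be imported."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
            continue
        # Not every package exposes __version__, so fall back to a placeholder.
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    for name, version in sorted(installed_versions(PACKAGES).items()):
        print("{0}: {1}".format(name, version or "NOT INSTALLED"))
```

Any package reported as NOT INSTALLED needs to be installed before train.py or predict.py can be expected to run.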
To run:
- Download the data
- Create three directories inside the repo directory: data, models, submissions
- Extract the Kaggle data inside the “data” directory so that this is a valid path: data/CEfinal_train_text/CEfinal_train_pairs.csv
- Modify SETTINGS.json to point to the training and validation data on your system, as well as a place to save the trained model and a place to save the submission
- To train the classifier, run "python train.py"; it will save the model in the models directory
- To cross-validate instead, run "python train.py -c 10" [10-fold CV]
- To try with a small subset of the data, run "python train.py -n 100" [first 100 rows]
- The two options can be combined: "python train.py -n 100 -c 3" takes the first 100 rows and runs a 3-fold cross-validation
- Experiment with different classifiers in the get_pipeline() function in train.py
- Make predictions on the validation set by running "python predict.py" [check the path]
- Make a submission with the output file in the submissions directory
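The directory setup from the steps above can also be scripted. The following is a minimal sketch, assuming it is run from the repo directory; the helper name prepare_repo is hypothetical and not part of this repo. It creates the three directories if they are missing and reports whether the training pairs file sits at the expected path.

```python
from __future__ import print_function

import os

# Directory names and the sample path are taken from the steps above.
REQUIRED_DIRS = ["data", "models", "submissions"]
TRAIN_PAIRS = os.path.join("data", "CEfinal_train_text", "CEfinal_train_pairs.csv")


def prepare_repo(repo_dir="."):
    """Create the three working directories; return True if the training file is in place."""
    for name in REQUIRED_DIRS:
        path = os.path.join(repo_dir, name)
        if not os.path.isdir(path):
            os.makedirs(path)
    return os.path.isfile(os.path.join(repo_dir, TRAIN_PAIRS))


if __name__ == "__main__":
    if prepare_repo():
        print("Found " + TRAIN_PAIRS)
    else:
        print("Missing " + TRAIN_PAIRS + "; extract the Kaggle data into data/ first")
```

This does not touch SETTINGS.json, so the paths there still need to be edited by hand.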