Thompson Rivers University
COMP 4980 - Special Topics, Machine Learning
Bradley Crump
This project aims to show a student how to interact with data, formalize a plan for creating a predictive model and implement that model.
Much of the project structure derives from Abishek Thakur's book, Approaching Almost Any Machine Learning Problem.
The repo's code can be interacted with through the command line. The goal is to create an append-only, reusable codebase that can be applied to arbitrary use-cases. However, there are dependencies.
There is a requirements.txt file that to the best of my knowledge contains the requirements for training in this repository. The packages of note include:
- Scikit-learn
- imbalanced-learn
- joblib
- matplotlib
- numpy
A note: Run scripts from the root of the project
This repo supports training a model from the command line at the root of the project. There is also funcitonality to execute several scripts. Examples of these scripts can be found in the src/bash folder Training a model comes with several options. The general command structure for training a model:
python src/train.py \
--model <a model (key) from src/dispatchers/model_dispatcher> \
--fold <fold number (0 - 4 by default)> \
--datamanip <a manip (key) from src/dispatchers/data_manip_dispatcher> \
--grid <a grid (key) from src/dispatcher/grid_dispatcher> \
--normalize <a noramlize (key) from src/dispatcher/normalize_dispatcher> \
--subfeatures <defined in config as FEATURES>
A functional example:
python src/train.py \
--model rf \
--fold 0 \
--datamanip smote \
--grid grid \
--normalize standardize \
--subfeatures 1
model
is a scikit learn modelfold
is the stratified k fold that will be treated as the validation set
The following commands are optional
datamanip
is the sampling strategy usedgrid
which grid search to use. Will use the config.GRID_SEARCH_PARAMETERSnormalize
is used to scale the datasubfeatures
provides functionality for training a model on a subset of the features. Will use the config.FEATURES. If this command isn't used, all the features in the csv file in TRAINING_FILE will be used
The resulting model will be persisted to the /models folder. You can use the path to the model in the config.INFERENCE_MODEL to do other cool things in this repo like create a confusion matrix!
The repo uses a 'dispatching' framework that supports dynamic instantiation of different third party classes. The goal is to create an append only file that contains arbitrary complexity, however, there are drawbacks to this strategy, namely, the duplication of objects.
Currently, there are 4 dispatchers implemented.
- grid_dispatcher: dispatches different Grid Searching mechanisms from Sklearn
- data_manip_dispatcher: disptaches different sampling strategies from imblearn
- model_dispatcher: dispatches different Sklearn models
- normalize_dispatcher: dispatches scaling strategies from Sklearn
Confusion Matrix
Use confusion_matrix.py to produce a visualization of a confusion matrix.
Config Dependencies:
- INFERENCE_MODEL - model that will be used to create confusion matrix
- TRAINING_FILES - First file in the list will be used to retrieve the dataset
python src/conufusion_matrix.py --fold <fold-number>
Feature Importances
Use feature_importance.py to produce a visualizaiton of the importance of features.
Config Dependencies:
- INFERENCE_MODEL - model that will be used to create confusion matrix
- TRAINING_FILES - First file in the list will be used to retrieve the dataset
python src/feature_importance.py