Binary classification machine learning model(s) for loan default events
This repository was written and tested on Pop!_OS, an Ubuntu-based (Debian-family) Linux distribution
Setup (Editable)
# Using this repository as the current working directory
python3.7 -m venv .pyenv # Create a Python 3.7 virtual environment
.pyenv/bin/pip install --upgrade pip wheel pip-tools bumpversion tox # Install additional tools
.pyenv/bin/pip install -e . # Install this application (in editable mode)
Run the unit tests
# Using this repository as the current working directory
# Note: the dask scheduler does not print to stdout because it runs in a separate process.
.pyenv/bin/tox
Generate the latest dependencies (to update requirements.txt)
# Using this repository as the current working directory
.pyenv/bin/pip-compile -vvv --upgrade setup.py # Add --dry-run to preview changes without writing requirements.txt
Recommended grid-search parameters for model.py XGBoostModel.tune_parameters():
# Copy the assignment data to /tmp/
# cd to project directory
.pyenv/bin/ml101
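The actual tuning grid lives in XGBoostModel.tune_parameters(); as an illustration of what a custom grid search over such parameters looks like, here is a minimal stdlib-only sketch. The parameter names and the scoring function are hypothetical placeholders, not the repository's API:

```python
# Illustrative sketch only: the grid values and score() are assumptions,
# not the repository's actual tuning configuration.
from itertools import product

# Hypothetical grid resembling common XGBoost tuning ranges
grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1],
    "n_estimators": [100, 200],
}

def score(params):
    # Placeholder objective; the real code would cross-validate the model.
    return -(params["max_depth"] - 5) ** 2 + params["learning_rate"]

def grid_search(grid, score):
    """Exhaustively evaluate every parameter combination, keep the best."""
    best_params, best_score = None, float("-inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        s = score(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score

best, _ = grid_search(grid, score)
```

With the toy score above, the search settles on max_depth=5 and the larger learning rate; the real objective would be a cross-validated metric from the pipeline below.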
Notes on client-side integration:
# Required API methods are provided in model.py XGBoostModel()
# Call sequence in a Python terminal:
xgboost_model = model.XGBoostModel()
xgboost_model.evaluate(X, y)
# Please pass an X test set that resembles the training set.
# Because data transformation is applied only during sampling, the sampler is
# applied to the client's data as if it were being split for training.
# As a result, the dimensionality of the transformed data is chosen by the
# algorithm rather than by the caller.
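To make the dimensionality remark concrete, here is a hedged numpy sketch of a sampling-time transform. This is not the repository's sampler; it only illustrates why the client's X comes back with a column count chosen by the algorithm (here, a PCA-style projection keeping 95% of the variance):

```python
# Hypothetical stand-in for the repository's sampler/transformer.
import numpy as np

def sample_and_transform(X, var_threshold=0.95):
    """Center X and project onto enough principal components to keep
    var_threshold of the variance. The number of output columns is
    chosen from the data, not by the caller."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = np.cumsum(S ** 2) / np.sum(S ** 2)
    k = int(np.searchsorted(explained, var_threshold)) + 1
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X[:, 5:] = X[:, :5] * 2.0  # redundant columns -> fewer effective dimensions
Z = sample_and_transform(X)  # Z has at most 5 columns, picked by the data
```

Because the redundant columns add no variance of their own, the transform returns at most five components even though the client supplied ten features.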
Notes on dask implementation:
# dask distributed pipes break when the run completes, causing warnings in tests.
# These warnings stem from dask's difficulty managing local threads.
Notes on general model framework:
# class 1: PCA to identify the most informative variables
# class 2: k-fold cross-validator/imbalanced sampling
# class 3: Dask XGBoost
# class 4: generate evaluation metrics
# - f1 (precision versus recall) and confusion matrix
# - log-loss
# class 5: optimization using pca and sampling in a custom grid search framework
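The class-4 metrics can be sketched without dependencies as follows; the repository presumably computes these via library routines, so the functions below are illustrative reimplementations only:

```python
# Dependency-free sketches of f1, confusion matrix, and log-loss for a
# binary classifier; labels are 0/1.
import math

def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn) counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall."""
    tp, fp, fn, _ = confusion_matrix(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood; probabilities clipped away from 0/1."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)

f1 = f1_score([1, 0, 1, 1], [1, 0, 0, 1])      # precision 1.0, recall 2/3
loss = log_loss([1, 0], [0.9, 0.1])            # confident, mostly correct
```

f1 rewards balanced precision/recall on the minority (default) class, while log-loss additionally penalizes over-confident probability estimates, which is why both appear in class 4.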