Primary language: Python. License: MIT.

mcs_kfold

mcs_kfold stands for "Monte Carlo stratified k-fold". This library attempts to distribute discrete/categorical variables equally across all folds. Internally, it repeats stratified k-fold trials with different random seeds and keeps the seed that yields the least entropy in the distribution of the specified variables. The greatest advantage of this method is that it can be applied to multi-dimensional targets, that is, to several target columns at once.
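The seed search described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's actual implementation: the helper names (`stratified_folds`, `imbalance`, `best_seed`), the toy data, and the scoring rule (a sum of KL divergences between each fold's distribution and the overall one) are all hypothetical choices made for this sketch.

```python
import math
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits, seed):
    """Assign each row to a fold, stratifying on the joint label tuple."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)
    fold_of = [0] * len(labels)
    for label in sorted(groups):
        members = groups[label]
        rng.shuffle(members)
        # Random starting fold, so leftover rows land in different folds per seed.
        offset = rng.randrange(n_splits)
        for pos, idx in enumerate(members):
            fold_of[idx] = (offset + pos) % n_splits
    return fold_of


def imbalance(labels, fold_of, n_splits):
    """Assumed score: per column and fold, KL divergence from the overall distribution."""
    total = 0.0
    for col in range(len(labels[0])):
        overall = Counter(row[col] for row in labels)
        for fold in range(n_splits):
            in_fold = Counter(
                row[col] for row, f in zip(labels, fold_of) if f == fold
            )
            size = sum(in_fold.values())
            if size == 0:
                return math.inf  # an empty fold is never acceptable
            for value, count in in_fold.items():
                p = count / size
                q = overall[value] / len(labels)
                total += p * math.log(p / q)
    return total


def best_seed(labels, n_splits=5, max_iter=100):
    """Try seeds 0..max_iter-1 and return the one with the lowest score."""
    scores = {
        seed: imbalance(labels, stratified_folds(labels, n_splits, seed), n_splits)
        for seed in range(max_iter)
    }
    return min(scores, key=scores.get)


# Toy data: three categorical targets per row (made-up values).
data_rng = random.Random(0)
labels = [
    (data_rng.randint(0, 1), data_rng.randint(1, 3), data_rng.choice(["male", "female"]))
    for _ in range(200)
]
seed = best_seed(labels, n_splits=5, max_iter=50)
print("chosen seed:", seed)
```

Because the score is computed per column, the same loop works for one target column or for several, which is the multi-dimensional case the description refers to.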

Usage

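from mcs_kfold import MCSKFold

num_cv = 5  # number of folds
# df must already hold your data, including the target columns listed below
mcskf = MCSKFold(n_splits=num_cv, shuffle_mc=True, max_iter=100)

for fold, (train_idx, valid_idx) in enumerate(
    mcskf.split(df=df, target_cols=["Survived", "Pclass", "Sex"])
):
    ...  # per-fold training and validation code goes here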

See also the example for further information.

The histograms below were generated with this library on the Kaggle Titanic: Machine Learning from Disaster data. They show that the three target variables are equally distributed over the five folds.

[Images: histograms of the target variables for fold 0, fold 1, fold 2, fold 3, and fold 4]
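To check the balance numerically rather than by eye, one option is to compute each value's share per fold and compare the shares across folds. The sketch below is a minimal, standard-library illustration: `fold_shares` and `max_spread` are hypothetical helpers, not part of mcs_kfold, and it assumes each fold is given as a list of integer row positions. The toy data is made up.

```python
from collections import Counter


def fold_shares(values, folds):
    """For each fold (a list of row positions), return {value: share of rows}."""
    shares = []
    for indices in folds:
        counts = Counter(values[i] for i in indices)
        shares.append({v: c / len(indices) for v, c in counts.items()})
    return shares


def max_spread(shares):
    """Largest gap between the highest and lowest share of any value across folds."""
    all_values = {v for s in shares for v in s}
    return max(
        max(s.get(v, 0.0) for s in shares) - min(s.get(v, 0.0) for s in shares)
        for v in all_values
    )


# Toy column with two classes, split into five perfectly balanced folds.
survived = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
folds = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

shares = fold_shares(survived, folds)
spread = max_spread(shares)
print(shares[0], spread)
```

A spread close to zero for every target column means the folds are well balanced for that column.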

Install

pip

pip install mcs_kfold

Install newest version

git clone https://github.com/MasashiSode/mcs_kfold
cd mcs_kfold
pip install .

Develop

poetry install

Test

pytest