drivendataorg/box-plots-sklearn

multilabel_train_test_split in multilabel.py does not ensure min_count examples of each label appear in each split


From DataCamp's "Machine Learning with the Experts: School Budgets", 2. Creating a simple first model - Setting up a train-test split in scikit-learn, the lesson text says:

"Some labels don't occur very often, but we want to make sure that they appear in both the training and the test sets. We provide a function that will make sure at least min_count examples of each label appear in each split: multilabel_train_test_split"

From what I can see in the source, only the test set is guaranteed to contain min_count examples of each label. The training set has no such guarantee, contrary to what the DataCamp lesson text describes. The training set indices are simply the complement of the test set indices, via this line in def multilabel_train_test_split: train_set_mask = ~test_set_mask
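To illustrate the concern, here is a minimal sketch using only NumPy. It does not call the repository's code. The helper label_counts_per_split and the toy matrix Y are hypothetical, and the test mask is chosen by hand to mimic a sampler that satisfies min_count for the test set only. It shows that taking the complement of the test mask can leave a rare label with fewer than min_count examples in the training set.

```python
import numpy as np


def label_counts_per_split(Y, test_set_mask):
    """Hypothetical helper: per-label positive counts in train and test."""
    # Training rows are just the complement of the test rows,
    # mirroring the line quoted in this issue.
    train_set_mask = ~test_set_mask
    return Y[train_set_mask].sum(axis=0), Y[test_set_mask].sum(axis=0)


# Toy binary indicator matrix (an assumption for illustration):
# label 0 occurs in every row, label 1 occurs only in rows 0 and 1.
Y = np.array([
    [1, 1],
    [1, 1],
    [1, 0],
    [1, 0],
    [1, 0],
    [1, 0],
])

min_count = 2

# Assume the sampler happened to pick rows 0 and 1 for the test set,
# which is enough to give the test set min_count examples of each label.
test_set_mask = np.zeros(len(Y), dtype=bool)
test_set_mask[[0, 1]] = True

train_counts, test_counts = label_counts_per_split(Y, test_set_mask)

print("test set meets min_count: ", bool((test_counts >= min_count).all()))
print("train set meets min_count:", bool((train_counts >= min_count).all()))
```

In this toy case the test set holds both examples of label 1, so the training set holds none. A check like this on the returned Y_train could confirm whether the same thing happens with the real function for labels that have fewer than 2 * min_count examples.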