Different percentage of samples for each label after using MultilabelStratifiedKFold

Question

Closed this issue 4 years ago · 2 comments

Hi trent-b:

Thanks for this nice repository. I hope you can answer the questions below:

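import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

def multi2single_labels(y):
    # Count how many rows share each distinct label vector (labelset).
    d = {}
    for yy in y:
        d[str(yy)] = d.get(str(yy), 0) + 1
    return d

# 1078 samples with 4 binary labels, listed labelset by labelset.
yy = np.array([[0,0,0,0]]*318+[[1,0,0,0]]*264+[[0,0,1,0]]*58+[[0,1,0,1]]*51+\
              [[1,0,0,1]]*81+[[0,1,0,0]]*151+[[0,1,1,0]]*33+[[0,0,1,1]]*27+\
              [[0,0,0,1]]*54+[[0,1,1,1]]*21+[[1,1,0,0]]*11+[[1,1,0,1]]*7+[[1,0,1,0]]*2)
xx = np.zeros((yy.shape[0],))  # dummy features; only the labels drive the split
kfold = MultilabelStratifiedKFold(n_splits=2, random_state=42, shuffle=True)
for idx_fold, (idx_train, idx_valid) in enumerate(kfold.split(xx, yy)):
    print(f'Now in {idx_fold}th fold')
    y_valid = yy[idx_valid]
    d_y = multi2single_labels(y_valid)  # labelset counts in this validation fold
    print(f'labels of y: {d_y}')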

Running the code above (the simplest case, 2 folds) gives this result:
Now in 0th fold
labels of y: {'[0 0 0 0]': 155, '[1 0 0 0]': 136, '[0 0 1 0]': 28, '[0 1 0 1]': 25, '[1 0 0 1]': 37, '[0 1 0 0]': 76, '[0 1 1 0]': 18, '[0 0 1 1]': 15, '[0 0 0 1]': 31, '[0 1 1 1]': 9, '[1 1 0 0]': 5, '[1 1 0 1]': 4}
Now in 1th fold
labels of y: {'[0 0 0 0]': 163, '[1 0 0 0]': 128, '[0 0 1 0]': 30, '[0 1 0 1]': 26, '[1 0 0 1]': 44, '[0 1 0 0]': 75, '[0 1 1 0]': 15, '[0 0 1 1]': 12, '[0 0 0 1]': 23, '[0 1 1 1]': 12, '[1 1 0 0]': 6, '[1 1 0 1]': 3, '[1 0 1 0]': 2}
Q1: Why are the two '[1 0 1 0]' samples not split one per fold, but both placed in the 1th fold?
Q2: Why do the counts of some label combinations differ so much between folds (e.g. '[0 0 0 0]', '[1 0 0 0]')?

Thanks!

Answer 1 · 2021-03-27T02:46:12.000Z

Hi Lance0218,

Your questions are understandable. What you are observing is actually not unexpected though. The paper from Sechidis et al. (2011) discusses the pros and cons of a "labelset" approach versus their approach. I believe you are thinking more in terms of a "labelset" approach. The approach by Sechidis et al. works label by label instead: at each step it picks the label with the fewest remaining positive ("1") examples across all samples and distributes those samples among the folds. It therefore balances the count of each individual label, not the count of each 4-element combination. This is in contrast to the "labelset" approach, which would look at your 4-element vectors and immediately put one [1 0 1 0] vector into the 0th fold and the other [1 0 1 0] vector into the 1th fold. It may help to take a look at this slide deck by one of the authors starting at Slide 9.
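To see the difference on the numbers already posted, here is a small sketch that sums the positives of each individual label per fold. The two dictionaries are copied from the output in the question; `per_label_positives` is a hypothetical helper written for this illustration and is not part of the library.

```python
# Labelset counts copied from the two validation folds printed in the question.
fold0 = {'[0 0 0 0]': 155, '[1 0 0 0]': 136, '[0 0 1 0]': 28, '[0 1 0 1]': 25,
         '[1 0 0 1]': 37, '[0 1 0 0]': 76, '[0 1 1 0]': 18, '[0 0 1 1]': 15,
         '[0 0 0 1]': 31, '[0 1 1 1]': 9, '[1 1 0 0]': 5, '[1 1 0 1]': 4}
fold1 = {'[0 0 0 0]': 163, '[1 0 0 0]': 128, '[0 0 1 0]': 30, '[0 1 0 1]': 26,
         '[1 0 0 1]': 44, '[0 1 0 0]': 75, '[0 1 1 0]': 15, '[0 0 1 1]': 12,
         '[0 0 0 1]': 23, '[0 1 1 1]': 12, '[1 1 0 0]': 6, '[1 1 0 1]': 3,
         '[1 0 1 0]': 2}


def per_label_positives(labelset_counts):
    """Sum the number of 1s per label column (hypothetical helper)."""
    n_labels = len(next(iter(labelset_counts)).strip('[]').split())
    totals = [0] * n_labels
    for key, count in labelset_counts.items():
        bits = [int(b) for b in key.strip('[]').split()]
        for i, bit in enumerate(bits):
            totals[i] += bit * count
    return totals


print(per_label_positives(fold0))  # [182, 137, 70, 121]
print(per_label_positives(fold1))  # [183, 137, 71, 120]
```

Each label's positives differ by at most one between the folds, even though individual combinations such as [1 0 1 0] are not split evenly.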

From a practical perspective, you can change the random_state to find a split that may be more suitable for you. I see that random_state=0 splits [1 0 1 0] between the two folds. I hope this helps.
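If an even split of every label combination matters more than per-label balance, the "labelset" idea described above can be sketched with the standard library alone. This is only an illustration under my own assumptions: `labelset_folds` is a hypothetical function, not something this repository provides, and it is not the Sechidis et al. algorithm.

```python
import random
from collections import defaultdict


def labelset_folds(y, n_splits=2, seed=0):
    """Assign each row of y to a fold so that every distinct label vector
    is spread as evenly as possible (hypothetical labelset sketch)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, row in enumerate(y):
        groups[tuple(row)].append(idx)  # treat the whole vector as one class

    fold_of = [None] * len(y)
    counter = 0  # running counter so leftovers do not all land in fold 0
    for indices in groups.values():
        rng.shuffle(indices)
        for idx in indices:
            fold_of[idx] = counter % n_splits
            counter += 1
    return fold_of


# Same label matrix as in the question, as plain Python lists.
y = ([[0, 0, 0, 0]] * 318 + [[1, 0, 0, 0]] * 264 + [[0, 0, 1, 0]] * 58 +
     [[0, 1, 0, 1]] * 51 + [[1, 0, 0, 1]] * 81 + [[0, 1, 0, 0]] * 151 +
     [[0, 1, 1, 0]] * 33 + [[0, 0, 1, 1]] * 27 + [[0, 0, 0, 1]] * 54 +
     [[0, 1, 1, 1]] * 21 + [[1, 1, 0, 0]] * 11 + [[1, 1, 0, 1]] * 7 +
     [[1, 0, 1, 0]] * 2)

fold_of = labelset_folds(y, n_splits=2)
rare = [sum(1 for row, f in zip(y, fold_of)
            if tuple(row) == (1, 0, 1, 0) and f == k) for k in range(2)]
print(rare)  # [1, 1]
```

The trade-off is the one the paper discusses: with many labels, most combinations become too rare to split, which is where the per-label approach tends to do better.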

Answer 2 · 2021-04-01T10:21:23.000Z

Hi trent-b:

Thank you for your reply. I found that I can easily do what I want with "LabelEncoder", so this issue can be closed.
Thank you again for everything you’ve done!

Best,
Lance