ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than

Question

vikramkone opened this issue 7 years ago · 9 comments

Hi David
I'm seeing the following error, when I try to run your script on my test data.
I generated a csv file similar to your movies_generes.csv file where I have one text column and multiple label columns where the values are 1 or 0

The problem seems to be in the "StratifiedSplit" method, but I'm not sure what the issue is.
All the label columns have '0' or '1' values in more than 2 rows of the file.

PS E:\Tools\TLC> python E:\Projects\NLP\TrainClassifiers.py --vectors tfidf --clf nb
C:\Python27\lib\site-packages\gensim\utils.py:862: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
Loading already processed training data
Traceback (most recent call last):
  File "E:\Projects\NLP\TrainClassifiers.py", line 338, in <module>
    main()
  File "E:\Projects\NLP\TrainClassifiers.py", line 245, in main
    for train_index, test_index in stratified_split.split(data_x, data_y):
  File "C:\Python27\lib\site-packages\sklearn\model_selection\_split.py", line 1204, in split
    for train, test in self._iter_indices(X, y, groups):
  File "C:\Python27\lib\site-packages\sklearn\model_selection\_split.py", line 1546, in _iter_indices
    raise ValueError("The least populated class in y has only 1"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

Answer 1 · 2018-03-01T22:25:43.000Z

I am having the same problem as @vikramkone. Can anyone suggest how I can solve it?

Answer 2 · 2018-03-15T11:17:51.000Z

Try using x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.33, random_state=42). It should work.
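A minimal sketch of that suggestion, using toy arrays in place of data_x and data_y (the values below are made up for illustration and are not from the original script). Because no stratify argument is passed, the split is purely random and the two-members-per-class check is not applied:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for data_x / data_y (hypothetical data, not from the thread).
data_x = np.arange(12).reshape(6, 2)
data_y = np.array([0, 0, 0, 1, 1, 2])  # class 2 has a single member

# No stratify argument: rows are shuffled and split at random, so the
# per-class member count is never checked.
x_train, x_test, y_train, y_test = train_test_split(
    data_x, data_y, test_size=0.33, random_state=42
)

print(x_train.shape, x_test.shape)
```

The trade-off is that a rare class can end up entirely in the train set or entirely in the test set, so class proportions are no longer guaranteed.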

Answer 3 · 2019-07-10T14:44:32.000Z

This is because of the nature of stratification. The stratify parameter splits the data so that every class keeps the same proportion in the train and test sets, with a test_size share of each class going to the test set. In this case, one of your classes has too few samples (only 1) to be divided between the two sets.
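The explanation above can be checked with a small reproduction. This is only a sketch on made-up data: the array y below is hypothetical, with one class that occurs a single time, and counting the classes first shows which one triggers the error.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical labels: "c" occurs only once.
X = np.zeros((9, 1))
y = np.array(["a"] * 4 + ["b"] * 4 + ["c"])

# Count members per class to spot the singleton before splitting.
print(Counter(y.tolist()))

splitter = StratifiedShuffleSplit(n_splits=1, test_size=1 / 3, random_state=0)
try:
    next(splitter.split(X, y))
    error_message = None
except ValueError as exc:
    # A class with one member cannot appear in both train and test.
    error_message = str(exc)

print(error_message)
```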

Answer 4 · 2019-09-11T15:07:50.000Z

I confirm the above explanation. I have encountered this situation when dealing with a class that has a very low count. You can either take a random sample (not stratified) or try different test_size values so that the split is large enough to hold all of your labels.
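One way to combine both options is sketched below. The helper name split_with_optional_stratify is hypothetical (it is not part of sklearn): it stratifies only when every class has at least 2 members and otherwise falls back to a plain random split.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split


def split_with_optional_stratify(X, y, test_size=0.33, seed=42):
    """Stratify when every class has at least 2 members, else split at random."""
    smallest = min(Counter(np.asarray(y).tolist()).values())
    stratify = y if smallest >= 2 else None
    return train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=stratify
    )


X = np.arange(20).reshape(10, 2)
y_balanced = np.array([0] * 5 + [1] * 5)  # stratification is possible
y_rare = np.array([0] * 9 + [1])          # class 1 has a single member

_, _, _, y_test_balanced = split_with_optional_stratify(X, y_balanced)
_, _, _, y_test_rare = split_with_optional_stratify(X, y_rare)  # no ValueError
```

Note that adjusting test_size does not help with a single-member class. It matters for a related check: as far as I know, a stratified split also needs the train and test sets to each hold at least as many samples as there are classes.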

Answer 5 · 2020-07-15T02:19:05.000Z

It might be because you have a multi-label dataset, in which case you can use this tutorial from sklearn.
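To see why a multi-label file can fail even when every column has plenty of 0s and 1s, it helps to count whole rows rather than columns. This sketch assumes, based on my understanding of sklearn's behaviour (worth verifying against your installed version), that the stratified splitter treats each distinct row of a 2-D y as its own class. The label matrix is made up for illustration.

```python
from collections import Counter

# Hypothetical 0/1 label matrix: one row per document, one column per label.
label_rows = [
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [1, 1, 1],  # this combination appears only once
]

# Count each full label combination, not each column.
combo_counts = Counter(tuple(row) for row in label_rows)

# Combinations seen fewer than 2 times are the ones that would break
# a stratified split under the assumption stated above.
singletons = [combo for combo, n in combo_counts.items() if n < 2]
print(singletons)
```

Here every column contains at least one 1 and several 0s, yet the combination (1, 1, 1) is unique, which is enough to trigger the error.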

Answer 6 · 2020-09-03T15:28:17.000Z

I faced the same issue too. I was trying to solve a spam text classification problem, where there are usually far fewer spam messages than ham messages. But when I checked the counts of spam and ham messages, I found that they were equal. I had applied stratify = data['label'] without looking at the counts. I removed the stratify part and the issue was solved.

Answer 7 · 2021-12-30T13:30:27.000Z

I think sklearn should handle such situations somehow automatically. It's frustrating and not clear immediately that it can be solved by slight fine-tuning of test_size.

Answer 8 · 2022-08-03T16:26:26.000Z

How can we fix this? I think random_state can be any integer, because it is only used to seed the permutation.

Answer 9 · 2023-01-30T21:35:42.000Z

"It might be because you have a multi-label dataset, in which case you can use this tutorial from sklearn."

Nope. I have 1,114 fake labels and 475 real labels, and now I know this is the reason for the ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2. @WajdiBenSaad is 101% correct. I am doing a binary classification problem.