davidsbatista/text-classification

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than

vikramkone opened this issue · 9 comments

Hi David
I'm seeing the following error when I try to run your script on my test data.
I generated a csv file similar to your movies_generes.csv file, with one text column and multiple label columns whose values are 1 or 0.

It looks like the problem is with the "StratifiedSplit" method, but I'm not sure what the issue is.
Every label column has the values '0' and '1' in more than 2 rows of the file.

PS E:\Tools\TLC> python E:\Projects\NLP\TrainClassifiers.py --vectors tfidf --clf nb
C:\Python27\lib\site-packages\gensim\utils.py:862: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
Loading already processed training data
Traceback (most recent call last):
  File "E:\Projects\NLP\TrainClassifiers.py", line 338, in <module>
    main()
  File "E:\Projects\NLP\TrainClassifiers.py", line 245, in main
    for train_index, test_index in stratified_split.split(data_x, data_y):
  File "C:\Python27\lib\site-packages\sklearn\model_selection\_split.py", line 1204, in split
    for train, test in self._iter_indices(X, y, groups):
  File "C:\Python27\lib\site-packages\sklearn\model_selection\_split.py", line 1546, in _iter_indices
    raise ValueError("The least populated class in y has only 1"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

I am having the same problem as @vikramkone. Can anyone suggest how I can solve it?

Try using x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.33, random_state=42). It should work.

This is because of the nature of stratification. The stratify parameter splits the data so that each class contributes roughly a test_size share of its samples to the test set. In this case, one of your classes has too few samples to keep that ratio: a class with a single member cannot appear in both the training and the test set.

I confirm the above explanation. I have encountered this situation when dealing with a class that has a very low count. You can either take a random sample (not stratified) or try different test_size values until the split is large enough to hold all of your labels.
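The suggestion above can be sketched as a small guard around train_test_split. This is only a minimal sketch, assuming a single-label, 1-D data_y; the helper name split_with_fallback and the toy data are made up for illustration and are not part of the original script. It stratifies only when every class has at least 2 members and otherwise falls back to a plain random split. A stratified split can still fail for other reasons, for example when the test set would be smaller than the number of classes.

```python
from collections import Counter

from sklearn.model_selection import train_test_split


def split_with_fallback(data_x, data_y, test_size=0.33, random_state=42):
    """Hypothetical helper: stratify only if every class has >= 2 members."""
    class_counts = Counter(data_y)
    can_stratify = min(class_counts.values()) >= 2
    return train_test_split(
        data_x,
        data_y,
        test_size=test_size,
        random_state=random_state,
        # stratify=None gives an ordinary shuffled split
        stratify=data_y if can_stratify else None,
    )


# Toy data (assumption): class "b" has a single member, so asking for a
# stratified split here would raise the ValueError from this issue.
data_x = ["doc %d" % i for i in range(10)]
data_y = ["a"] * 9 + ["b"]

x_train, x_test, y_train, y_test = split_with_fallback(data_x, data_y)
print(Counter(data_y))
print(len(x_train), len(x_test))
```

Printing the class counts first is usually the quickest way to see which label is responsible.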

It might be because you have a multi-label dataset, in which case you can use this tutorial from sklearn.
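To check whether the multi-label case applies, it can help to count whole label rows rather than individual columns. As far as I understand, the stratified splitters in sklearn treat each distinct row of a 2-D y as its own class, so a label combination that occurs only once triggers the error even if every column has plenty of 0s and 1s. The sketch below uses only the standard library; the function name rare_label_combinations and the toy rows are hypothetical.

```python
from collections import Counter


def rare_label_combinations(label_rows, min_count=2):
    """Return label combinations (whole rows) seen fewer than min_count times."""
    counts = Counter(tuple(row) for row in label_rows)
    return {combo: n for combo, n in counts.items() if n < min_count}


# Toy multi-label matrix (assumption): every column contains a 1 in at
# least 2 rows, yet the combination (1, 1, 0) occurs only once.
rows = [
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
]

print(rare_label_combinations(rows))  # {(1, 1, 0): 1}
```

If this returns anything, either drop or merge those rows before a stratified split, or use a non-stratified split.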

I faced the same issue too. I was working on a spam text classification problem, where there are usually far fewer spam messages than ham messages. When I looked at the counts of spam and ham messages, I found that they were equal; I had applied stratify = data['label'] without checking the counts first. I removed the stratify argument and the issue was solved.

I think sklearn should handle such situations automatically somehow. It's frustrating, and it is not immediately clear that the problem can be solved by slightly adjusting test_size.

How can we fix this? I think random_state can be any integer, because it is only used as the seed for the shuffling.

It might be because you have a multi-label dataset, in which case you can use this tutorial from sklearn.

Nope, I have 1,114 fake labels and 475 real labels. Now I know this is the reason for ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2. @WajdiBenSaad is 101% correct. I am doing a binary classification problem.