Data loading issues while training
Closed this issue · 4 comments
Hey,
[Note]: I have a pandas DataFrame containing 2 columns:
- Text
- Label
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.2, shuffle=False
)
The train() and fit() methods are not working.
Here is the reference code.
How can I fix it?
Thanks
Hey @Practcdi,
TL;DR
This happens because fit/train requires a list of strings, not a DataFrame. (See function documentation here.)
Fix: pass x_train.values.tolist(), y_train to clf.train()
A bit more insight into why it does not work:
Following the respective code lines (here):
x_train, y_train = list(x_train), list(y_train)
if len(x_train) != len(y_train):
    raise ValueError("`x_train` and `y_train` must have the same length")
If you pass a DataFrame of shape (535544, 1) as x_train, casting it to a list returns only the column names, so the resulting list has length 1.
The check therefore compares the following:
if 1 != 535544:
    raise ValueError("`x_train` and `y_train` must have the same length")
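The length mismatch can be reproduced on a toy DataFrame. This is only a sketch: the column names follow the issue, the sample rows are made up, and it assumes x_train is a one-column DataFrame as described above. It does not call the classifier itself.

```python
import pandas as pd

# Hypothetical toy data standing in for the real 535544-row DataFrame.
df = pd.DataFrame({
    "Text": ["good movie", "bad movie", "fine plot"],
    "Label": ["pos", "neg", "pos"],
})

x_train = df[["Text"]]  # one-column DataFrame, shape (3, 1)
y_train = df["Label"]   # Series, shape (3,)

# Iterating a DataFrame yields its column labels, not its rows.
as_list = list(x_train)
print(as_list)                           # ['Text']
print(len(as_list), len(list(y_train)))  # 1 3

# On a one-column DataFrame, .values.tolist() gives one-element lists ...
nested = x_train.values.tolist()
print(nested)  # [['good movie'], ['bad movie'], ['fine plot']]

# ... whereas selecting the column first gives a flat list of strings.
docs = x_train["Text"].tolist()
labels = y_train.tolist()
print(docs)    # ['good movie', 'bad movie', 'fine plot']
```

If x_train is already a Series (for example, x_data was df["Text"]), then x_train.values.tolist() and x_train.tolist() both return the flat list directly.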
Thanks a lot 😊
@Practcdi Thanks for sharing this issue with us!
@angrymeir Thanks for taking care of it 💪. By the way, what do you think of adding an extra check at the beginning of fit/train that raises a ValueError saying something like "the x_train argument is expected to be a list of strings" when the provided x_train isn't a list of strings? 🤔
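A minimal sketch of what such a check might look like. The helper name check_x_train is hypothetical and not part of the library; it only illustrates the strictest reading of "list of strings".

```python
def check_x_train(x_train):
    """Hypothetical validation helper: accept only a plain list of str."""
    is_list_of_str = isinstance(x_train, list) and all(
        isinstance(doc, str) for doc in x_train
    )
    if not is_list_of_str:
        raise ValueError(
            "the `x_train` argument is expected to be a list of strings"
        )
    return x_train


# A flat list of strings passes through unchanged.
check_x_train(["good movie", "bad movie"])

# Nested lists (or any non-list container) are rejected.
try:
    check_x_train([["good movie"], ["bad movie"]])
except ValueError as err:
    print(err)
```

A check this strict would also reject tuples and other iterables of strings, which is a design decision in its own right.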
@sergioburdisso Hm unsure about that one because...
- Where to start and where to end? Is it only fit/train that needs this kind of validation, or also other methods (potentially all methods with user input, for consistency)?
- I think it's difficult to detect whether an x_train can be cast to a list of strings without information loss. E.g. while pandas.DataFrame can't be cast, pandas.Series can be cast without issues, so should it stay a valid option?
- It's well documented, stating exactly what the function expects.
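The Series versus DataFrame difference mentioned in the second point can be seen in a small sketch (sample rows are made up):

```python
import pandas as pd

texts = ["good movie", "bad movie", "fine plot"]  # made-up sample rows

series = pd.Series(texts, name="Text")
frame = pd.DataFrame({"Text": texts})

# A Series iterates over its values, so list() keeps every document.
print(list(series) == texts)     # True

# A DataFrame iterates over its column labels, so the documents are lost.
print(list(frame))               # ['Text']

# A strict isinstance(..., list) test would turn away the Series as well.
print(isinstance(series, list))  # False
```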