Column types are not the same as the original data. Check categorical columns.
Ededu1984 opened this issue · 11 comments
Hey, I tried to implement a missing-value imputation process, splitting the data into x_train, x_test and x_val. This error came up and I don't know exactly what it means.
I checked the data types of the columns of the three datasets (x_train, x_test and x_val), and they are the same.
AssertionError: Column types are not the same as the original data. Check categorical columns.
Pandas stores CategoricalDtypes as a list of categories - these categories must be the same. For all of the categorical columns in these datasets, can you verify that:
for col in x_train.select_dtypes("category").columns:
    # Each comparison should print True if the category sets match
    print(col, x_train[col].dtype == x_test[col].dtype)
    print(col, x_test[col].dtype == x_val[col].dtype)
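To illustrate why this check matters, here is a minimal sketch with made-up data (the names train_col and test_col are only for illustration). Both columns report "category" as their dtype, but the dtypes still compare as unequal because the category sets differ:

```python
import pandas as pd

# Two columns holding the same kind of data, converted to category separately.
train_col = pd.Series(["a", "b", "c"]).astype("category")
test_col = pd.Series(["a", "b"]).astype("category")

# The printed dtype name is identical for both columns...
same_name = str(train_col.dtype) == str(test_col.dtype)
# ...but the dtypes themselves differ, because "c" is missing from test_col.
same_dtype = train_col.dtype == test_col.dtype

print(same_name, same_dtype)
```

So looking at df.dtypes alone is probably not enough to rule out a mismatch; the dtype objects have to be compared directly.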
They are ok. Any chance this is related to the encoding of the categorical variables?
Hmmm here is the relevant code block:
assert all(
    [
        self.working_data[col].dtype == new_data[col].dtype
        for col in self.working_data.columns
    ]
), "Column types are not the same as the original data. Check categorical columns."
Where new_data is the data being imputed and self.working_data is the data stored in the kernel. Can you set up a loop to see which columns specifically are causing that to return false?
Should I run it after the error comes up?
Run this code:
for col in kernel.working_data.columns:
    print(f"{col} {kernel.working_data[col].dtype == new_data[col].dtype}")
Change kernel to the name of your kernel, and new_data to the name of the new data.
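If a plain True/False per column is not enough, a slightly more detailed report might help. This is only a sketch: dtype_mismatches is a hypothetical helper, not part of miceforest, and it assumes both frames have the same column names. For categorical columns it lists which categories appear on only one side:

```python
import pandas as pd


def dtype_mismatches(reference: pd.DataFrame, new_data: pd.DataFrame) -> dict:
    """Hypothetical helper: describe every column whose dtype differs."""
    report = {}
    for col in reference.columns:
        ref_dtype = reference[col].dtype
        new_dtype = new_data[col].dtype
        if ref_dtype == new_dtype:
            continue  # this column would pass the equality check
        both_categorical = isinstance(ref_dtype, pd.CategoricalDtype) and isinstance(
            new_dtype, pd.CategoricalDtype
        )
        if both_categorical:
            ref_cats = set(ref_dtype.categories)
            new_cats = set(new_dtype.categories)
            report[col] = {
                "only_in_reference": ref_cats - new_cats,
                "only_in_new": new_cats - ref_cats,
            }
        else:
            # Non-categorical mismatch: just record both dtype names
            report[col] = {"reference": str(ref_dtype), "new": str(new_dtype)}
    return report
```

It could be called as dtype_mismatches(kernel.working_data, new_data); an empty dict means every column passed the comparison.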
Hmmm if that code above is printing True for all columns then I don't see how this error can be occurring. The code above reproduces the check that impute_new_data does, and new_data is not edited in any way before the check occurs.
I just realized one of the columns returned False. I'll try to run the code without this column
I ran the code without the columns that returned False and it worked as expected.
For reference, it has to do with the train_test_split. For some categorical variables in the dataset, not all of the unique category values are present in the test and validation sets.
I am also running into this, and I think it would be great if miceforest could handle this situation automatically in a principled way that doesn't involve dropping an entire column.
Note that setting the categories of the training set column to include all categories from the test set column doesn't work either, because that runs into another check on line 361 ("has unused categories").
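One possible workaround for the case where the test or validation set holds only a subset of the training categories is to go in the other direction and cast the new data to the training column's dtype, leaving the training data untouched. The sketch below uses made-up data and only shows that the dtype equality assertion quoted earlier would then hold; it has not been checked against miceforest's other validations, so treat it as an assumption rather than a confirmed fix:

```python
import pandas as pd

# Made-up example: the test split is missing the "blue" category.
x_train = pd.DataFrame(
    {"color": pd.Series(["red", "green", "blue", "red"]).astype("category")}
)
x_test = pd.DataFrame({"color": pd.Series(["red", "green"]).astype("category")})

# Reuse the training dtype (including its full category list) on the test set.
for col in x_train.select_dtypes("category").columns:
    x_test[col] = x_test[col].astype(x_train[col].dtype)

# Same comparison as the assertion quoted in this thread.
dtypes_match = all(x_train[c].dtype == x_test[c].dtype for c in x_train.columns)
print(dtypes_match)
```

Any value in the new data that is not among the training categories would become NaN under this cast, so it only makes sense when the new data's categories are a subset of the training ones.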