Column types are not the same as the original data. Check categorical columns.

Question

Ededu1984 opened this issue 2 years ago · 11 comments

Ededu1984 commented 2 years ago

Hey, I tried to implement a missing-value imputation process after splitting the data into x_train, x_test and x_val. This error came up and I don't know exactly what it means.
I checked the data types of the columns of the three datasets (x_train, x_test and x_val), and they are the same.

AssertionError: Column types are not the same as the original data. Check categorical columns.

Answer 1 · 2022-12-12T15:22:00.000Z

A pandas CategoricalDtype carries its list of categories, so two categorical columns only compare as the same dtype when their categories are the same. For all of the categorical columns in these datasets, can you verify that these comparisons hold:

for col in x_train.select_dtypes("category").columns:
    # Both comparisons should print True for every categorical column
    print(col, x_train[col].dtype == x_test[col].dtype)
    print(col, x_test[col].dtype == x_val[col].dtype)
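
To make the point about categories concrete, here is a minimal sketch using only pandas and toy data (the series `a` and `b` are made up for illustration). It shows that two columns can both be categorical and still fail a dtype equality check when one of them never saw a category.

```python
import pandas as pd

# Two toy categorical series; "z" never appears in the second one
a = pd.Series(["x", "y", "z"], dtype="category")
b = pd.Series(["x", "y"], dtype="category")

# Both are categorical, but their category lists differ
both_categorical = isinstance(a.dtype, pd.CategoricalDtype) and isinstance(
    b.dtype, pd.CategoricalDtype
)
dtypes_equal = a.dtype == b.dtype

# Casting b to a's dtype gives it the same category list
b_aligned = b.astype(a.dtype)
aligned_equal = a.dtype == b_aligned.dtype

print(both_categorical, dtypes_equal, aligned_equal)
```

So printing `df.dtypes` and seeing `category` on both sides is not enough; the categories themselves have to match.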

Answer 2 · 2022-12-12T17:24:07.000Z

They are ok. Any chance this is related to the encoding of the categorical variables?

Answer 3 · 2022-12-12T17:29:26.000Z

Hmmm here is the relevant code block:

            assert all(
                [
                    self.working_data[col].dtype == new_data[col].dtype
                    for col in self.working_data.columns
                ]
            ), "Column types are not the same as the original data. Check categorical columns."

Where new_data is the data being imputed and self.working_data is the data stored in the kernel. Can you set up a loop to see which columns specifically are causing that to return false?

Answer 4 · 2022-12-12T17:45:36.000Z

Should I include it after the error comes up?

Answer 5 · 2022-12-12T18:14:39.000Z

Run this code:

for col in kernel.working_data.columns:
    print(f"{col}  {kernel.working_data[col].dtype == new_data[col].dtype}")

Change kernel to the name of your kernel, and new_data to the name of the new data.
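
If the True/False output is hard to scan, a slightly more detailed version of the same check may help. This is only a sketch: `find_dtype_mismatches` is a hypothetical helper, not part of miceforest, and the two toy DataFrames stand in for `kernel.working_data` and the new data.

```python
import pandas as pd


def find_dtype_mismatches(reference: pd.DataFrame, new: pd.DataFrame) -> dict:
    """Return only the columns whose dtypes differ, with category details."""
    mismatches = {}
    for col in reference.columns:
        ref_dtype, new_dtype = reference[col].dtype, new[col].dtype
        if ref_dtype == new_dtype:
            continue
        detail = {"reference": str(ref_dtype), "new": str(new_dtype)}
        # For categorical columns, report which categories differ
        if isinstance(ref_dtype, pd.CategoricalDtype) and isinstance(
            new_dtype, pd.CategoricalDtype
        ):
            ref_cats, new_cats = set(ref_dtype.categories), set(new_dtype.categories)
            detail["missing_in_new"] = sorted(ref_cats - new_cats)
            detail["extra_in_new"] = sorted(new_cats - ref_cats)
        mismatches[col] = detail
    return mismatches


# Toy stand-ins for kernel.working_data and the new data
reference = pd.DataFrame(
    {"color": pd.Categorical(["red", "green", "blue"]), "size": [1.0, 2.0, 3.0]}
)
new = pd.DataFrame({"color": pd.Categorical(["red", "green"]), "size": [4.0, 5.0]})

result = find_dtype_mismatches(reference, new)
print(result)
```

Columns that match are skipped, so a single False is harder to overlook.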

Answer 6 · 2022-12-12T18:30:38.000Z

They are all True. This is the error I get when I try to impute new data.

Answer 7 · 2022-12-12T18:37:06.000Z

Hmmm if that code above is printing true for all columns then I don't see how this error can be occurring. The code above reproduces the check that impute_new_data does, and new_data is not edited in any way before the check occurs.

Answer 8 · 2022-12-12T20:31:34.000Z

I just realized one of the columns returned False. I'll try to run the code without this column.

Answer 9 · 2022-12-12T21:27:43.000Z

I ran the code without the columns that returned False and it worked as expected.

Answer 10 · 2022-12-13T15:04:03.000Z

For the record, this has to do with train_test_split. For some of the categorical variables in the dataset, not all of the categories end up in the test and validation sets.
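
One way a split can lead to this, sketched with toy data: if the conversion to `category` happens separately on each split, every split infers its categories from the values it happens to contain. Whether that was the exact order of operations in the pipeline above is an assumption; positional slicing is used here as a stand-in for train_test_split.

```python
import pandas as pd

df = pd.DataFrame({"city": ["A", "B", "C", "A", "B", "A"]})

# Stand-in for train_test_split: "C" only lands in the training part
train_raw, test_raw = df.iloc[:4], df.iloc[4:]

# Converting each split on its own infers different category lists
train_late = train_raw.astype({"city": "category"})
test_late = test_raw.astype({"city": "category"})
late_equal = train_late["city"].dtype == test_late["city"].dtype

# Converting before the split keeps one shared dtype across both parts
df_cat = df.astype({"city": "category"})
train_early, test_early = df_cat.iloc[:4], df_cat.iloc[4:]
early_equal = train_early["city"].dtype == test_early["city"].dtype

print(late_equal, early_equal)
```

Note that converting before the split leaves categories in a split that no row uses, which may matter for other checks.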

Answer 11 · 2023-03-03T21:16:54.000Z

I am also running into this, and I think it would be great if miceforest could handle this situation automatically in a principled way that doesn't involve dropping an entire column.

Note that setting the categories of the training set column to include all categories from the test set column doesn't work either, because that runs into another check on line 361 ("has unused categories").
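
A possible workaround in the other direction, as a sketch only: cast the categorical columns of the new data to the dtypes of the training data, rather than widening the training categories. This uses plain pandas on toy data and makes the dtype comparison quoted earlier in the thread pass. I have not verified whether miceforest applies the "has unused categories" check to new data as well, so treat that part as an open assumption. Values that were never seen in training become missing after the cast.

```python
import pandas as pd

# Toy data: "D" appears in the test set but was never seen in training
x_train = pd.DataFrame({"city": pd.Categorical(["A", "B", "C", "A"])})
x_test = pd.DataFrame({"city": pd.Categorical(["B", "D"])})

# Cast each categorical column of the new data to the training dtype
x_test_aligned = x_test.copy()
for col in x_train.select_dtypes("category").columns:
    x_test_aligned[col] = x_test_aligned[col].astype(x_train[col].dtype)

# Same comparison as the assertion quoted earlier in the thread
dtypes_match = all(
    x_train[col].dtype == x_test_aligned[col].dtype for col in x_train.columns
)

# Unseen values turn into NaN, so they would be treated as missing
unseen_became_missing = x_test_aligned["city"].isna().tolist()

print(dtypes_match, unseen_became_missing)
```

Whether turning unseen categories into missing values is acceptable depends on the data; it avoids dropping the whole column, but it does discard those values.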