MaxHalford/prince

mca: "ValueError: dimension mismatch"

cmougan opened this issue · 4 comments

After training a scikit-learn pipeline with MCA, I try to use it on the test set (see code below) and get "ValueError: dimension mismatch" (full log further down).

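import numpy as np
import prince
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import Pipeline

# TypeSelector is a custom transformer defined elsewhere (not shown in this issue).
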
mca = prince.MCA(
    n_components=20,
    n_iter=3,
    copy=True,
    check_input=True,
    engine="auto",
    random_state=42,

)
enet = ElasticNet()

pipe_mca = Pipeline(
    [("mca", mca), ("type", TypeSelector(np.number)), ("enet", enet)]
)

pipe_mca.fit(X_train[["Country", "FormalEducation"]],y_train);


Pipeline(pipe_mca.steps[:-1]).transform(X_train[["Country", "FormalEducation"]]).head()

print(
    "MAE in train set for MCA: ",
    mean_absolute_error(pipe_mca.predict(X_train[["Country", "FormalEducation"]]), y_train)
)

print(
    "MAE in test set for MCA: ",
    mean_absolute_error(pipe_mca.predict(X_test[["Country", "FormalEducation"]]), y_test)
)

I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-35-3f35a6c3613b> in <module>
      1 print(
      2     "MAE in test set for MCA: ",
----> 3     mean_absolute_error(pipe_mca.predict(X_test[["Country", "FormalEducation"]]), y_test)
      4 )

/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    114 
    115         # lambda, but not partial, allows help() to work with update_wrapper
--> 116         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    117         # update the docstring of the returned function
    118         update_wrapper(out, self.fn)

/opt/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in predict(self, X, **predict_params)
    417         Xt = X
    418         for _, name, transform in self._iter(with_final=False):
--> 419             Xt = transform.transform(Xt)
    420         return self.steps[-1][-1].predict(Xt, **predict_params)
    421 

/opt/anaconda3/lib/python3.7/site-packages/prince/mca.py in transform(self, X)
     48         if self.check_input:
     49             utils.check_array(X, dtype=[str, np.number])
---> 50         return self.row_coordinates(X)
     51 
     52     def plot_coordinates(self, X, ax=None, figsize=(6, 6), x_component=0, y_component=1,

/opt/anaconda3/lib/python3.7/site-packages/prince/mca.py in row_coordinates(self, X)
     36         if not isinstance(X, pd.DataFrame):
     37             X = pd.DataFrame(X)
---> 38         return super().row_coordinates(pd.get_dummies(X))
     39 
     40     def column_coordinates(self, X):

/opt/anaconda3/lib/python3.7/site-packages/prince/ca.py in row_coordinates(self, X)
    132 
    133         return pd.DataFrame(
--> 134             data=X @ sparse.diags(self.col_masses_.to_numpy() ** -0.5) @ self.V_.T,
    135             index=row_names
    136         )

/opt/anaconda3/lib/python3.7/site-packages/scipy/sparse/base.py in __rmatmul__(self, other)
    568             raise ValueError("Scalar operands are not allowed, "
    569                              "use '*' instead")
--> 570         return self.__rmul__(other)
    571 
    572     ####################

/opt/anaconda3/lib/python3.7/site-packages/scipy/sparse/base.py in __rmul__(self, other)
    552             except AttributeError:
    553                 tr = np.asarray(other).transpose()
--> 554             return (self.transpose() * tr).transpose()
    555 
    556     #####################################

/opt/anaconda3/lib/python3.7/site-packages/scipy/sparse/base.py in __mul__(self, other)
    518 
    519             if other.shape[0] != self.shape[1]:
--> 520                 raise ValueError('dimension mismatch')
    521 
    522             result = self._mul_multivector(np.asarray(other))

ValueError: dimension mismatch
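
The last prince frames in the traceback call `pd.get_dummies(X)` on whatever is passed to `transform`. A likely cause, assuming the test set does not contain exactly the same category values as the train set, is that the one-hot matrix built at predict time has a different number of columns than the one seen during `fit`. A minimal sketch with made-up data (the country codes are hypothetical and not from the original dataset):

```python
import pandas as pd

# Hypothetical toy data: the test set lacks one category ("ES") seen in training.
X_train = pd.DataFrame({"Country": ["FR", "DE", "ES", "FR"]})
X_test = pd.DataFrame({"Country": ["FR", "DE"]})

# get_dummies only creates columns for the values present in the frame it is given.
train_dummies = pd.get_dummies(X_train)
test_dummies = pd.get_dummies(X_test)

print(train_dummies.columns.tolist())  # three indicator columns
print(test_dummies.columns.tolist())   # only two indicator columns
```

Any matrix fitted on three indicator columns cannot be multiplied with a two-column input, which is consistent with the `dimension mismatch` raised in `scipy/sparse/base.py`.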

I'm getting the same error when transforming a test dataset. Have you found a workaround?

Not really. After some testing I realized that I could only predict on the train set; the current function does not generalize to new data.

I tried some other encoders that gave me better results on the train set (I can't say for the test set, but I expect so): https://contrib.scikit-learn.org/category_encoders/
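
One generic pandas technique for keeping the one-hot layout stable is to remember the training columns and align any new data to them. The sketch below is not a prince API. The helper name `encode_like_train` and the toy data are hypothetical, and it has not been verified here that the result can be fed to the prince version shown in the traceback.

```python
import pandas as pd

# Hypothetical toy data: the test set has an unseen value ("IT") and lacks "DE" and "ES".
X_train = pd.DataFrame({"Country": ["FR", "DE", "ES", "FR"]})
X_test = pd.DataFrame({"Country": ["FR", "IT"]})

# Remember the indicator columns produced on the training data.
train_columns = pd.get_dummies(X_train).columns


def encode_like_train(X, columns):
    """One-hot encode X, then force the training column layout.

    Columns missing from X are added and filled with 0;
    columns for values unseen in training are dropped.
    """
    dummies = pd.get_dummies(X)
    return dummies.reindex(columns=columns, fill_value=0).astype(int)


aligned = encode_like_train(X_test, train_columns)
print(aligned)
```

Rows holding a value that never appeared in training (here "IT") end up all zeros, so downstream code has to tolerate that case.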

See if this helps:

#107 (comment)

Hello there 👋

I apologise for not answering earlier. I was not maintaining Prince anymore. However, I have just refactored the entire codebase. This refactoring should have fixed many bugs.

I don’t have time and energy to check if this fixes your issue, but there is a good chance it does. Feel free to reopen this issue if the problem persists after installing the new version — that is, version 0.8.0 and onwards.