aws/sagemaker-scikit-learn-extension

Output size mismatch of MultiColumnTfidfVectorizer

Closed · 1 comment

MultiColumnTfidfVectorizer concatenates the per-column vectors of each row into a single vector instead of keeping them separate. Example:
For input:
corpus = np.array([["Cats eat rats.", "Rats are mammals."]])

we get the output:
[[0.57735027 0.57735027 0.57735027 0.57735027 0.57735027 0.57735027]]

rather than an output separated by column, such as:
[[0.57735027 0.57735027 0.57735027], [0.57735027 0.57735027 0.57735027]]

To elaborate, for multi-column TF-IDF we fit a separate TF-IDF vectorizer on each column. So if the input has two columns, we should expect two columns in the output. Concatenating all the column vectors in the output is inconsistent, because each column has its own vocabulary.
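A minimal sketch of the two output layouts being compared, assuming plain scikit-learn `TfidfVectorizer` instances and `scipy.sparse.hstack` stand in for the extension's own class (this is an illustration of the idea, not the library's implementation):

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = np.array([["Cats eat rats.", "Rats are mammals."]])

# Fit an independent vectorizer on each input column,
# so every column gets its own vocabulary.
per_column = []
for j in range(corpus.shape[1]):
    vectorizer = TfidfVectorizer()
    per_column.append(vectorizer.fit_transform(corpus[:, j]))

# Layout A: one TF-IDF matrix per input column (what this issue asks for).
separate_shapes = [m.shape for m in per_column]

# Layout B: all per-column matrices concatenated horizontally
# into a single feature matrix (the behavior reported above).
combined = sparse.hstack(per_column).tocsr()

print(separate_shapes)
print(combined.shape)
print(combined.toarray())
```

With this one-row corpus each column contributes three terms, so layout A gives two 1x3 matrices and layout B gives a single 1x6 row.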

Talked offline and clarified with @srinidhigoud that we are essentially treating each column as its own TF-IDF matrix. Inverse transform is not necessarily an issue in our use case, and we're not losing any information as part of a feature engineering step.
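To illustrate the "not losing any information" point, here is a hedged sketch showing that the per-column blocks can be recovered from a concatenated row as long as each column's vocabulary size is known. It assumes plain scikit-learn `TfidfVectorizer` objects, one per column; the names `vectorizers`, `widths`, and `recovered` are hypothetical and not part of the extension's API.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = np.array([["Cats eat rats.", "Rats are mammals."]])

# One fitted vectorizer per input column (illustrative stand-in).
vectorizers = [
    TfidfVectorizer().fit(corpus[:, j]) for j in range(corpus.shape[1])
]
blocks = [
    v.transform(corpus[:, j]).toarray() for j, v in enumerate(vectorizers)
]

# Concatenated output: one wide row per input row.
combined = np.hstack(blocks)

# The boundaries between columns follow from each vocabulary size,
# so the concatenated row can be split back into per-column blocks.
widths = [len(v.vocabulary_) for v in vectorizers]
offsets = np.cumsum(widths)[:-1]
recovered = np.split(combined, offsets, axis=1)

print([block.shape for block in recovered])
```

The split relies only on the fitted vocabularies, which is why the concatenated layout can still be mapped back to the original columns.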