skrub-data/skrub

Add zero padding on embeddings column names for ordering purposes

Closed this issue · 6 comments

Problem Description

This issue is not super important, but it would help to have it fixed.

The output of MinHashEncoder and GapEncoder is a dataframe whose columns are named <col>_1, <col>_2, ..., <col>_10, <col>_11. While intuitive, this becomes confusing when the columns get sorted, e.g. by the AggJoiner, because lexicographic ordering gives <col>_1, <col>_10, <col>_11, <col>_2, ...
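The ordering problem can be reproduced without skrub at all. The snippet below is only an illustration: the column prefix `col` and the count of 11 are made up, and plain string sorting stands in for whatever sorting a downstream step applies.

```python
# Hypothetical column names in the <col>_1 ... <col>_11 form described above.
cols = [f"col_{i}" for i in range(1, 12)]

# Plain string sorting compares character by character, so "col_10"
# is placed before "col_2".
sorted_cols = sorted(cols)
print(sorted_cols[:4])
```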

Feature Description

I suggest we left-pad the column index with zeros, with the width depending on the number of columns:

  • between 1 and 9 columns, no padding: 1, ..., 9
  • between 10 and 99 columns, a single leading zero: 01, ..., 99
  • and so on
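A minimal sketch of the padding rule listed above. The helper name `padded_names` and its arguments are hypothetical and not part of skrub; the sketch assumes the index starts at 1, as in the examples in this issue.

```python
def padded_names(col, n_components):
    """Build column names whose index is zero-padded to a common width."""
    # The width is the number of digits of the largest index:
    # 1 for up to 9 columns, 2 for up to 99 columns, and so on.
    width = len(str(n_components))
    return [f"{col}_{i:0{width}d}" for i in range(1, n_components + 1)]


names = padded_names("col", 11)
print(names)
```

With this naming, sorting the names as strings keeps them in numerical order.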

Alternative Solutions

Keep it as it is.

Additional Context

No response

@Vincent-Maladiere
I would like to work on this issue.

Hello @Shree7676 thank you, go ahead 😊

@Vincent-Maladiere
I am going through the code of GapEncoder and I am a bit confused about the difference between columns and labels.
Can you provide some code I can use to reproduce the problem?
Below is what I explored; the column names here don't really illustrate the issue described above.

import pandas as pd
from skrub import GapEncoder
data = pd.DataFrame({ 'category': ['apple', 'banana', 'apple', 'orange', 'banana', 'orange', 'banana'] })
encoder = GapEncoder(n_components=7, random_state=42)
encoded_data = encoder.fit_transform(data['category'])
encoded_df = pd.DataFrame(encoded_data)
print(encoded_df.columns.tolist())

It produces this output:
['category: apple, orange, banana', 'category: banana, apple, orange', 'category: orange, apple, banana', 'category: apple, orange, banana (3)', 'category: apple, orange, banana (4)', 'category: apple, orange, banana (5)', 'category: apple, orange, banana (6)']

Yes, indeed, I may have written this issue too fast. This issue was specific to the MinHashEncoder, so I don't think there is a need for the GapEncoder. Sorry for that.

Thanks for all the guidance, @Vincent-Maladiere :)

There are a lot of other issues, feel free to pick another one. If you're not sure which one to pick, we can discuss it on Discord :)