skrub-data/skrub

fetch_ken_types gives same results for many embedding_table_id's

Closed this issue · 2 comments

Describe the bug

Calling the fetch_ken_types function with different embedding_table_id values often returns exactly the same dataset. I'm not sure whether this is intended.

Steps/Code to Reproduce

from skrub.datasets import fetch_ken_types

fetch_ken_types(embedding_table_id="companies")
fetch_ken_types(embedding_table_id="schools")
fetch_ken_types(embedding_table_id="movies")
fetch_ken_types(embedding_table_id="albums")
fetch_ken_types(embedding_table_id="games")
fetch_ken_types()

Expected Results

For embedding_table_id = "games", we seem to get the correct result:

>>> from skrub.datasets import fetch_ken_types
>>> fetch_ken_types(embedding_table_id="games")
                                                  Type
0                             wikicat_1994_video_games
1                                   wikicat_irem_games
2                          wikicat_ea_guingamp_players
3    wikicat_video_game_companies_of_the_united_kin...
4            wikicat_asian_games_medalists_in_swimming
..                                                 ...
636                     wikicat_college_football_games
637                           wikicat_sonic_team_games
638                    wikicat_space_opera_video_games
639             wikicat_boxers_at_the_2002_asian_games
640                     wikicat_motorcycle_video_games

[641 rows x 1 columns]

Actual Results

However, for the other values of embedding_table_id, we get one identical 114509-row table each time:

>>> fetch_ken_types(embedding_table_id="companies")
                                                     Type
0                      wikicat_italian_male_screenwriters
1       wikicat_21st-century_roman_catholic_archbishop...
2                      wikicat_2000s_romantic_drama_films
3                       wikicat_music_festivals_in_france
4             wikicat_20th-century_american_women_artists
...                                                   ...
114504                 wikicat_chicago_white_sox_managers
114505  wikicat_unincorporated_communities_in_new_orle...
114506        wikicat_men's_feldhockey_bundesliga_players
114507        wikicat_sports_clubs_disestablished_in_1951
114508           wikicat_magazines_disestablished_in_1950

[114509 rows x 1 columns]
>>> fetch_ken_types(embedding_table_id="schools")
                                                     Type
0                      wikicat_italian_male_screenwriters
1       wikicat_21st-century_roman_catholic_archbishop...
2                      wikicat_2000s_romantic_drama_films
3                       wikicat_music_festivals_in_france
4             wikicat_20th-century_american_women_artists
...                                                   ...
114504                 wikicat_chicago_white_sox_managers
114505  wikicat_unincorporated_communities_in_new_orle...
114506        wikicat_men's_feldhockey_bundesliga_players
114507        wikicat_sports_clubs_disestablished_in_1951
114508           wikicat_magazines_disestablished_in_1950

[114509 rows x 1 columns]
>>> fetch_ken_types(embedding_table_id="movies")
                                                     Type
0                      wikicat_italian_male_screenwriters
1       wikicat_21st-century_roman_catholic_archbishop...
2                      wikicat_2000s_romantic_drama_films
3                       wikicat_music_festivals_in_france
4             wikicat_20th-century_american_women_artists
...                                                   ...
114504                 wikicat_chicago_white_sox_managers
114505  wikicat_unincorporated_communities_in_new_orle...
114506        wikicat_men's_feldhockey_bundesliga_players
114507        wikicat_sports_clubs_disestablished_in_1951
114508           wikicat_magazines_disestablished_in_1950

[114509 rows x 1 columns]
>>> fetch_ken_types(embedding_table_id="albums")
                                                     Type
0                      wikicat_italian_male_screenwriters
1       wikicat_21st-century_roman_catholic_archbishop...
2                      wikicat_2000s_romantic_drama_films
3                       wikicat_music_festivals_in_france
4             wikicat_20th-century_american_women_artists
...                                                   ...
114504                 wikicat_chicago_white_sox_managers
114505  wikicat_unincorporated_communities_in_new_orle...
114506        wikicat_men's_feldhockey_bundesliga_players
114507        wikicat_sports_clubs_disestablished_in_1951
114508           wikicat_magazines_disestablished_in_1950

[114509 rows x 1 columns]

The same table is returned even when no embedding_table_id is provided:

>>> fetch_ken_types()
                                                     Type
0                      wikicat_italian_male_screenwriters
1       wikicat_21st-century_roman_catholic_archbishop...
2                      wikicat_2000s_romantic_drama_films
3                       wikicat_music_festivals_in_france
4             wikicat_20th-century_american_women_artists
...                                                   ...
114504                 wikicat_chicago_white_sox_managers
114505  wikicat_unincorporated_communities_in_new_orle...
114506        wikicat_men's_feldhockey_bundesliga_players
114507        wikicat_sports_clubs_disestablished_in_1951
114508           wikicat_magazines_disestablished_in_1950

[114509 rows x 1 columns]
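
The outputs above were compared by eye. As a rough sketch, the same check can be done programmatically by grouping the table ids whose returned tables compare equal. The helper names group_identical_tables and check_ken_types are hypothetical and not part of skrub; the sketch only assumes that fetch_ken_types returns a pandas DataFrame (as the printed output suggests), so that DataFrame.equals can be used for the comparison.

```python
from collections.abc import Callable, Iterable
from typing import Any


def group_identical_tables(
    fetch: Callable[[str], Any], table_ids: Iterable[str]
) -> list[list[str]]:
    """Group table ids whose fetched tables compare equal.

    Hypothetical helper: `fetch` is any callable that returns an object
    with an `.equals()` method, such as a pandas DataFrame.
    """
    groups: list[tuple[Any, list[str]]] = []
    for table_id in table_ids:
        table = fetch(table_id)
        for representative, ids in groups:
            # Compare against the first table seen in each group.
            if representative.equals(table):
                ids.append(table_id)
                break
        else:
            # No existing group matched, so start a new one.
            groups.append((table, [table_id]))
    return [ids for _, ids in groups]


def check_ken_types(table_ids: Iterable[str]) -> list[list[str]]:
    """Run the grouping against skrub (needs skrub and network access)."""
    # Imported lazily so the helper above stays usable without skrub.
    from skrub.datasets import fetch_ken_types

    return group_identical_tables(
        lambda table_id: fetch_ken_types(embedding_table_id=table_id),
        table_ids,
    )
```

Calling check_ken_types(["companies", "schools", "movies", "albums", "games"]) returns a list of groups. If each id had its own types table, every group would contain a single id; any group with more than one id points at tables that are identical.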

Versions

>>> import sklearn, skrub; sklearn.show_versions(); print(skrub.__version__);

System:
    python: 3.11.7 (main, Dec  8 2023, 18:56:58) [GCC 11.4.0]
executable: /home/corey/.virtualenvs/skrub-data/bin/python
   machine: Linux-6.5.0-18-generic-x86_64-with-glibc2.35

Python dependencies:
      sklearn: 1.4.1.post1
          pip: 24.0
   setuptools: 69.1.0
        numpy: 1.26.4
        scipy: 1.12.0
       Cython: None
       pandas: 2.2.0
   matplotlib: 3.8.2
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 64
         prefix: libgomp
       filepath: /home/corey/.virtualenvs/skrub-data/lib/python3.11/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 64
         prefix: libopenblas
       filepath: /home/corey/.virtualenvs/skrub-data/lib/python3.11/site-packages/numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: Zen

       user_api: blas
   internal_api: openblas
    num_threads: 64
         prefix: libopenblas
       filepath: /home/corey/.virtualenvs/skrub-data/lib/python3.11/site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: Zen
0.1.0

Thanks for reporting this bug! @jovan-stojanovic said he is interested in working on it.

Hi @cbilot, thanks for reporting this bug! We apparently forgot to add the types tables for the categories you mentioned.
The following PR should resolve it: skrub-data/datasets#8