fetch_ken_types gives the same results for many embedding_table_id values
Closed this issue · 2 comments
cbilot commented
Describe the bug
Using the fetch_ken_types method with different embedding_table_id values often yields the exact same dataset. I'm not sure if this is intended.
Steps/Code to Reproduce
from skrub.datasets import fetch_ken_types
fetch_ken_types(embedding_table_id="companies")
fetch_ken_types(embedding_table_id="schools")
fetch_ken_types(embedding_table_id="movies")
fetch_ken_types(embedding_table_id="albums")
fetch_ken_types(embedding_table_id="games")
fetch_ken_types()
Expected Results
For embedding_table_id="games", we seem to get the correct result:
>>> from skrub.datasets import fetch_ken_types
>>> fetch_ken_types(embedding_table_id="games")
Type
0 wikicat_1994_video_games
1 wikicat_irem_games
2 wikicat_ea_guingamp_players
3 wikicat_video_game_companies_of_the_united_kin...
4 wikicat_asian_games_medalists_in_swimming
.. ...
636 wikicat_college_football_games
637 wikicat_sonic_team_games
638 wikicat_space_opera_video_games
639 wikicat_boxers_at_the_2002_asian_games
640 wikicat_motorcycle_video_games
[641 rows x 1 columns]
Actual Results
However, for other values of embedding_table_id, we get the same results:
>>> fetch_ken_types(embedding_table_id="companies")
Type
0 wikicat_italian_male_screenwriters
1 wikicat_21st-century_roman_catholic_archbishop...
2 wikicat_2000s_romantic_drama_films
3 wikicat_music_festivals_in_france
4 wikicat_20th-century_american_women_artists
... ...
114504 wikicat_chicago_white_sox_managers
114505 wikicat_unincorporated_communities_in_new_orle...
114506 wikicat_men's_feldhockey_bundesliga_players
114507 wikicat_sports_clubs_disestablished_in_1951
114508 wikicat_magazines_disestablished_in_1950
[114509 rows x 1 columns]
>>> fetch_ken_types(embedding_table_id="schools")
Type
0 wikicat_italian_male_screenwriters
1 wikicat_21st-century_roman_catholic_archbishop...
2 wikicat_2000s_romantic_drama_films
3 wikicat_music_festivals_in_france
4 wikicat_20th-century_american_women_artists
... ...
114504 wikicat_chicago_white_sox_managers
114505 wikicat_unincorporated_communities_in_new_orle...
114506 wikicat_men's_feldhockey_bundesliga_players
114507 wikicat_sports_clubs_disestablished_in_1951
114508 wikicat_magazines_disestablished_in_1950
[114509 rows x 1 columns]
>>> fetch_ken_types(embedding_table_id="movies")
Type
0 wikicat_italian_male_screenwriters
1 wikicat_21st-century_roman_catholic_archbishop...
2 wikicat_2000s_romantic_drama_films
3 wikicat_music_festivals_in_france
4 wikicat_20th-century_american_women_artists
... ...
114504 wikicat_chicago_white_sox_managers
114505 wikicat_unincorporated_communities_in_new_orle...
114506 wikicat_men's_feldhockey_bundesliga_players
114507 wikicat_sports_clubs_disestablished_in_1951
114508 wikicat_magazines_disestablished_in_1950
[114509 rows x 1 columns]
>>> fetch_ken_types(embedding_table_id="albums")
Type
0 wikicat_italian_male_screenwriters
1 wikicat_21st-century_roman_catholic_archbishop...
2 wikicat_2000s_romantic_drama_films
3 wikicat_music_festivals_in_france
4 wikicat_20th-century_american_women_artists
... ...
114504 wikicat_chicago_white_sox_managers
114505 wikicat_unincorporated_communities_in_new_orle...
114506 wikicat_men's_feldhockey_bundesliga_players
114507 wikicat_sports_clubs_disestablished_in_1951
114508 wikicat_magazines_disestablished_in_1950
[114509 rows x 1 columns]
And even if no embedding_table_id is provided:
>>> fetch_ken_types()
Type
0 wikicat_italian_male_screenwriters
1 wikicat_21st-century_roman_catholic_archbishop...
2 wikicat_2000s_romantic_drama_films
3 wikicat_music_festivals_in_france
4 wikicat_20th-century_american_women_artists
... ...
114504 wikicat_chicago_white_sox_managers
114505 wikicat_unincorporated_communities_in_new_orle...
114506 wikicat_men's_feldhockey_bundesliga_players
114507 wikicat_sports_clubs_disestablished_in_1951
114508 wikicat_magazines_disestablished_in_1950
[114509 rows x 1 columns]
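The duplication can also be checked programmatically rather than by comparing printed output. Below is a minimal sketch, not part of skrub: `group_identical_results` is a hypothetical helper name, and it only assumes that the fetcher accepts an `embedding_table_id` keyword and returns objects with an `.equals()` method, as pandas DataFrames do.

```python
def group_identical_results(fetch, table_ids):
    """Group table ids whose fetched tables compare equal.

    `fetch` is any callable accepting `embedding_table_id=...` and
    returning an object with an `.equals()` method (e.g. a DataFrame).
    """
    groups = []  # list of (reference_table, [ids that returned it])
    for table_id in table_ids:
        table = fetch(embedding_table_id=table_id)
        for reference, ids in groups:
            if table.equals(reference):
                ids.append(table_id)
                break
        else:
            # No existing group matched: this id starts a new group.
            groups.append((table, [table_id]))
    return [ids for _, ids in groups]


# Usage against skrub (downloads data, so left commented out here):
# from skrub.datasets import fetch_ken_types
# ids = ["companies", "schools", "movies", "albums", "games"]
# print(group_identical_results(fetch_ken_types, ids))
```

Any group containing more than one id indicates table ids that returned the same dataset. With the outputs reported above, "companies", "schools", "movies" and "albums" would be expected to land in one group, and "games" in its own.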
Versions
>>> import sklearn, skrub; sklearn.show_versions(); print(skrub.__version__);
System:
python: 3.11.7 (main, Dec 8 2023, 18:56:58) [GCC 11.4.0]
executable: /home/corey/.virtualenvs/skrub-data/bin/python
machine: Linux-6.5.0-18-generic-x86_64-with-glibc2.35
Python dependencies:
sklearn: 1.4.1.post1
pip: 24.0
setuptools: 69.1.0
numpy: 1.26.4
scipy: 1.12.0
Cython: None
pandas: 2.2.0
matplotlib: 3.8.2
joblib: 1.3.2
threadpoolctl: 3.2.0
Built with OpenMP: True
threadpoolctl info:
user_api: openmp
internal_api: openmp
num_threads: 64
prefix: libgomp
filepath: /home/corey/.virtualenvs/skrub-data/lib/python3.11/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
version: None
user_api: blas
internal_api: openblas
num_threads: 64
prefix: libopenblas
filepath: /home/corey/.virtualenvs/skrub-data/lib/python3.11/site-packages/numpy.libs/libopenblas64_p-r0-0cf96a72.3.23.dev.so
version: 0.3.23.dev
threading_layer: pthreads
architecture: Zen
user_api: blas
internal_api: openblas
num_threads: 64
prefix: libopenblas
filepath: /home/corey/.virtualenvs/skrub-data/lib/python3.11/site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so
version: 0.3.21.dev
threading_layer: pthreads
architecture: Zen
0.1.0
jeromedockes commented
Thanks for reporting this bug! @jovan-stojanovic said he is interested in working on it.
jovan-stojanovic commented
Hi @cbilot, thanks for reporting this bug! We apparently forgot to add the types tables for the categories you mentioned.
The following PR should resolve it: skrub-data/datasets#8