skrub-data/skrub

Polars deprecation inbound

Closed this issue · 4 comments

Describe the bug

I was working on a dataset with skrub together with polars and I noticed this warning pop up while running the TableVectorizer.

/Users/vincent/Development/probabl/venv/lib/python3.11/site-packages/skrub/_table_vectorizer.py:101: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  column = column.replace(STR_NA_VALUES, np.nan).replace(r"^\s+$", np.nan, regex=True)

Figured I'd mention it here just as a ping. It seems like something that may break in the future. If folks appreciate it I can also have a deeper look.

Steps/Code to Reproduce

dicts = [{'id': 0,
  'Gender': 'Male',
  'Age': 24.443011,
  'Height': 1.699998,
  'Weight': 81.66995,
  'family_history_with_overweight': 'yes',
  'FAVC': 'yes',
  'FCVC': 2.0,
  'NCP': 2.983297,
  'CAEC': 'Sometimes',
  'SMOKE': 'no',
  'CH2O': 2.763573,
  'SCC': 'no',
  'FAF': 0.0,
  'TUE': 0.976473,
  'CALC': 'Sometimes',
  'MTRANS': 'Public_Transportation'},
 {'id': 1,
  'Gender': 'Female',
  'Age': 18.0,
  'Height': 1.56,
  'Weight': 57.0,
  'family_history_with_overweight': 'yes',
  'FAVC': 'yes',
  'FCVC': 2.0,
  'NCP': 3.0,
  'CAEC': 'Frequently',
  'SMOKE': 'no',
  'CH2O': 2.0,
  'SCC': 'no',
  'FAF': 1.0,
  'TUE': 1.0,
  'CALC': 'no',
  'MTRANS': 'Automobile'},
 {'id': 2,
  'Gender': 'Female',
  'Age': 18.0,
  'Height': 1.71146,
  'Weight': 50.165754,
  'family_history_with_overweight': 'yes',
  'FAVC': 'yes',
  'FCVC': 1.880534,
  'NCP': 1.411685,
  'CAEC': 'Sometimes',
  'SMOKE': 'no',
  'CH2O': 1.910378,
  'SCC': 'no',
  'FAF': 0.866045,
  'TUE': 1.673584,
  'CALC': 'no',
  'MTRANS': 'Public_Transportation'},
 {'id': 3,
  'Gender': 'Female',
  'Age': 20.952737,
  'Height': 1.71073,
  'Weight': 131.274851,
  'family_history_with_overweight': 'yes',
  'FAVC': 'yes',
  'FCVC': 3.0,
  'NCP': 3.0,
  'CAEC': 'Sometimes',
  'SMOKE': 'no',
  'CH2O': 1.674061,
  'SCC': 'no',
  'FAF': 1.467863,
  'TUE': 0.780199,
  'CALC': 'Sometimes',
  'MTRANS': 'Public_Transportation'},
 {'id': 4,
  'Gender': 'Male',
  'Age': 31.641081,
  'Height': 1.914186,
  'Weight': 93.798055,
  'family_history_with_overweight': 'yes',
  'FAVC': 'yes',
  'FCVC': 2.679664,
  'NCP': 1.971472,
  'CAEC': 'Sometimes',
  'SMOKE': 'no',
  'CH2O': 1.979848,
  'SCC': 'no',
  'FAF': 1.967973,
  'TUE': 0.931721,
  'CALC': 'Sometimes',
  'MTRANS': 'Public_Transportation'}]


import polars as pl
from skrub import TableVectorizer

TableVectorizer().fit_transform(pl.DataFrame(dicts))

Expected Results

No warning would be swell.

Actual Results

The FutureWarning shown above.

Versions

System:
    python: 3.11.6 (v3.11.6:8b6ee5ba3b, Oct  2 2023, 11:18:21) [Clang 13.0.0 (clang-1300.0.29.30)]
executable: /Users/vincent/Development/probabl/venv/bin/python
   machine: macOS-13.4.1-arm64-arm-64bit

Python dependencies:
      sklearn: 1.4.0
          pip: 23.2.1
   setuptools: 65.5.0
        numpy: 1.26.3
        scipy: 1.12.0
       Cython: None
       pandas: 2.2.0
   matplotlib: 3.8.2
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/vincent/Development/probabl/venv/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: armv8

       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /Users/vincent/Development/probabl/venv/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/vincent/Development/probabl/venv/lib/python3.11/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: armv8
0.1.0

Thanks for the report! The deprecation actually seems to come from pandas, because the current version of the TableVectorizer just converts polars dataframes to pandas, which will change after #877. After that PR, pd.NA should also be used instead of np.nan, so the warning might go away.

Here is a minimal reproducer, to add as a regression test:

import polars as pl
from skrub import TableVectorizer

df = pl.DataFrame(dict(a=[0, 1], b=['a', 'b']))
TableVectorizer().fit_transform(df)

(The problem seems to be that check_X calls pd.DataFrame(polars_dataframe) and ends up with all object columns.)
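The object-column situation can be sketched with pandas alone. The frame below is an assumed stand-in for the result of the conversion (built explicitly with `dtype=object` rather than from a polars dataframe), and the `["N/A"]` list is a placeholder for `STR_NA_VALUES`. Whether the FutureWarning is actually emitted depends on the installed pandas version, so the snippet only counts it.

```python
import warnings

import numpy as np
import pandas as pd

# Assumption: stand-in for a converted frame where every column is object dtype.
obj_df = pd.DataFrame({"a": [0, 1], "b": ["a", "b"]}, dtype=object)
print(obj_df.dtypes)

# Record any warning raised by replace on the integer-valued object column.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    replaced = obj_df["a"].replace(["N/A"], np.nan)

future_warnings = [w for w in caught if issubclass(w.category, FutureWarning)]
print(f"FutureWarnings recorded: {len(future_warnings)}")

# infer_objects converts object columns holding plain integers back to int64.
restored = obj_df.infer_objects()
print(restored.dtypes)
```

If the columns keep a proper numeric dtype in the first place, `replace` has nothing to downcast, which is consistent with the warning disappearing once the conversion no longer produces object columns.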

Checked that this is fixed after #902.