
Polars deprecation inbound


Describe the bug

I was working on a dataset with skrub together with polars and I noticed this warning pop up while running the TableVectorizer.

/Users/vincent/Development/probabl/venv/lib/python3.11/site-packages/skrub/ FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  column = column.replace(STR_NA_VALUES, np.nan).replace(r"^\s+$", np.nan, regex=True)

Figured I'd mention it here just as a ping. It seems like something that may break in the future. If folks appreciate it I can also have a deeper look.
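
For anyone reading along, here is a minimal pandas-only sketch of the pattern on the warned line, with the dtype inference made explicit in the spirit of what the warning message suggests. `NA_STRINGS` and `clean_column` are hypothetical names for illustration; the real `STR_NA_VALUES` list is not shown in this report. On pandas 2.2 the `replace` call may still emit the FutureWarning; the point is only that the final dtype no longer depends on the silent downcasting.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for skrub's STR_NA_VALUES (the real list is not shown here).
NA_STRINGS = ["N/A", "NA", "null"]


def clean_column(column: pd.Series) -> pd.Series:
    """Replace NA-like strings and whitespace-only cells with NaN."""
    replaced = column.replace(NA_STRINGS, np.nan).replace(
        r"^\s+$", np.nan, regex=True
    )
    # Ask for dtype inference explicitly instead of relying on `replace`
    # to downcast. The warning text mentions infer_objects(copy=False);
    # the plain call is used here and requests the same inference.
    return replaced.infer_objects()


# An object-dtype column that mostly holds numbers.
numbers_as_objects = pd.Series([0, 1, "N/A", "  "], dtype=object)
cleaned = clean_column(numbers_as_objects)
print(cleaned.dtype)
print(cleaned.isna().tolist())
```

The other route named in the warning is the `future.no_silent_downcasting` option, which opts in to the new behavior globally rather than per call.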

Steps/Code to Reproduce

dicts = [{'id': 0,
  'Gender': 'Male',
  'Age': 24.443011,
  'Height': 1.699998,
  'Weight': 81.66995,
  'family_history_with_overweight': 'yes',
  'FAVC': 'yes',
  'FCVC': 2.0,
  'NCP': 2.983297,
  'CAEC': 'Sometimes',
  'SMOKE': 'no',
  'CH2O': 2.763573,
  'SCC': 'no',
  'FAF': 0.0,
  'TUE': 0.976473,
  'CALC': 'Sometimes',
  'MTRANS': 'Public_Transportation'},
 {'id': 1,
  'Gender': 'Female',
  'Age': 18.0,
  'Height': 1.56,
  'Weight': 57.0,
  'family_history_with_overweight': 'yes',
  'FAVC': 'yes',
  'FCVC': 2.0,
  'NCP': 3.0,
  'CAEC': 'Frequently',
  'SMOKE': 'no',
  'CH2O': 2.0,
  'SCC': 'no',
  'FAF': 1.0,
  'TUE': 1.0,
  'CALC': 'no',
  'MTRANS': 'Automobile'},
 {'id': 2,
  'Gender': 'Female',
  'Age': 18.0,
  'Height': 1.71146,
  'Weight': 50.165754,
  'family_history_with_overweight': 'yes',
  'FAVC': 'yes',
  'FCVC': 1.880534,
  'NCP': 1.411685,
  'CAEC': 'Sometimes',
  'SMOKE': 'no',
  'CH2O': 1.910378,
  'SCC': 'no',
  'FAF': 0.866045,
  'TUE': 1.673584,
  'CALC': 'no',
  'MTRANS': 'Public_Transportation'},
 {'id': 3,
  'Gender': 'Female',
  'Age': 20.952737,
  'Height': 1.71073,
  'Weight': 131.274851,
  'family_history_with_overweight': 'yes',
  'FAVC': 'yes',
  'FCVC': 3.0,
  'NCP': 3.0,
  'CAEC': 'Sometimes',
  'SMOKE': 'no',
  'CH2O': 1.674061,
  'SCC': 'no',
  'FAF': 1.467863,
  'TUE': 0.780199,
  'CALC': 'Sometimes',
  'MTRANS': 'Public_Transportation'},
 {'id': 4,
  'Gender': 'Male',
  'Age': 31.641081,
  'Height': 1.914186,
  'Weight': 93.798055,
  'family_history_with_overweight': 'yes',
  'FAVC': 'yes',
  'FCVC': 2.679664,
  'NCP': 1.971472,
  'CAEC': 'Sometimes',
  'SMOKE': 'no',
  'CH2O': 1.979848,
  'SCC': 'no',
  'FAF': 1.967973,
  'TUE': 0.931721,
  'CALC': 'Sometimes',
  'MTRANS': 'Public_Transportation'}]

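import polars as pl
from skrub import TableVectorizer

# Presumably the remaining steps: build a polars frame and run the vectorizer.
df = pl.DataFrame(dicts)
TableVectorizer().fit_transform(df)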


Expected Results

No warning would be swell.

Actual Results

Mentioned above.


Versions

System:
    python: 3.11.6 (v3.11.6:8b6ee5ba3b, Oct  2 2023, 11:18:21) [Clang 13.0.0 (clang-1300.0.29.30)]
executable: /Users/vincent/Development/probabl/venv/bin/python
   machine: macOS-13.4.1-arm64-arm-64bit

Python dependencies:
      sklearn: 1.4.0
          pip: 23.2.1
   setuptools: 65.5.0
        numpy: 1.26.3
        scipy: 1.12.0
       Cython: None
       pandas: 2.2.0
   matplotlib: 3.8.2
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/vincent/Development/probabl/venv/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
threading_layer: pthreads
   architecture: armv8

       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /Users/vincent/Development/probabl/venv/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/vincent/Development/probabl/venv/lib/python3.11/site-packages/scipy/.dylibs/libopenblas.0.dylib
threading_layer: pthreads
   architecture: armv8

Thanks for the report! The deprecation actually seems to come from pandas: the current version of the TableVectorizer simply converts polars dataframes to pandas, which will change after #877. After that PR, pd.NA should also be used instead of np.nan, so the warning might go away.

Here is a minimal reproducer to add as a regression test:

import polars as pl
from skrub import TableVectorizer

df = pl.DataFrame(dict(a=[0, 1], b=['a', 'b']))
# Presumably followed by the call that triggers the warning:
TableVectorizer().fit_transform(df)

(The problem seems to be that check_X calls pd.DataFrame(polars_dataframe), which yields a frame whose columns all have object dtype.)
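
A regression test along these lines could escalate FutureWarning to an error around the call under test. The sketch below is an assumption about how such a test might look, not skrub's actual test. `assert_no_future_warning`, `noisy` and `quiet` are hypothetical helpers, and the stand-in functions replace the real `TableVectorizer().fit_transform(df)` call so that the snippet runs without skrub or polars installed.

```python
import warnings


def assert_no_future_warning(func, *args, **kwargs):
    """Call func and let any FutureWarning surface as an exception."""
    with warnings.catch_warnings():
        # Inside this block every FutureWarning is raised instead of printed.
        warnings.simplefilter("error", FutureWarning)
        return func(*args, **kwargs)


def noisy():
    # Stand-in for a call that still hits the deprecated code path.
    warnings.warn("Downcasting behavior in `replace` is deprecated", FutureWarning)
    return "done"


def quiet():
    # Stand-in for a call that no longer warns.
    return "done"


print(assert_no_future_warning(quiet))

try:
    assert_no_future_warning(noisy)
except FutureWarning as exc:
    print("caught:", exc)
```

In an actual test the stand-ins would be swapped for the vectorizer call on the small polars frame above.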

Checked that this is fixed after #902.