DatetimeEncoder is very slow
MarcoGorelli opened this issue · 3 comments
MarcoGorelli commented
Describe the bug
Looks like the format is being guessed for every single element, twice (once with day first, once with month first).
np.vectorize doesn't speed things up; it's just syntactic sugar.
The code below has just over 70 thousand rows, but it takes 14 seconds to execute on my laptop.
Steps/Code to Reproduce
from pprint import pprint
import pandas as pd

data = pd.DataFrame(
    {
        'date.utc': pd.date_range('1900-01-01', '2100-01-01', freq='1D').strftime('%Y-%m-%d'),
        'city': 'Paris',
        'value': 3.,
    }
)
print('data shape: ', data.shape)

# Extract our input data (X) and the target column (y)
y = data["value"]
X = data[["city", "date.utc"]]
X

from skrub import to_datetime

X = to_datetime(X)
X.dtypes

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from skrub import DatetimeEncoder

encoder = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["city"]),
    (DatetimeEncoder(add_day_of_the_week=True, resolution="minute"), ["date.utc"]),
    remainder="drop",
)
X_enc = encoder.fit_transform(X)
pprint(encoder.get_feature_names_out())
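For a sense of where the time goes, here is an illustrative sketch (not skrub's actual code) comparing per-element format guessing against guessing once and parsing vectorized; guess_datetime_format is public API as of pandas 2.0:

import timeit
import pandas as pd
from pandas.tseries.api import guess_datetime_format  # public API since pandas 2.0

s = pd.Series(
    pd.date_range('1900-01-01', '2100-01-01', freq='1D').strftime('%Y-%m-%d')
)

def per_element(values):
    # Guess the format for every single element, twice (day first and month first).
    for value in values:
        guess_datetime_format(value, dayfirst=True)
        guess_datetime_format(value, dayfirst=False)

def vectorized(values):
    # Guess once from the first element, then parse all rows in one call.
    fmt = guess_datetime_format(values.iloc[0])
    return pd.to_datetime(values, format=fmt)

print('per element:', timeit.timeit(lambda: per_element(s), number=1))
print('vectorized: ', timeit.timeit(lambda: vectorized(s), number=1))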
Expected Results
No more than 1 second, probably 😄
Actual Results
The results are correct, just too slow.
Versions
System:
    python: 3.11.6 (main, Oct 23 2023, 22:48:54) [GCC 11.4.0]
    executable: /home/marcogorelli/skrub-dev/.venv/bin/python
    machine: Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.35

Python dependencies:
    sklearn: 1.3.0
    pip: 23.1.2
    setuptools: 65.5.0
    numpy: 1.25.2
    scipy: 1.11.2
    Cython: None
    pandas: 2.1.1
    matplotlib: None
    joblib: 1.3.2
    threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
    user_api: openmp
    internal_api: openmp
    num_threads: 16
    prefix: libgomp
    filepath: /home/marcogorelli/skrub-dev/.venv/lib/python3.11/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
    version: None

    user_api: blas
    internal_api: openblas
    num_threads: 16
    prefix: libopenblas
    filepath: /home/marcogorelli/skrub-dev/.venv/lib/python3.11/site-packages/numpy.libs/libopenblas64_p-r0-5007b62f.3.23.dev.so
    version: 0.3.23.dev
    threading_layer: pthreads
    architecture: SkylakeX

    user_api: blas
    internal_api: openblas
    num_threads: 16
    prefix: libopenblas
    filepath: /home/marcogorelli/skrub-dev/.venv/lib/python3.11/site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so
    version: 0.3.21.dev
    threading_layer: pthreads
    architecture: SkylakeX
skrub: 0.0.1.dev0
GaelVaroquaux commented
Marco is right: we can't do this for every row by default.
We would need something like:
- take 10 rows
- check the format
- if it's consistent, try to apply it to all rows
- if that fails, move to the slow route (sketched below).
Does that make sense?
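A minimal sketch of that strategy, assuming pandas >= 2.0 (for the public guess_datetime_format and format="mixed"); fast_to_datetime is a hypothetical helper, not skrub's API:

import pandas as pd
from pandas.tseries.api import guess_datetime_format  # public API since pandas 2.0

def fast_to_datetime(column, sample_size=10):
    # Guess the format from a small sample instead of from every element.
    sample = column.dropna().astype(str).head(sample_size)
    formats = {guess_datetime_format(value) for value in sample}
    if len(formats) == 1 and None not in formats:
        try:
            # Consistent guess: a single vectorized parse over all rows.
            return pd.to_datetime(column, format=formats.pop())
        except ValueError:
            # The guessed format doesn't hold for every row.
            pass
    # Slow route: element-wise inference.
    return pd.to_datetime(column, format="mixed")

On the repro above, the fast path would guess the format from 10 values instead of ~73,000, then parse everything in one vectorized call.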
Vincent-Maladiere commented
Thank you for spotting this, @MarcoGorelli. We haven't checked the computational performance yet, indeed.
I think subsampling makes sense here and is a very simple solution, @GaelVaroquaux. I'll open a PR.
GaelVaroquaux commented
You guys rock! I love how we are identifying the practical bottlenecks and solving them fast