Dataset hashes change between Pandas versions
HunterMcGushion opened this issue · 0 comments
Problem
- Updating Pandas from 0.24.2 to 0.25.0 produces different Environment.cross_experiment_key values
  - Of all parameters included in the cross_experiment_key, only the (hashed) values of the datasets actually changed
- This can be seen in the broken TravisCI build #738
  - Note: This was made for PR #165, fixing an unrelated bug concerning Categorical.optional Feature Engineering steps
- Specifically, the following tests in tests.test_environment are failing:
  - test_environment_init_cv_params (both scenarios)
  - test_environment_init_metrics (both scenarios)
  - test_environment_init_cross_experiment_params (both scenarios)
- All six of the failing tests are checking for expected Environment.cross_experiment_key
Battle-Plan
- Need dataset hashes to be consistent across different Pandas versions
- Add new clause to :func:hyperparameter_hunter.keys.hashing.to_hashable to handle DataFrames
  - Because to_hashable is used by make_hash_sha256, this change will apply not only to key hashes, but also to the WIP hashes generated by :mod:feature_engineering to track changes made by different EngineerSteps
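To illustrate why a DataFrame clause in to_hashable would carry through to every hash built on top of it, here is a rough sketch. It is not the actual hyperparameter_hunter code: to_hashable_sketch and make_hash_sketch are hypothetical stand-ins, and the parameter names in the usage lines are made up for the example.

```python
import hashlib

import pandas as pd


def to_hashable_sketch(obj):
    """Hypothetical stand-in for to_hashable: recursively reduce `obj` to
    plain, deterministic Python values."""
    if isinstance(obj, pd.DataFrame):
        # Proposed new clause: collapse the frame into a digest of its row hashes
        row_hashes = pd.util.hash_pandas_object(obj, index=True).values
        return hashlib.sha256(row_hashes).hexdigest()
    if isinstance(obj, dict):
        # Sort by key so insertion order does not affect the result
        items = ((str(k), to_hashable_sketch(v)) for k, v in obj.items())
        return tuple(sorted(items, key=lambda kv: kv[0]))
    if isinstance(obj, (list, tuple)):
        return tuple(to_hashable_sketch(v) for v in obj)
    return obj


def make_hash_sketch(obj) -> str:
    """Hypothetical stand-in for make_hash_sha256: hash the reduced form."""
    return hashlib.sha256(repr(to_hashable_sketch(obj)).encode("utf-8")).hexdigest()


# Illustrative parameters only; the real key contains more than this
df = pd.DataFrame({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]})
params = {"train_dataset": df, "cv_params": {"n_splits": 5}}
same_params = {"cv_params": {"n_splits": 5}, "train_dataset": df.copy()}
other_params = {"train_dataset": df.assign(x=[1, 2, 4]), "cv_params": {"n_splits": 5}}
```

Since the DataFrame is reduced inside the recursive helper, any caller that hashes a structure containing a DataFrame picks up the new behavior without further changes.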
Options for New DataFrame Clause
return hashlib.sha256(
pd.util.hash_pandas_object(obj, index=True).values
).hexdigest()
or something like
return obj.to_csv().encode("utf-8")
Both produce consistent values for the datasets under Pandas 0.24.2 and 0.25.0. The first feels safer, whereas the second is easier to understand and returns a representation of the object rather than an actual hash, which is closer to the intended purpose of to_hashable.
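For a quick side-by-side, the two options could be wrapped as standalone helpers like this. The function names and the sample frames are invented for the comparison, and this only checks behavior within a single Pandas version, not across 0.24.2 and 0.25.0.

```python
import hashlib

import pandas as pd


def hash_dataframe(obj: pd.DataFrame) -> str:
    """Option 1: SHA-256 digest over the per-row hashes computed by pandas."""
    row_hashes = pd.util.hash_pandas_object(obj, index=True).values
    return hashlib.sha256(row_hashes).hexdigest()


def dataframe_to_csv_bytes(obj: pd.DataFrame) -> bytes:
    """Option 2: the frame's CSV text, encoded as bytes."""
    return obj.to_csv().encode("utf-8")


df_a = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
df_b = df_a.copy()               # identical content
df_c = df_a.assign(x=[1, 2, 4])  # one value changed

# Equal frames agree under both options; a changed value is detected by both
print(hash_dataframe(df_a) == hash_dataframe(df_b))
print(hash_dataframe(df_a) == hash_dataframe(df_c))
print(dataframe_to_csv_bytes(df_a) == dataframe_to_csv_bytes(df_b))
print(dataframe_to_csv_bytes(df_a) == dataframe_to_csv_bytes(df_c))
```

One practical difference: option 1 always yields a fixed-length 64-character string, while the size of option 2's output grows with the dataset.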