Dataset hashes change between Pandas versions

Question

HunterMcGushion opened this issue 5 years ago · 0 comments

HunterMcGushion commented 5 years ago

Problem

Updating Pandas from 0.24.2 to 0.25.0 produces different Environment.cross_experiment_key values
- Of all parameters included in the cross_experiment_key, only the (hashed) values of the
  datasets actually changed
This can be seen in the broken TravisCI build #738
- Note: That build was run for PR #165, which fixes an unrelated bug concerning Categorical.optional Feature Engineering steps
Specifically, the following tests in tests.test_environment are failing:
- test_environment_init_cv_params (both scenarios)
- test_environment_init_metrics (both scenarios)
- test_environment_init_cross_experiment_params (both scenarios)
All six failing tests compare against an expected Environment.cross_experiment_key value

Battle-Plan

Need dataset hashes to be consistent across different Pandas versions
Add a new clause to :func:hyperparameter_hunter.keys.hashing.to_hashable to handle DataFrames
- Because to_hashable is used by make_hash_sha256, this change will apply not only to key hashes, but also to the WIP hashes generated by :mod:feature_engineering to track changes made by different EngineerSteps

Options for New DataFrame Clause

return hashlib.sha256(
    pd.util.hash_pandas_object(obj, index=True).values
).hexdigest()

or something like

return obj.to_csv().encode("utf-8")

Both produce consistent values for datasets under Pandas 0.24.2 and 0.25.0. The first feels safer. The second is easier to understand, and it returns a representation of the object rather than an actual hash, which matches the intended purpose of to_hashable
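
To make the trade-off concrete, here is a minimal sketch of the two options as standalone functions. The names hash_via_pandas and repr_via_csv and the sample DataFrame are hypothetical, chosen only for illustration. This is not the actual to_hashable code, and running it under a single Pandas version only shows that each option is deterministic and sensitive to changed values, not that it is stable across versions.

```python
import hashlib

import pandas as pd


def hash_via_pandas(obj: pd.DataFrame) -> str:
    """Option 1 (hypothetical name): SHA-256 over pandas' per-row uint64 hashes."""
    # hash_pandas_object returns a uint64 Series with one hash per row;
    # index=True folds the index values into each row hash
    row_hashes = pd.util.hash_pandas_object(obj, index=True).values
    return hashlib.sha256(row_hashes).hexdigest()


def repr_via_csv(obj: pd.DataFrame) -> bytes:
    """Option 2 (hypothetical name): UTF-8 bytes of the CSV text, not a hash."""
    return obj.to_csv().encode("utf-8")


# Illustrative data only
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
same = df.copy()
changed = df.assign(a=[1, 2, 4])

print(hash_via_pandas(df) == hash_via_pandas(same))     # identical data
print(hash_via_pandas(df) == hash_via_pandas(changed))  # one value differs
print(repr_via_csv(df) == repr_via_csv(same))
print(repr_via_csv(df) == repr_via_csv(changed))
```

One difference worth checking before choosing: as far as I can tell, hash_pandas_object hashes the values and the index but not the column labels, so renaming a column would leave option 1 unchanged, whereas the CSV text in option 2 includes the header row.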