HunterMcGushion/hyperparameter_hunter

Dataset hashes change between Pandas versions

HunterMcGushion opened this issue · 0 comments

Problem

  • Updating Pandas from 0.24.2 to 0.25.0 produces different Environment.cross_experiment_key values
    • Of all parameters included in the cross_experiment_key, only the (hashed) values of the
      datasets actually changed
  • This can be seen in the broken TravisCI build #738
    • Note: That build was run for PR #165, which fixes an unrelated bug concerning Categorical.optional Feature Engineering steps
  • Specifically, the following tests in tests.test_environment are failing:
    • test_environment_init_cv_params (both scenarios)
    • test_environment_init_metrics (both scenarios)
    • test_environment_init_cross_experiment_params (both scenarios)
  • All six of the failing tests check for an expected Environment.cross_experiment_key value

Battle-Plan

  • Need dataset hashes to be consistent across different Pandas versions
  • Add new clause to :func:hyperparameter_hunter.keys.hashing.to_hashable to handle DataFrames
    • Because to_hashable is used by make_hash_sha256, this change will apply not only to key hashes, but also to the WIP hashes generated by :mod:feature_engineering to track changes made by different EngineerSteps
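
To illustrate why a change in to_hashable reaches every hash built on top of it, here is a rough sketch. The names to_hashable_sketch and make_hash_sha256_sketch are hypothetical stand-ins written for this issue, not the actual code in hyperparameter_hunter.keys.hashing, and the DataFrame branch uses pd.util.hash_pandas_object only as a placeholder for whichever clause ends up being chosen.

```python
import base64
import hashlib

import pandas as pd


def to_hashable_sketch(obj):
    """Simplified stand-in for to_hashable (NOT the library's implementation)."""
    if isinstance(obj, pd.DataFrame):
        # Hypothetical new clause: reduce the frame to a fixed-size digest string
        row_hashes = pd.util.hash_pandas_object(obj, index=True).values
        return hashlib.sha256(row_hashes).hexdigest()
    if isinstance(obj, dict):
        # Sort by key so insertion order does not affect the result
        return tuple(sorted((k, to_hashable_sketch(v)) for k, v in obj.items()))
    if isinstance(obj, (list, tuple)):
        return tuple(to_hashable_sketch(v) for v in obj)
    return obj


def make_hash_sha256_sketch(obj):
    """Stand-in for make_hash_sha256: hash the repr of the hashable form."""
    digest = hashlib.sha256(repr(to_hashable_sketch(obj)).encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).decode("utf-8")


# Made-up parameters, loosely shaped like what a cross-experiment key might cover
train_df = pd.DataFrame({"x": [1, 2, 3], "y": [0, 1, 0]})
params = {"train_dataset": train_df, "cv_params": {"n_splits": 5}}
params_copy = {"train_dataset": train_df.copy(), "cv_params": {"n_splits": 5}}

edited_df = train_df.copy()
edited_df.loc[0, "x"] = 99
params_edited = {"train_dataset": edited_df, "cv_params": {"n_splits": 5}}

key = make_hash_sha256_sketch(params)
key_copy = make_hash_sha256_sketch(params_copy)
key_edited = make_hash_sha256_sketch(params_edited)
print(key == key_copy, key == key_edited)
```

Because the DataFrame branch sits inside the recursive helper, any caller that hashes a structure containing a DataFrame picks up the new behavior without changes of its own.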

Options for New DataFrame Clause

return hashlib.sha256(
    pd.util.hash_pandas_object(obj, index=True).values
).hexdigest()

or something like

return obj.to_csv().encode("utf-8")

Both produce consistent values for the datasets under Pandas 0.24.2 and 0.25.0. The first feels safer. The second is easier to understand, and it returns a representation of the object rather than an actual hash, which is the intended purpose of to_hashable.
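
For a side-by-side comparison, the two options can be wrapped as standalone functions. This is only a sketch: hash_via_row_hashes and repr_via_csv are hypothetical names, and the tiny DataFrame is made up for illustration.

```python
import hashlib

import pandas as pd


def hash_via_row_hashes(df):
    """Option 1: SHA-256 over the per-row uint64 hashes (index included)."""
    row_hashes = pd.util.hash_pandas_object(df, index=True).values
    return hashlib.sha256(row_hashes).hexdigest()


def repr_via_csv(df):
    """Option 2: the CSV text of the frame, encoded as UTF-8 bytes."""
    return df.to_csv().encode("utf-8")


df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})
same = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

changed = df.copy()
changed.loc[0, "a"] = 99

renamed = df.rename(columns={"b": "c"})

# Both options agree for frames with identical contents
print(hash_via_row_hashes(df) == hash_via_row_hashes(same))
print(repr_via_csv(df) == repr_via_csv(same))

# Both options notice a changed value
print(hash_via_row_hashes(df) != hash_via_row_hashes(changed))
print(repr_via_csv(df) != repr_via_csv(changed))

# Only the CSV representation notices a renamed column
print(hash_via_row_hashes(df) == hash_via_row_hashes(renamed))
print(repr_via_csv(df) != repr_via_csv(renamed))
```

Two differences are worth weighing. The first option always yields a 64-character digest, while the CSV bytes grow with the dataset. Also, as far as I can tell, hash_pandas_object covers the values and (with index=True) the index but not the column labels, so the first option on its own would not notice a renamed column, whereas the CSV text includes the header row.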