timestamps are not necessarily unique
minrk opened this issue · 1 comment
Playing with vbench for pyzmq, I discovered that my timestamps are not unique (610 unique timestamps in 611 commits), which causes an error when parsing the git log:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
/Users/minrk/dev/ip/pyzmq/tools/bench/multipart.py in <module>()
55 runner = BenchmarkRunner(benchmarks, REPO_PATH, REPO_URL,
56 BUILD, DB_PATH, TMP_DIR, PREPARE,
---> 57 run_option='eod', start_date=START_DATE)
58 runner.run()
/Users/minrk/dev/py/vbench/vbench/runner.pyc in __init__(self, benchmarks, repo_path, repo_url, build_cmd, db_path, tmp_dir, preparation_cmd, run_option, start_date, overwrite)
39 self.db_path = db_path
40
---> 41 self.repo = GitRepo(self.repo_path)
42 self.db = BenchmarkDB(db_path)
43
/Users/minrk/dev/py/vbench/vbench/git.py in __init__(self, repo_path)
26 self.repo_path = repo_path
27 self.git = _git_command(self.repo_path)
---> 28 (self.shas, self.messages, self.timestamps) = self._parse_commit_log()
29
30 @property
/Users/minrk/dev/py/vbench/vbench/git.py in _parse_commit_log(self)
63 import IPython
64 IPython.embed()
---> 65 return shas[::-1], messages[::-1], timestamps[::-1]
66
67 def get_churn(self, omit_shas=None, omit_paths=None):
/Users/minrk/dev/py/pandas/pandas/core/series.pyc in __getitem__(self, key)
252 # Label-based
253 try:
--> 254 return self.index._engine.get_value(self, key)
255 except KeyError, e1:
256 if isinstance(self.index, MultiIndex):
/Users/minrk/dev/py/pandas/pandas/_engines.so in pandas._engines.IndexEngine.get_value (pandas/src/engines.c:1474)()
/Users/minrk/dev/py/pandas/pandas/_engines.so in pandas._engines.IndexEngine.get_value (pandas/src/engines.c:1369)()
/Users/minrk/dev/py/pandas/pandas/_engines.so in pandas._engines.DictIndexEngine.get_loc (pandas/src/engines.c:2467)()
/Users/minrk/dev/py/pandas/pandas/_engines.so in pandas._engines.DictIndexEngine.get_loc (pandas/src/engines.c:2416)()
Exception: Index values are not unique
So either you can't rely on timestamps as an index or you need to handle timestamp collisions, which are presumably rare. In my local copy, I just skipped entries that collide with previously parsed ones.
For the same reason, you might use the whole sha instead of the abbreviation.
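For reference, the workaround I mean looks roughly like the sketch below. It is not the actual vbench code: `drop_timestamp_collisions` and the `(sha, message, timestamp)` tuple layout are names I made up for illustration, and the sample commits are invented. It keeps the first commit seen for each timestamp and skips later ones that collide.

```python
from datetime import datetime


def drop_timestamp_collisions(commits):
    """Keep only the first commit seen for each timestamp.

    `commits` is an iterable of (sha, message, timestamp) tuples in the
    order they were parsed. This layout is an assumption for the sketch,
    not vbench's internal representation.
    """
    seen = set()
    kept = []
    for sha, message, timestamp in commits:
        if timestamp in seen:
            # A previously parsed commit already owns this timestamp.
            continue
        seen.add(timestamp)
        kept.append((sha, message, timestamp))
    return kept


# Invented sample data: the first two commits share a timestamp.
commits = [
    ("a" * 40, "first", datetime(2012, 1, 1, 12, 0, 0)),
    ("b" * 40, "second", datetime(2012, 1, 1, 12, 0, 0)),
    ("c" * 40, "third", datetime(2012, 1, 2, 9, 30, 0)),
]
unique_commits = drop_timestamp_collisions(commits)
```

The obvious downside is that the skipped commits never get benchmarked, so this only makes sense if collisions really are rare.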
This is also somewhat related to a minor internal pandas issue that I just pushed a fix for (it should be possible to do shas[::-1] while avoiding the "uniqueness" check). We will probably want to write some tests at some point to get robust handling of duplicate timestamps / git hashes.
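As a possible starting point for such tests, here is a small sketch of a helper that reports which values occur more than once. The name `duplicated_values` is hypothetical and not part of vbench or pandas. It could be pointed at either the parsed timestamps or the abbreviated shas.

```python
from collections import Counter


def duplicated_values(values):
    """Return a sorted list of the values that appear more than once."""
    counts = Counter(values)
    return sorted(value for value, n in counts.items() if n > 1)


# Invented abbreviated shas: "1a2b3c4" appears twice.
short_shas = ["1a2b3c4", "5d6e7f8", "1a2b3c4", "9a0b1c2"]
collisions = duplicated_values(short_shas)
```

A test could then assert that the list is empty after parsing, or that the parser copes when it is not.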