wesm/vbench

timestamps are not necessarily unique

minrk opened this issue · 1 comment

While trying out vbench on pyzmq, I discovered that my commit timestamps are not unique (610 unique timestamps across 611 commits), which causes an error when parsing the git log:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
/Users/minrk/dev/ip/pyzmq/tools/bench/multipart.py in <module>()
     55 runner = BenchmarkRunner(benchmarks, REPO_PATH, REPO_URL,
     56                          BUILD, DB_PATH, TMP_DIR, PREPARE,
---> 57                          run_option='eod', start_date=START_DATE)
     58 runner.run()

/Users/minrk/dev/py/vbench/vbench/runner.pyc in __init__(self, benchmarks, repo_path, repo_url, build_cmd, db_path, tmp_dir, preparation_cmd, run_option, start_date, overwrite)
     39         self.db_path = db_path
     40 
---> 41         self.repo = GitRepo(self.repo_path)
     42         self.db = BenchmarkDB(db_path)
     43 

/Users/minrk/dev/py/vbench/vbench/git.py in __init__(self, repo_path)
     26         self.repo_path = repo_path
     27         self.git = _git_command(self.repo_path)
---> 28         (self.shas, self.messages, self.timestamps) = self._parse_commit_log()
     29 
     30     @property

/Users/minrk/dev/py/vbench/vbench/git.py in _parse_commit_log(self)
     63         import IPython
     64         IPython.embed()
---> 65         return shas[::-1], messages[::-1], timestamps[::-1]
     66 
     67     def get_churn(self, omit_shas=None, omit_paths=None):

/Users/minrk/dev/py/pandas/pandas/core/series.pyc in __getitem__(self, key)
    252         # Label-based
    253         try:
--> 254             return self.index._engine.get_value(self, key)
    255         except KeyError, e1:
    256             if isinstance(self.index, MultiIndex):

/Users/minrk/dev/py/pandas/pandas/_engines.so in pandas._engines.IndexEngine.get_value (pandas/src/engines.c:1474)()

/Users/minrk/dev/py/pandas/pandas/_engines.so in pandas._engines.IndexEngine.get_value (pandas/src/engines.c:1369)()

/Users/minrk/dev/py/pandas/pandas/_engines.so in pandas._engines.DictIndexEngine.get_loc (pandas/src/engines.c:2467)()

/Users/minrk/dev/py/pandas/pandas/_engines.so in pandas._engines.DictIndexEngine.get_loc (pandas/src/engines.c:2416)()

Exception: Index values are not unique

So either you can't rely on timestamps as an index or you need to handle timestamp collisions, which are presumably rare. In my local copy, I just skipped entries that collide with previously parsed ones.
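A minimal sketch of the "skip colliding entries" workaround, using plain Python rather than vbench's actual parsing code. The function name `drop_timestamp_collisions`, the `(sha, message, timestamp)` tuple layout, and the sample commits are all assumptions made for illustration:

```python
from datetime import datetime


def drop_timestamp_collisions(commits):
    """Keep the first commit seen for each timestamp and skip later ones.

    `commits` is an iterable of (sha, message, timestamp) tuples. This
    layout is hypothetical, not vbench's internal representation.
    """
    seen = set()
    kept = []
    for sha, message, timestamp in commits:
        if timestamp in seen:
            continue  # collides with a previously parsed commit
        seen.add(timestamp)
        kept.append((sha, message, timestamp))
    return kept


# Made-up sample data: the first two commits share a timestamp.
commits = [
    ("a1b2c3d", "first", datetime(2012, 1, 1, 12, 0, 0)),
    ("e4f5a6b", "second, same second as first", datetime(2012, 1, 1, 12, 0, 0)),
    ("c7d8e9f", "third", datetime(2012, 1, 2, 9, 30, 0)),
]
unique = drop_timestamp_collisions(commits)
```

This silently drops the later commit, so it is only reasonable if collisions really are rare and losing a benchmark point for those commits is acceptable.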

For the same reason, you might want to use the full sha instead of the abbreviated one, since abbreviated shas can collide too.
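To see whether abbreviation is actually a problem for a given repository, one could check a list of full hashes (for example, as printed by `git log --format=%H`) for shared prefixes. This is only a sketch: the helper name `find_abbrev_collisions`, the 7-character default, and the sample hashes are made up for illustration.

```python
def find_abbrev_collisions(full_shas, length=7):
    """Return {prefix: [full shas]} for prefixes shared by several commits."""
    by_prefix = {}
    for sha in full_shas:
        by_prefix.setdefault(sha[:length], []).append(sha)
    # Keep only prefixes that map to more than one full hash.
    return {p: s for p, s in by_prefix.items() if len(s) > 1}


# Made-up 40-character hashes; the first two share a 7-character prefix.
shas = [
    "3f2a9c1" + "0" * 33,
    "3f2a9c1" + "f" * 33,
    "b41d07e" + "a" * 33,
]
collisions = find_abbrev_collisions(shas)
```

Keying on the full 40-character hash avoids the issue entirely, at the cost of longer labels.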

wesm commented

This is also somewhat related to a minor internal pandas issue that I just pushed a fix for (it should be possible to do shas[::-1] without triggering the "uniqueness" check). We will probably want to write some tests at some point to get robust handling of duplicate timestamps / git hashes.