gotec/git2net

Crashes mining any git repos on Python 3.9

Closed this issue · 10 comments

Describe the bug

I am attempting to run git2net.mine_git_repo on a variety of cloned repositories, but I get an internal crash. The SQLite database is created, but its schema contains only the _metadata table, with a single entry holding the git2net version.
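
To see what actually made it into the database after a crash like this, the tables can be listed with the standard-library sqlite3 module. This is a minimal sketch: `list_tables` is a hypothetical helper name, not part of git2net.

```python
import sqlite3


def list_tables(sqlite_db_file):
    """Return the names of all tables in an SQLite database file."""
    con = sqlite3.connect(sqlite_db_file)
    try:
        # sqlite_master is SQLite's built-in catalog of schema objects.
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]
```

Calling `list_tables("foo.db")` on the database from this report would be expected to return only the _metadata table. Note that sqlite3.connect creates an empty file if the path does not exist yet.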

git2net works as expected in Python 2.7 on the same system, leading me to believe that this is a Python 3.9 incompatibility problem and not a MacOS compatibility issue.

Screenshots

Traceback (most recent call last):
  File "/Users/foo/test.py", line 133, in <module>
    git2net.mine_git_repo("foo", "foo.db")
  File "/Users/foo/Library/Python/3.9/lib/python/site-packages/git2net/extraction.py", line 1564, in mine_git_repo
    _process_repo_parallel(git_repo_dir, sqlite_db_file, u_commits, extraction_settings)
  File "/Users/foo/Library/Python/3.9/lib/python/site-packages/git2net/extraction.py", line 1166, in _process_repo_parallel
    with multiprocessing.Pool(extraction_settings['no_of_processes'],
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/pool.py", line 212, in __init__
    self._repopulate_pool()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/pool.py", line 303, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/pool.py", line 326, in _repopulate_pool_static
    w.start()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object '_process_repo_parallel.<locals>._init'

Desktop (please complete the following information):

  • OS: MacOS 10.13
  • Python: 3.9.0
  • git2net 1.4.8
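
A note on the AttributeError in the traceback above: since Python 3.8, multiprocessing on macOS defaults to the spawn start method, which pickles the pool's initializer before handing it to each worker, while Linux defaulted to fork at the time, which does not. That would be consistent with a crash on a Mac that does not show up on Linux. The sketch below reproduces only the pickling step in isolation. All function names are hypothetical stand-ins, not git2net's code.

```python
import pickle


def make_local_init():
    # Same shape as the failing code: a function defined inside another one.
    def _init():
        pass
    return _init


def module_level_init():
    # A function defined at the top level of an importable module is pickled
    # by reference (module name plus qualified name), so spawn can send it.
    pass


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        # Older Python versions raise AttributeError for local objects,
        # newer ones raise PicklingError.
        return False
```

Here `can_pickle(make_local_init())` returns False, which is the same failure the spawn-based pool runs into. One common remedy, which may or may not be what git2net ends up doing, is to move the initializer to module level as in `module_level_init`.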
gotec commented

Hi Milo,

I just spent an hour installing Python 3.9 and trying to replicate the issue you are seeing. Unfortunately, I was not able to reproduce it on my machine (Ubuntu 20.04). Also, I currently don't have a Mac available to test it there.
Given how recent Python 3.9 is, it could very well be a Mac-specific issue, or it might already have been fixed within the last 3 days (my Python version is 3.9.0+ after installation).

Could you try and run git2net with the option no_of_processes=1? This will cause git2net to run without the multiprocessing pool and should lead to some clearer error messages.

Cheers,
Christoph

Hi there!

I'm sorry this problem hasn't been more straightforward to reproduce! I can confirm git2net works as desired when the multiprocessing pool is disabled: no crashes and a complete database. I didn't realize 3.9 was quite so new; I just grabbed the latest installer off Python.org. Since that packaged installer is still 3.9.0, it may be missing bleeding-edge updates you have.

gotec commented

Yeah, 3.9 only came out a month ago. I suggest you stick to 3.7 or 3.8 for now, as that's also the recommendation I see around the web. Most of these issues will likely be sorted out by the end of the year. Given that it works without the multiprocessing pool, there might be some issues related to multiprocessing in the current version.
In case you receive any more specific error messages at some point please feel free to drop them here so I can check them out. Otherwise I would of course also be happy to hear from you in case it starts to work so I can close this issue.

Cheers,
Christoph

Sounds good! I'll follow up with more information one way or another. I needed to update from the stock Python that shipped with MacOS and didn't look closely when I grabbed the default installer off the Python website - this certainly seems like a "new release has some bugs in it" problem. Sorry about that!

gotec commented

Nothing to be sorry about, I always love to check out the latest gadgets ;-) Looking forward to hearing from you!

xjr01 commented

I seem to have a similar issue on Python 3.8.1 on Windows 10, but a new issue occurred when I tried to run git2net with a single process. I was trying to run git2net.mine_git_repo on the GitHub repository https://github.com/ppy/osu.git, which I had manually cloned to the directory "osu". Here's what happened.
running git2net.mine_git_repo('./osu', './osu.git2net.db', no_of_processes=1)
outputs:

Found no database on provided path. Starting from scratch.
Serial:   0%|          | 5/36929 [00:15<32:29:11,  3.17s/it]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-cd86d01865a9> in <module>
----> 1 git2net.mine_git_repo('./osu', './osu.git2net.db', no_of_processes=1)

~\AppData\Local\Programs\Python\Python38\lib\site-packages\git2net\extraction.py in mine_git_repo(git_repo_dir, sqlite_db_file, commits, use_blocks, no_of_processes, chunksize, exclude, blame_C, blame_w, max_modifications, timeout, extract_text, extract_complexity, extract_merges, extract_merge_deletions, all_branches)
   1570         _process_repo_parallel(git_repo_dir, sqlite_db_file, u_commits, extraction_settings)
   1571     else:
-> 1572         _process_repo_serial(git_repo_dir, sqlite_db_file, u_commits, extraction_settings)
   1573 
   1574 

~\AppData\Local\Programs\Python\Python38\lib\site-packages\git2net\extraction.py in _process_repo_serial(git_repo_dir, sqlite_db_file, commits, extraction_settings)
   1134     for commit in tqdm(commits, desc='Serial'):
   1135         args = {'git_repo_dir': git_repo_dir, 'commit_hash': commit.hash, 'extraction_settings': extraction_settings}
-> 1136         result = _process_commit(args)
   1137 
   1138         with sqlite3.connect(sqlite_db_file) as con:

~\AppData\Local\Programs\Python\Python38\lib\site-packages\git2net\extraction.py in _process_commit(args)
   1100                             exclude_file = True
   1101                 if not exclude_file:
-> 1102                     df_edits = df_edits.append(_extract_edits(git_repo, commit, modification,
   1103                                                               args['extraction_settings']),
   1104                                             ignore_index=True, sort=True)

~\AppData\Local\Programs\Python\Python38\lib\site-packages\git2net\extraction.py in _extract_edits(git_repo, commit, modification, extraction_settings)
    651                                                   extraction_settings['blame_options'],
    652                                                   modification.new_path)
--> 653                 blame_info_commit = _parse_porcelain_blame(blame_commit)
    654 
    655     except GitCommandError:

~\AppData\Local\Programs\Python\Python38\lib\site-packages\git2net\extraction.py in _parse_porcelain_blame(blame)
    319             elif entries[0] == 'filename':
    320                 filename = entries[1]
--> 321     blame_info = pd.DataFrame(l)
    322     return blame_info
    323 

~\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
    433             )
    434         elif isinstance(data, dict):
--> 435             mgr = init_dict(data, index, columns, dtype=dtype)
    436         elif isinstance(data, ma.MaskedArray):
    437             import numpy.ma.mrecords as mrecords

~\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\internals\construction.py in init_dict(data, index, columns, dtype)
    252             arr if not is_datetime64tz_dtype(arr) else arr.copy() for arr in arrays
    253         ]
--> 254     return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
    255 
    256 

~\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\internals\construction.py in arrays_to_mgr(arrays, arr_names, index, columns, dtype)
     62     # figure out the index, if necessary
     63     if index is None:
---> 64         index = extract_index(arrays)
     65     else:
     66         index = ensure_index(index)

~\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\internals\construction.py in extract_index(data)
    363             lengths = list(set(raw_lengths))
    364             if len(lengths) > 1:
--> 365                 raise ValueError("arrays must all be same length")
    366 
    367             if have_dicts:

ValueError: arrays must all be same length

git2net version: 1.4.10
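
Judging from the frames above, pandas raises this ValueError when pd.DataFrame is handed a dict whose values have different lengths. A small pure-Python check on that dict just before the call can show which key came up short. This is a sketch only: the helper names and the example keys in the comments are hypothetical and not taken from git2net.

```python
def column_lengths(columns):
    """Map each key of a dict of lists to the length of its list."""
    return {name: len(values) for name, values in columns.items()}


def has_consistent_lengths(columns):
    """Return True if every list in the dict has the same length."""
    return len(set(column_lengths(columns).values())) <= 1


# Example with made-up keys: one list is shorter than the other,
# which is the situation pd.DataFrame rejects.
example = {"original_commit": ["abc", "def"], "filename": ["a.py"]}
lengths = column_lengths(example)       # lengths per key, for inspection
consistent = has_consistent_lengths(example)  # False for this example
```

Printing `column_lengths(...)` for the dict built by the blame parser on the failing commit would narrow down which field of the blame output is missing.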

gotec commented

Hi xjr01,

I tried to replicate your issue yesterday but used multiprocessing to save time. That way, the crawl finished without any issues on my machine. I am now running it in a single thread to see if this helps me replicate your error.

A full run on a single thread will take the entire day. That said, I am already further than your crawl and have not been able to replicate your issue.

Thus, I am afraid it's another incompatibility between package versions. Can you give me the current versions of git and PyDriller that you are using? My best guess is that this could be an incompatibility with your version of git, as older versions return a slightly different output for git blame. While I have already addressed this in the requirements, I have not tested all versions of git and therefore might have been too lenient concerning the required version.

Cheers,
Christoph

xjr01 commented

Hi Christoph,
Thanks for your reply. My git version is 2.29.0.windows.1 and pydriller version is 1.15.2. They're all quite new.
However, git2net runs smoothly (with multi-process) on my ubuntu 18.04 virtual machine, with git version 2.17.1 and pydriller version 1.15.2.
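
For comparing setups like these two, the git version can be read programmatically and reduced to a comparable tuple. This is a sketch under assumptions: it assumes `git --version` prints a line such as "git version 2.29.0.windows.1", and both function names are hypothetical.

```python
import re
import subprocess


def parse_git_version(version_output):
    """Extract (major, minor, patch) from the text printed by `git --version`."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("no version number found in: %r" % version_output)
    return tuple(int(part) for part in match.groups())


def installed_git_version():
    """Run `git --version` and parse its output (requires git on the PATH)."""
    result = subprocess.run(
        ["git", "--version"], capture_output=True, text=True, check=True
    )
    return parse_git_version(result.stdout)
```

Tuples compare element by element, so the parsed versions of the two machines can be compared directly with `<` or `>=`.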

gotec commented

Given that the pydriller versions are identical, it might have to do with either git 2.29 or Windows. I will set up a Windows VM and try to replicate the issue there.

Best,
Christoph

gotec commented

Sorry for the delayed response. Unfortunately, I was not able to replicate the issue on my Windows VM and got very busy with my dissertation afterwards.
I have now spent the last few days trying to replicate the issue on a full Windows system (no VM) with

  • python 3.9.2
  • git 2.31.1.windows.1

I was able to replicate the issue reported by @milo-trujillo which I fixed for git2net 1.5.0.
Unfortunately, I was again not able to replicate the issue reported by @xjr01.
However, I can confirm that with the settings above and git2net 1.5.0 I was able to fully mine the OSU repository (https://github.com/ppy/osu.git) in both the single and multi-threaded setup without any errors.

With this, I close this issue. Please feel free to reopen in case your issues still persist.

Cheers,
Christoph