Crashes mining any git repos on Python 3.9
Closed this issue · 10 comments
Describe the bug
I am attempting to run git2net.mine_git_repo on a variety of cloned repositories, but I get an internal crash. The SQLite database is created but has only the _metadata table in its schema, with a single entry containing the git2net version.
Git2net works as expected in Python 2.7 on the same system, leading me to believe that this is a Python 3.9 incompatibility problem and not a MacOS compatibility issue.
Screenshots
Traceback (most recent call last):
  File "/Users/foo/test.py", line 133, in <module>
    git2net.mine_git_repo("foo", "foo.db")
  File "/Users/foo/Library/Python/3.9/lib/python/site-packages/git2net/extraction.py", line 1564, in mine_git_repo
    _process_repo_parallel(git_repo_dir, sqlite_db_file, u_commits, extraction_settings)
  File "/Users/foo/Library/Python/3.9/lib/python/site-packages/git2net/extraction.py", line 1166, in _process_repo_parallel
    with multiprocessing.Pool(extraction_settings['no_of_processes'],
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/pool.py", line 212, in __init__
    self._repopulate_pool()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/pool.py", line 303, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/pool.py", line 326, in _repopulate_pool_static
    w.start()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object '_process_repo_parallel.<locals>._init'
Desktop (please complete the following information):
- OS: MacOS 10.13
- Python: 3.9.0
- git2net 1.4.8
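One possible reading of the traceback above, offered as a sketch rather than a confirmed diagnosis: the popen_spawn_posix.py frames show that the pool is starting its workers with the "spawn" start method, which has been the default on macOS since Python 3.8. Spawn pickles the pool's initializer before handing it to the worker, and pickle cannot serialize a function that is defined inside another function. The names below (make_local_initializer, module_level_init, can_pickle) are hypothetical and are not part of git2net.

```python
import pickle


def make_local_initializer():
    """Return a function defined inside another function (a "local object")."""
    def _init():
        pass
    return _init


def module_level_init():
    """A module-level function; pickle stores it by its importable name."""
    pass


def can_pickle(obj):
    """Report whether pickle can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        # Python 3.9 raises AttributeError ("Can't pickle local object ...");
        # newer versions may raise PicklingError instead.
        return False


if __name__ == "__main__":
    print("local initializer picklable:", can_pickle(make_local_initializer()))
    print("module-level initializer picklable:", can_pickle(module_level_init))
```

Defining the initializer at module level is one common way around this kind of error. Whether that is the right change for git2net itself is an assumption here.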
Hi Milo,
I just spent an hour installing Python 3.9 and trying to replicate the issue you are seeing. Unfortunately, I was not able to reproduce it on my machine (Ubuntu 20.04). Also, I currently don't have a Mac available to test it there.
Given how recent Python 3.9 is, it could very well be that it's either a Mac-specific issue or that it has already been fixed within the last 3 days (my Python version is 3.9.0+ after installation).
Could you try running git2net with the option no_of_processes=1? This will cause git2net to run without the multiprocessing pool and should lead to some clearer error messages.
Cheers,
Christoph
Hi there!
I'm sorry this problem hasn't been more straightforward to reproduce! I can confirm git2net works as desired when the multiprocessing pool is disabled, no crashes and a complete database. I didn't realize 3.9 was quite so new, I just grabbed the latest installer off Python.org. Since that packaged installer is still 3.9.0 it may be missing bleeding-edge updates you have.
Yeah, 3.9 only came out a month ago. I suggest you stick to 3.7 or 3.8 for now, as that's also the recommendation I see around the web. By the end of the year most of these issues will probably be sorted out. Given that it works without the multiprocessing pool, the problem is likely related to multiprocessing in the current version.
In case you receive any more specific error messages at some point please feel free to drop them here so I can check them out. Otherwise I would of course also be happy to hear from you in case it starts to work so I can close this issue.
Cheers,
Christoph
Sounds good! I'll follow up with more information one way or another. I needed to update from the stock Python that shipped with MacOS and didn't look closely when I grabbed the default installer off the Python website - this certainly seems like a "new release has some bugs in it" problem. Sorry about that!
Nothing to be sorry about, I always love to check out the latest gadgets ;-) Looking forward to hearing from you!
I seem to have a similar issue on Python 3.8.1 on Windows 10, but a new issue occurred when I tried to run git2net with a single process. I was trying to run git2net.mine_git_repo on a GitHub repository, https://github.com/ppy/osu.git, which I had manually cloned to the directory "osu". Here's what happened.
Running git2net.mine_git_repo('./osu', './osu.git2net.db', no_of_processes=1) outputs:
Found no database on provided path. Starting from scratch.
Serial: 0%| | 5/36929 [00:15<32:29:11, 3.17s/it]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-cd86d01865a9> in <module>
----> 1 git2net.mine_git_repo('./osu', './osu.git2net.db', no_of_processes=1)
~\AppData\Local\Programs\Python\Python38\lib\site-packages\git2net\extraction.py in mine_git_repo(git_repo_dir, sqlite_db_file, commits, use_blocks, no_of_processes, chunksize, exclude, blame_C, blame_w, max_modifications, timeout, extract_text, extract_complexity, extract_merges, extract_merge_deletions, all_branches)
1570 _process_repo_parallel(git_repo_dir, sqlite_db_file, u_commits, extraction_settings)
1571 else:
-> 1572 _process_repo_serial(git_repo_dir, sqlite_db_file, u_commits, extraction_settings)
1573
1574
~\AppData\Local\Programs\Python\Python38\lib\site-packages\git2net\extraction.py in _process_repo_serial(git_repo_dir, sqlite_db_file, commits, extraction_settings)
1134 for commit in tqdm(commits, desc='Serial'):
1135 args = {'git_repo_dir': git_repo_dir, 'commit_hash': commit.hash, 'extraction_settings': extraction_settings}
-> 1136 result = _process_commit(args)
1137
1138 with sqlite3.connect(sqlite_db_file) as con:
~\AppData\Local\Programs\Python\Python38\lib\site-packages\git2net\extraction.py in _process_commit(args)
1100 exclude_file = True
1101 if not exclude_file:
-> 1102 df_edits = df_edits.append(_extract_edits(git_repo, commit, modification,
1103 args['extraction_settings']),
1104 ignore_index=True, sort=True)
~\AppData\Local\Programs\Python\Python38\lib\site-packages\git2net\extraction.py in _extract_edits(git_repo, commit, modification, extraction_settings)
651 extraction_settings['blame_options'],
652 modification.new_path)
--> 653 blame_info_commit = _parse_porcelain_blame(blame_commit)
654
655 except GitCommandError:
~\AppData\Local\Programs\Python\Python38\lib\site-packages\git2net\extraction.py in _parse_porcelain_blame(blame)
319 elif entries[0] == 'filename':
320 filename = entries[1]
--> 321 blame_info = pd.DataFrame(l)
322 return blame_info
323
~\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
433 )
434 elif isinstance(data, dict):
--> 435 mgr = init_dict(data, index, columns, dtype=dtype)
436 elif isinstance(data, ma.MaskedArray):
437 import numpy.ma.mrecords as mrecords
~\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\internals\construction.py in init_dict(data, index, columns, dtype)
252 arr if not is_datetime64tz_dtype(arr) else arr.copy() for arr in arrays
253 ]
--> 254 return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
255
256
~\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\internals\construction.py in arrays_to_mgr(arrays, arr_names, index, columns, dtype)
62 # figure out the index, if necessary
63 if index is None:
---> 64 index = extract_index(arrays)
65 else:
66 index = ensure_index(index)
~\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\internals\construction.py in extract_index(data)
363 lengths = list(set(raw_lengths))
364 if len(lengths) > 1:
--> 365 raise ValueError("arrays must all be same length")
366
367 if have_dicts:
ValueError: arrays must all be same length
git2net version: 1.4.10
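The traceback shows pd.DataFrame being built from a dict (the isinstance(data, dict) branch) whose values do not all have the same length. Below is a minimal, pandas-free sketch for spotting which keys disagree before the DataFrame is constructed. The helper name ragged_columns and the sample keys and values are hypothetical and are not taken from git2net.

```python
from collections import Counter


def ragged_columns(columns):
    """Return {key: length} for columns whose length differs from the most common length."""
    lengths = {key: len(values) for key, values in columns.items()}
    if not lengths:
        return {}
    # Treat the most frequent length as the expected one.
    expected, _ = Counter(lengths.values()).most_common(1)[0]
    return {key: n for key, n in lengths.items() if n != expected}


# Hypothetical blame data in which one field was parsed one time too few.
sample = {
    "original_commit_hash": ["a1", "b2", "c3"],
    "original_line_no": [1, 2, 3],
    "original_file_path": ["x.cs", "y.cs"],
}

if __name__ == "__main__":
    print(ragged_columns(sample))
```

Logging the result of such a check just before the failing pd.DataFrame call, together with the commit hash being processed, would narrow down which blame field is missing entries.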
Hi xjr01,
I tried to replicate your issue yesterday but used multiprocessing to save time. That way, the crawl finished without any issues on my machine. I am now running it in a single thread to see if this helps me replicate your error.
A full run on a single thread will take the entire day. That said, I am already further than your crawl and have not been able to replicate your issue.
Thus, I am afraid it's another incompatibility between package versions. Can you give me the current versions of git and PyDriller that you are using? My best guess is that this could be an incompatibility with your version of git, as older versions return a slightly different output for git blame. While I have already addressed this in the requirements, I have not tested all versions of git and therefore might have been too lenient concerning the required version.
Cheers,
Christoph
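For collecting the requested version information in one go, something like the following could be used. It is only a sketch: it assumes git is on the PATH and that the packages are installed under the distribution names "git2net" and "PyDriller", and the function name report_versions is made up for illustration.

```python
import subprocess
from importlib import metadata


def report_versions(packages=("git2net", "PyDriller")):
    """Collect the git version string and installed package versions (None if unavailable)."""
    versions = {}
    try:
        completed = subprocess.run(
            ["git", "--version"], capture_output=True, text=True, check=True
        )
        versions["git"] = completed.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # git is missing from the PATH or exited with an error.
        versions["git"] = None
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version}")
```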
Hi Christoph,
Thanks for your reply. My git version is 2.29.0.windows.1 and my pydriller version is 1.15.2. They're all quite new.
However, git2net runs smoothly (with multi-process) on my Ubuntu 18.04 virtual machine, with git version 2.17.1 and pydriller version 1.15.2.
Given that the pydriller versions are identical, it might have to do with either git 2.29 or Windows. I will set up a Windows VM and try to replicate the issue there.
Best,
Christoph
Sorry for the delayed response. Unfortunately, I was not able to replicate the issue on my Windows VM and got very busy with my dissertation afterwards.
I have now spent the last few days trying to replicate the issue on a full Windows system (no VM) with
- python 3.9.2
- git 2.31.1.windows.1
I was able to replicate the issue reported by @milo-trujillo, which I fixed for git2net 1.5.0.
Unfortunately, I was again not able to replicate the issue reported by @xjr01.
However, I can confirm that with the settings above and git2net 1.5.0 I was able to fully mine the OSU repository (https://github.com/ppy/osu.git) in both the single and multi-threaded setup without any errors.
With this, I close this issue. Please feel free to reopen in case your issues still persist.
Cheers,
Christoph