Can't iterate over files in angularjs git-repo
Knniff opened this issue · 4 comments
Describe the bug
While trying to iterate over all files in the angularjs git repository (the code used is shown below), Python crashes with this stack trace:
Traceback (most recent call last):
  File "/home/test/project/main.py", line 12, in <module>
    for file in commit.modified_files:
  File "/home/test/project/.venv/lib/python3.10/site-packages/pydriller/domain/commit.py", line 716, in modified_files
    return self._parse_diff(diff_index)
  File "/home/test/project/.venv/lib/python3.10/site-packages/pydriller/domain/commit.py", line 728, in _parse_diff
    "content": self._get_undecoded_content(diff.b_blob),
  File "/home/test/project/.venv/lib/python3.10/site-packages/pydriller/domain/commit.py", line 752, in _get_undecoded_content
    return blob.data_stream.read() if blob is not None else None
  File "/home/test/project/.venv/lib/python3.10/site-packages/git/objects/base.py", line 142, in data_stream
    return self.repo.odb.stream(self.binsha)
  File "/home/test/project/.venv/lib/python3.10/site-packages/git/db.py", line 45, in stream
    hexsha, typename, size, stream = self._git.stream_object_data(bin_to_hex(binsha))
  File "/home/test/project/.venv/lib/python3.10/site-packages/git/cmd.py", line 1400, in stream_object_data
    hexsha, typename, size = self.__get_object_header(cmd, ref)
  File "/home/test/project/.venv/lib/python3.10/site-packages/git/cmd.py", line 1370, in __get_object_header
    return self._parse_object_header(cmd.stdout.readline())
  File "/home/test/project/.venv/lib/python3.10/site-packages/git/cmd.py", line 1331, in _parse_object_header
    raise ValueError("SHA %s could not be resolved, git returned: %r" % (tokens[0], header_line.strip()))
ValueError: SHA b'4e1ebfdefda333354bbda71e172daa5db4808616' could not be resolved, git returned: b'4e1ebfdefda333354bbda71e172daa5db4808616 missing'
This is probably not a bug in PyDriller itself, but because of my limited knowledge of the underlying libraries I couldn't reproduce the error any other way. I tried to reproduce it with GitPython alone using the following code, but got no error:
from git import Repo

repo = Repo('./angular')
commits = repo.iter_commits()
for commit in commits:
    for file in commit.tree.blobs:
        print(file.name)
To Reproduce
Clone https://github.com/angular/angular and try to iterate over all files with:
import pydriller

for commit in pydriller.Repository("./angular").traverse_commits():
    for file in commit.modified_files:
        print(file.filename, file.change_type)
OS Version:
Linux: Ubuntu 22.04 with Python 3.10.6 and PyDriller 2.4.1
Hi @Knniff! If it can't be reproduced with GitPython, it means it's something related to PyDriller. Thanks for flagging, I'll look into it :)
The problem can be reproduced in GitPython as well. The SHA belongs to a sub-project (a submodule entry, file mode 160000), so it refers to a commit in another repository rather than to a blob in this one:
diff --git a/tools/js2dart b/tools/js2dart
new file mode 160000
index 0000000000000000000000000000000000000000..4e1ebfdefda333354bbda71e172daa5db4808616
--- /dev/null
+++ b/tools/js2dart
@@ -0,0 +1 @@
+Subproject commit 4e1ebfdefda333354bbda71e172daa5db4808616
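One way to recognize such entries before trying to read their content is to check the file mode, as in the rough sketch below. The helper names `is_gitlink` and `skip_gitlinks` are hypothetical, and the sketch assumes each diff entry exposes integer `a_mode` and `b_mode` attributes (as GitPython's diff objects appear to); this is not how PyDriller currently handles it.

```python
GITLINK_MODE = 0o160000    # mode Git uses for submodule ("Subproject commit") entries
FILE_TYPE_MASK = 0o170000  # upper bits of a mode encode the entry type


def is_gitlink(mode):
    """Return True if an integer Git file mode denotes a submodule entry."""
    return mode is not None and (mode & FILE_TYPE_MASK) == GITLINK_MODE


def skip_gitlinks(diffs):
    """Yield only diff entries where neither side is a submodule pointer.

    Assumption: each entry has integer (or None) `a_mode` / `b_mode`
    attributes; adapt the attribute names for other diff objects.
    """
    for diff in diffs:
        old_mode = getattr(diff, "a_mode", None)
        new_mode = getattr(diff, "b_mode", None)
        if is_gitlink(old_mode) or is_gitlink(new_mode):
            continue  # the SHA names a commit in another repository, not a blob
        yield diff
```

Entries filtered out this way have no blob in the local object database, which is why reading them fails with the "missing" message.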
What I generally do in these cases is run git submodule update --init. Unfortunately, it seems that angular stopped using submodules, so nothing happens.
The only thing you can do at this point is to add a try/except.
However, you made me notice something: PyDriller probably shouldn't raise an exception in this case.
Would a try/except mean that the rest of the repository still gets processed after the error happens?
Yep that will do 👍
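For anyone landing here, a minimal sketch of that workaround follows. The helper name `iter_modified_files_safely` and the `skipped` list are hypothetical, and it assumes the failure surfaces as the `ValueError` shown in the traceback above; `commit.hash`, `commit.modified_files` and `traverse_commits()` are used as in the reproduction script.

```python
def iter_modified_files_safely(commits, skipped):
    """Yield (commit, file) pairs, skipping commits whose diff cannot be read.

    `skipped` collects (commit hash, error message) tuples so the
    failures can be inspected after the traversal.
    """
    for commit in commits:
        try:
            # The lookup of the missing SHA happens when this property is accessed.
            files = commit.modified_files
        except ValueError as exc:
            skipped.append((commit.hash, str(exc)))
            continue  # move on to the next commit instead of crashing
        for file in files:
            yield commit, file


def print_modified_files(repo_path):
    """Usage sketch; pydriller is imported lazily so the helper above stays standalone."""
    import pydriller

    skipped = []
    commits = pydriller.Repository(repo_path).traverse_commits()
    for _commit, file in iter_modified_files_safely(commits, skipped):
        print(file.filename, file.change_type)
    print(f"skipped {len(skipped)} commit(s)")
```

Because the try/except wraps only the access to `modified_files`, a failing commit is skipped as a whole and the traversal continues with the next one.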