Memory leak over large repos
andreagurioli1995 opened this issue · 8 comments
Describe the bug
I found that the process, when retrieving the modified files of a commit, starts requiring a massive amount of memory. To be specific, after profiling with memray, the offending section is "return diff.data_stream.read().decode("utf-8", "ignore")", i.e. the decoding of the diffs.
To Reproduce
Here is an easy snippet to reproduce the issue.
from pydriller import Repository

url = "https://github.com/mozilla/addons-server"

for commit in Repository(path_to_repo=url,
                         only_modifications_with_file_types=[".py"],
                         num_workers=1).traverse_commits():
    a = commit.modified_files
OS Version:
Linux
Hey @andreagurioli1995!
Interesting... would you mind sharing more data on this?
How big is the repo? How much memory is used?
I remember I had a problem in the past where I was keeping an instance of the commit active, so I was putting everything in memory. It shouldn't be like this anymore, so I am not sure where to start 😄
The repo used is this one: https://github.com/mozilla/addons-server (55,161 commits as of 09/01/2022, 19:05 GMT+1), with the settings shown above! The version of PyDriller used is 2.1.
During memory profiling with https://github.com/bloomberg/memray I measured an overall usage of 62.4 GB. I have attached a screenshot of the profiling run, with the RAM usage highlighted in red at the salient points.
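For anyone reproducing the measurement, here is a minimal sketch of such a profiling run using memray's Tracker context manager; the output file name is arbitrary, and "memray flamegraph pydriller_run.bin" afterwards renders the report:

from memray import Tracker
from pydriller import Repository

url = "https://github.com/mozilla/addons-server"

# Record every allocation made while traversing the repository.
with Tracker("pydriller_run.bin"):
    for commit in Repository(path_to_repo=url,
                             only_modifications_with_file_types=[".py"],
                             num_workers=1).traverse_commits():
        files = commit.modified_files  # triggers the expensive diff decoding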
I also had the same problem. My setup: pydriller==2.1, Python 3.9.9, Linux 5.10.0-60.18.0.50.h322_1.hce2.x86_64.
I have tried manipulating the commit object in a separate process and then freeing it, but memory usage still grows and grows. My code:
import json
from multiprocessing import Process

from pydriller import Repository

# copy2dict, get_meta, result_file_path, repo_path, branch, old_commit
# and new_commit are defined elsewhere in this project.

def get_commit_data(commit, meta: dict, file_num):
    result: dict = {}
    copy2dict(commit, result)  # DFS copy of all commit attributes
    target_file = f"{result_file_path}/result_{file_num}.json"
    meta['rawData'] = result
    data = [meta]
    with open(target_file, mode='w') as f:
        f.write(json.dumps(data))
    print(f"written commit:{commit.hash} to file:{target_file}")
    # explicitly drop local references (the attempt to free memory)
    result = None
    meta = None
    data = None
    commit = None

def main():
    meta = get_meta()
    file_num = 0
    for commit in Repository(path_to_repo=repo_path, only_in_branch=branch,
                             from_commit=old_commit,
                             to_commit=new_commit).traverse_commits():
        # Handle each commit in a short-lived child process so its
        # memory is released when the process exits.
        p = Process(target=get_commit_data, args=(commit, meta.copy(), file_num))
        p.start()
        p.join()
        p.close()
        commit = None
        file_num += 1

if __name__ == '__main__':
    main()
In the copy2dict method, I copied all the attributes of the commit into the result object (including nested attributes, via DFS).
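copy2dict itself is not shown in the thread; a hypothetical sketch of what such a DFS attribute copy might look like, where the depth limit and type handling are assumptions:

def copy2dict(obj, result: dict, depth: int = 0, max_depth: int = 5):
    # Hypothetical reconstruction: recursively copy public, non-callable
    # attributes of obj into result, stringifying collections and
    # recursing into nested objects up to max_depth.
    if depth > max_depth:
        return
    for name in dir(obj):
        if name.startswith("_"):
            continue
        try:
            value = getattr(obj, name)
        except Exception:
            continue  # some properties can raise on unusual commits
        if callable(value):
            continue
        if isinstance(value, (str, int, float, bool, type(None))):
            result[name] = value
        elif isinstance(value, (list, tuple, set)):
            result[name] = [str(item) for item in value]
        else:
            nested = {}
            copy2dict(value, nested, depth + 1, max_depth)
            result[name] = nested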
So I looked a bit into this, and there is definitely something wrong with the commit object and modified files.
For each commit I cache the list of modified files (https://github.com/ishepard/pydriller/blob/master/pydriller/domain/commit.py#L509). I do it so that consecutive accesses to the modified files don't need to recompute the diff every time, since that is very expensive.
However, keeping that list leads to huge memory consumption. I don't know why, since the commit object should be deleted once we move to a new commit; there is no reference back to the object. Apparently I'm wrong 😄
I tested it by deleting that line, and now there is little to no memory consumption.
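For context, the caching referred to above is a memoized-property pattern along these lines (a simplified sketch, not pydriller's exact code; _compute_diff stands in for the real diff logic):

class Commit:
    def __init__(self):
        self._modified_files = None  # cache for the expensive diff

    @property
    def modified_files(self):
        # Compute the diff once and reuse it on later accesses. As long
        # as the Commit object stays referenced, the cached list (and
        # every decoded diff inside it) stays in memory too.
        if self._modified_files is None:
            self._modified_files = self._compute_diff()  # expensive git call
        return self._modified_files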
Ok, I know why; not sure how I didn't see it before. When I added multithreading support, I transformed my commit generator into a list. Hence, all commits stay referenced, even after being analyzed.
I need to change that. It looks a bit complicated; I'll need to put some work into it, and my Python skills are a bit rusty these days.
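The difference in a minimal sketch (illustrative only, not pydriller's actual internals; load_commits is a hypothetical helper that yields Commit objects one at a time):

def traverse_eager(repo_path):
    commits = list(load_commits(repo_path))  # materializes every commit
    for commit in commits:
        yield commit
    # `commits` keeps a reference to every Commit (and its cached
    # modified_files) until the whole traversal is garbage-collected.

def traverse_lazy(repo_path):
    for commit in load_commits(repo_path):
        yield commit
        # Once the loop advances, the previous Commit becomes
        # unreferenced and its cached diffs can be freed.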
Problem should be solved now; I will release a new version soon 😄 Feel free to test it on master.
Tested with the same code against the master branch, and now it works perfectly, thanks!