Documentation for reconstructing corpus
VHellendoorn opened this issue · 4 comments
Hi,
Upon trying to recreate your dataset, I found both the documentation and code for recreating the corpus from scratch to have some issues, particularly related to the create_repos file.
Firstly, passing --dbFile=data/cloned_repos.dat as described throws an error, since it causes scraper.py to look for a local directory "data", which does not exist. I suppose this was meant to be --dbFile=../data/cloned_repos.dat
Having fixed that, the code fails on pickle.load(infile) (line 60) with the message "TypeError: a bytes-like object is required, not 'str'", which suggests that cloned_repos.dat is being opened as a plain-text file when it is actually binary. However, changing line 59 to ... = open(dbFile, 'rb') (which opens the file in binary mode) results in a UnicodeDecodeError, perhaps because the file was pickled under Python 2 and that format is incompatible with Python 3.
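For reference, Python 3 can often read Python 2 pickles when an encoding is passed to pickle.load. A minimal sketch, assuming cloned_repos.dat is a plain pickled container (dicts/lists/strings) rather than instances of custom classes:

```python
import pickle

# Assumption: cloned_repos.dat holds plain pickled data (dicts/lists/strings).
with open("../data/cloned_repos.dat", "rb") as infile:
    # encoding="latin1" decodes Python 2 str bytes as code points 0-255,
    # which avoids the UnicodeDecodeError; encoding="bytes" would instead
    # leave them as bytes objects.
    repos = pickle.load(infile, encoding="latin1")

print(type(repos), len(repos))
```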
Please let me know if you can reproduce this problem and, if possible, upload a new cloned_repos.dat file which is compatible with Python 3 (provided that is the issue). I would love to work with your dataset :)
Kind regards,
Vincent
Hi Vincent,
Thanks for describing the issues you've had with recreating our corpus. I'll look into these and get back to you as soon as possible!
Hi Vincent,
You are correct, the cloned_repos.dat file was serialized with Python 2.7. I've now updated it to a version compatible with Python 3. I've also fixed the bug in scraper.py, which, as you noted, needed 'rb' in the file open. Finally, I've updated the documentation to indicate that the full path to dbfile is required: os.chdir(outputDirectory) is called before create_repos, so the dbfile path ends up being resolved relative to outputDirectory.
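For anyone else who hits this, an alternative to requiring a full path is to resolve it before changing directory. A minimal sketch of that pattern; the helper name and arguments here are hypothetical, not the actual scraper.py code:

```python
import os
import pickle

def load_repo_db(db_file, output_directory):
    """Hypothetical helper: pin db_file to an absolute path *before*
    os.chdir, so a relative --dbFile argument still resolves correctly."""
    db_file = os.path.abspath(db_file)  # resolved against the original cwd
    os.chdir(output_directory)          # later relative paths resolve here
    with open(db_file, "rb") as infile:
        return pickle.load(infile)
```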
Thank you again for pointing out this issue. I hope it is now resolved and that you are able to recreate our dataset :) Please do let me know if you have any further issues.
Regards,
Avishkar
Thanks a lot, looks like it works!
Hi again,
Thank you for your help previously. Unfortunately, I am encountering several issues going forward and currently do not have time to examine them closely. If you have time in the future, please consider trying to reconstruct the corpus from scratch and improving the documentation where needed. Among others:
- I am getting an exception in the scraper for certain projects where it can no longer find the commit (tree) corresponding to certain SHAs,
- I end up with both more projects and fewer files than documented in the paper (perhaps because newer versions of said projects are being checked out?)
- The normalisation.py script fails because it can't find any of the files, presumably since it doesn't store the absolute path to them on line 22 (see the sketch below).
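In case it helps when you get back to this, the usual fix for that last symptom is to record absolute paths at collection time. A minimal sketch, assuming the script gathers files by walking a directory tree (the function name is hypothetical, not the actual normalisation.py code):

```python
import os

def collect_source_files(root_dir, extension=".py"):
    """Hypothetical helper: return absolute paths so later processing
    is independent of the current working directory."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if name.endswith(extension):
                paths.append(os.path.abspath(os.path.join(dirpath, name)))
    return paths
```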
Sorry to bother you with this again; if in the future you have time to improve this I would love to recreate your corpus for an upcoming paper, but please do not feel obligated to hurry. If you could share the actual processed corpus with me, that too would be excellent but I understand if you would prefer not to :)
Kind regards,
Vincent