Notice: the tool and process is tailored to GitHub, but it should be straightforward to adapt it for other forge technologies
get-repos-info.py take a list of github repository urls and output relevant information in the form of a semi-colon separated lines containing the following fields canonicalurl: canonical url of the repository on GitHub number of commits: number of commits (provides estimate of size of repository) lastcommitdate: date of last commit on GitHub swhlastvisit: date of last visit from SWH visitstatus: status of the last visit as returned by SWH status: summary of repository status, can be one of the following UPTODATE, TOUPDATE, NOTINSWH, NOTINGITHUB, NOWPRIVATEONGITHUB isfork: ISFORK if this repository is an explicit fork forkurl: url of the repository that has been forked to create this one sourceurl: url of the original repository at the root of the fork chain stars: number of stars on github
From issue 5823 we get 28198 repository urls that failed git ingestion, in the file
fulldata.txt
; we want to get the relevant nonfork repositories to ingest.
We need bearer tokens for the GH API and the SWH API, we suppose they are stored
in the files github-token
and swh-api-token
respectively.
python3 get-repos-info.py -t `cat github-token` -a `cat swh-api-token` fulldata.txt > fulldata.data 2>fulldata.log
Extract the nonfork repositories that are still in GitHub, and sort them by number of stars
grep -v ISFORK fulldata.data | sed 's/;.*;/;/' | sort -t \; -k 2 -n -r | grep -v NOTINGITHUB > nonfork.list
Extract original repository from fork projects
grep ISFORK fulldata.data | sed 's/.*ISFORK;//' | sed 's/[^;]*;//' | sed 's/;.*//' | sort -u > forked.txt
Process the forked list, extract repos still to be archived, sort them by number of stars
python3 get-repos-info.py -t `cat github-token` -a `cat swh-api-token` forked.txt > forked.data 2>forked.log
egrep -v "NOTING|UPTODATE|NOWPRIVATE|TOUPDATE" forked.data | sed 's/;.*;/;/' | sort -t \; -k 2 -n -r
Merge with the nonfork.list
cat forked.list nonfork.list | sort -u > priority.list
Remove number of stars
sed 's/;.*//' priority.list > priority.list.github
Use priority.list.github
to feed the loaders