LLNL/scraper

Support scraping Subversion

leebrian opened this issue · 6 comments

Being able to scrape subversion projects would be helpful and is not yet supported. It's a pretty low priority for my agency, but you requested we add issues for examples of repos not yet supported.

Hmm. Yeah, pretty low for me too. I'm trying to think how we might do this for arbitrary SVN repos (where we get the metadata itself).

Are there specific SVN hosting tools that we would want or need to target?

I wonder if we can get a list of all of the repositoryURLs from code.gov (cc/ @RicardoAReyes) to try to find the hosting platforms to target..? Guess that is more justification for #29 ;)

I think we have about 100-200 projects or so, but I haven't counted yet since no one is really asking internally, and since they aren't scraped properly to determine whether they are excludable, it's a vicious cycle: people can't find them.

I'm not sure what hosting tools to target. I was reading through the SVN book's API chapter, and it seems like we could crawl with the svn client: check out every directory and then go through it to find history, comments, and maybe enough metadata. I haven't looked at it since then because it seemed like a decent amount of boring work digging into SVN history files and such.
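For illustration, here's a rough sketch in Python (since that's what scraper is written in) of what that crawl could look like; the helper name and the choice of `svn log --xml` are mine, not anything scraper does today:

```python
# Rough sketch of the crawl idea: ask the svn client for a repository's
# commit history and collect the pieces a scraper could use as metadata.
import subprocess
import xml.etree.ElementTree as ET

def svn_history(url, limit=100):
    """Return (revision, author, date, message) tuples from 'svn log'."""
    xml_out = subprocess.run(
        ["svn", "log", "--xml", "--non-interactive", "-l", str(limit), url],
        capture_output=True, check=True,
    ).stdout
    return [
        (
            entry.get("revision"),
            entry.findtext("author", default=""),
            entry.findtext("date", default=""),
            entry.findtext("msg", default=""),
        )
        for entry in ET.fromstring(xml_out).iter("logentry")
    ]
```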

I tried checking all the repos, but https://api.code.gov/repos?size=10000 only returned 1000 of the reported 6565 repos. None of those thousand was Subversion; they were all vcs=git.

Hi, found the help-wanted tag for this issue on code.gov.

You can see more than 1000 repos at once by passing '&from=[start]' to the code.gov API.
I used the API's Node.js module to check vcs= for all of them: 2496 don't have a vcs field, 200 have an empty string, 1 has 'zip', and the 3863 others are all some form of 'git'. That's all of them.
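In case anyone wants to repeat this in Python, a rough equivalent of that check is below; the 'repos' field name, the per-repo 'vcs' field, and the api_key parameter are assumptions about the code.gov API response, so adjust to what the API actually returns:

```python
# Sketch: page through https://api.code.gov/repos with size/from and tally
# the "vcs" field. Assumptions: the response has a "repos" list, each repo
# has an optional "vcs" field, and the API wants an api_key parameter.
import collections
import requests

API = "https://api.code.gov/repos"
API_KEY = "YOUR_API_KEY"  # hypothetical placeholder
PAGE_SIZE = 1000

def tally_vcs():
    counts = collections.Counter()
    start = 0
    while True:
        resp = requests.get(API, params={"api_key": API_KEY,
                                         "size": PAGE_SIZE, "from": start})
        resp.raise_for_status()
        repos = resp.json().get("repos", [])
        if not repos:
            break
        for repo in repos:
            counts[repo.get("vcs", "<missing>")] += 1
        start += PAGE_SIZE
    return counts

if __name__ == "__main__":
    for vcs, n in tally_vcs().most_common():
        print(f"{n:6d}  {vcs!r}")
```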

Here's a list of all the repository urls by repository ID: repository_ids_and_urls.txt

I tried doing a simple 'svn co ${url}' on each repositoryURL (so no authentication performed). It only worked on two projects: https://code.gov/projects/doe_office_scientific_technical_information_osti_1_kepler and https://code.gov/projects/doe_office_scientific_technical_information_osti_1_zeptoos . What did I miss?
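(For context, the test was essentially the sketch below; the tab-separated layout of repository_ids_and_urls.txt and the --depth=empty probe are simplifications on my part.)

```python
# Sketch of the unauthenticated "svn co" test: try an anonymous checkout
# of each repositoryURL and report which ones succeed. Assumes one
# "id<TAB>url" pair per line; --depth=empty just probes accessibility
# without pulling the whole tree.
import subprocess
import tempfile

def try_checkout(url, timeout=120):
    """Return True if an anonymous 'svn checkout' of url succeeds."""
    with tempfile.TemporaryDirectory() as tmp:
        try:
            result = subprocess.run(
                ["svn", "checkout", "--non-interactive", "--depth=empty", url, tmp],
                capture_output=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
    return result.returncode == 0

with open("repository_ids_and_urls.txt") as fh:
    for line in fh:
        repo_id, _, url = line.strip().partition("\t")
        if url:
            print(repo_id, url, "OK" if try_checkout(url) else "FAILED")
```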

What kind of metadata can you extract from those checkouts? Are you able to populate the code.json elements?

They're just source code repositories, containing branch and tag names, source files, and detailed change history, and that's it. Human intervention would be needed for many fields, but you could auto-populate things like vcs=svn and maybe offer guesses for things like releases, license, e-mail, or description based on repo content. I'm guessing that even a barebones code.json file is helpful here.
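Something like the sketch below could produce that barebones entry: vcs and repositoryURL come straight from svn info, while description is only a guess taken from the latest commit message and would need human review.

```python
# Sketch of auto-populating a minimal code.json release from an SVN URL.
# vcs and repositoryURL are solid; "description" is just a guess from the
# most recent commit message and needs a human to review it.
import subprocess
import xml.etree.ElementTree as ET

def _svn_xml(*args):
    return ET.fromstring(subprocess.run(
        ["svn", *args, "--xml", "--non-interactive"],
        capture_output=True, check=True,
    ).stdout)

def barebones_entry(url):
    info = _svn_xml("info", url)
    last_log = _svn_xml("log", "-l", "1", url)
    return {
        "vcs": "svn",
        "repositoryURL": info.findtext(".//repository/root", default=url),
        "description": last_log.findtext(".//msg", default="").strip(),
    }
```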