LLNL/scraper

Reading metadata from additional file

StephenQuirolgico opened this issue · 5 comments

@IanLee1521 - Can't recall if this was already requested elsewhere, but is it possible to enhance the scraper to also read metadata from an additional file in a repo? The rationale would be to allow developers to have more control over the metadata that is provided, and to provide metadata that may not be scraped by the scraper.

I think it would be helpful to read a code.json file in the root of the repo. During the GSA calls, at least two programs said they did something similar. I would like to bring this up on a GSA call and have them put out some guidance on code.gov to help shape the implementation here.

The local process we use on top of scraper is to read a code.json and use its values to override the project settings in the combined agency code.json. It's a bit of a hack, but it lets me use the exact same schema. We do this on the openCDC repo.
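A rough sketch of what that local override step might look like, in case it helps the discussion. The `releases` key, matching entries by `name`, and the helper names `load_repo_metadata` and `override_releases` are assumptions for illustration, not the actual openCDC tooling or scraper code.

```python
import json
from pathlib import Path


def load_repo_metadata(repo_dir, filename="code.json"):
    """Return the parsed metadata file from a repo checkout, or None if absent."""
    path = Path(repo_dir) / filename
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def override_releases(inventory, repo_releases):
    """Return a copy of the combined inventory with matching entries overridden.

    Assumption: entries are matched by their "name" field, and any top-level
    field present in the repo-provided entry replaces the scraped one.
    """
    by_name = {release["name"]: release for release in repo_releases}
    merged = dict(inventory)
    merged["releases"] = [
        {**release, **by_name.get(release.get("name"), {})}
        for release in inventory.get("releases", [])
    ]
    return merged


# Hypothetical data: one scraped entry is overridden, the other is left alone.
inventory = {
    "releases": [
        {"name": "example-repo", "tags": ["github"], "laborHours": 0},
        {"name": "other-repo", "tags": ["github"]},
    ]
}
repo_file = {
    "releases": [{"name": "example-repo", "tags": ["github", "cdc"], "laborHours": 40}]
}
updated = override_releases(inventory, repo_file["releases"])
```

Here the override is shallow: a field supplied in the repo file replaces the scraped field wholesale, and entries without a matching name pass through unchanged.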

Certainly doable. I believe this was last on @jcastle's plate, as there was to be a discussion in the bi-weekly calls (or other spin-off calls) to figure out the best way to implement this (e.g. what to name the file).

Let's add this to the metadata brainstorm. Will send out an invite for that discussion to begin next week.

I will wait for the official answer from @jcastle / Amin but I propose that we name the file .code_gov.json and that it should have the same format as the “repository” object in the metadata schema (currently called “release”).

If the file uses that format, any fields in it that match what comes from the API will replace the API values. Example from gsa.gov/code.json, where all the values are explicitly in the file:

{
  "contact": {
    "URL": "https://github.com/18F",
    "email": "18f@gsa.gov"
  },
  "date": {
    "created": "2013-07-17",
    "lastModified": "2019-05-02"
  },
  "description": "A hosted, shared-service that provides an API key, analytics, and proxy solution for government web services.",
  "downloadURL": "https://api.github.com/repos/18F/api.data.gov/downloads",
  "homepageURL": "https://github.com/18F/api.data.gov",
  "laborHours": 1216,
  "languages": [
    "HTML",
    "Ruby",
    "CSS",
    "JavaScript"
  ],
  "name": "api.data.gov",
  "organization": "18F",
  "permissions": {
    "licenses": [
      {
        "name": "NOASSERTION"
      }
    ],
    "usageType": "openSource"
  },
  "repositoryURL": "https://github.com/18F/api.data.gov",
  "status": "Development",
  "tags": [
    "github"
  ],
  "vcs": "git"
}

Example where only a couple fields (tags and contact:email) are overridden:

{
  "contact": {
    "email": "jcastle@gsa.gov"
  },
  "tags": [
    "github",
    "code_gov"
  ]
}

What do you all think of that?
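To make the proposed behavior concrete, here is a minimal sketch of how a partial override file could be applied on top of the scraped values. It assumes nested objects such as `contact` are merged key by key (so overriding `contact:email` keeps the scraped `contact:URL`) and that lists such as `tags` are replaced whole; the function name `apply_overrides` is hypothetical and nothing here is existing scraper code.

```python
import copy
import json


def apply_overrides(scraped, overrides):
    """Return a copy of `scraped` with values from `overrides` applied.

    Assumption: nested dicts are merged recursively; every other value
    (strings, numbers, lists) from the override file replaces the scraped one.
    """
    merged = copy.deepcopy(scraped)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# A trimmed version of the scraped metadata from the first example.
scraped = {
    "contact": {"URL": "https://github.com/18F", "email": "18f@gsa.gov"},
    "name": "api.data.gov",
    "tags": ["github"],
}

# The contents of the second example, as it might appear in .code_gov.json.
overrides = json.loads(
    '{"contact": {"email": "jcastle@gsa.gov"}, "tags": ["github", "code_gov"]}'
)

result = apply_overrides(scraped, overrides)
```

Whether nested objects should be merged like this or replaced outright is one of the details the naming and format discussion would need to settle.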