OpenGeoMetadata/shared-repository

Proposed directory structure

Closed this issue · 29 comments

This is from @drh-stanford

/stanford/druid/{iso19139.xml,iso19110.xml,preview.jpg}
/tufts/???/fgdc.[xml or txt]
/nypl/???/???
etc.

cc @waynegraham what do you think?

I think it's straightforward enough, with subdirectories as appropriate. Ours will change as we get the Fedora models in place.

Also see #4 ... maybe just institutional repositories is the way to go

Just so I understand the structure correctly: is there a directory for each "layer"/cataloged object that contains the different formats and associated resources for that object? Also, might there be an advantage to mirroring parent/child relationships in the directory structure?

Also, I don't know much about the GitHub API. How easy would it be to have an automated process to pull files directly from multiple git repositories into a database/index, either scheduled or triggered by commit? Seems like it would be straightforward.

You might also want some institutional "metadata" in the institution's root. Logos, contact info, portal urls, etc.

Too much categorization built in to the directory structure is probably not a good thing, but are there some basic, universal categories that might be helpful? raster/vector/scanned maps?

I think the directory structure doesn't need raster/vector distinguishing folders because you can parse the metadata to figure out what the type is. It should be straightforward to build a lightweight rails app to put a semantic view onto the repo.

The directory structure is basically flat with a folder per institution and a sub-folder for each layer.

stanford/layer-id/iso19139.xml,fgdc.xml,...
tufts/layer-id/fgdc.xml,...
...

As for the GitHub API, I think the git protocol is the main one. So, to mirror, you'd just do a git clone or git pull as appropriate.
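For what it's worth, the clone-or-pull step could be sketched in Python like this. It assumes `git` is on the PATH; `mirror_command`, `mirror`, and the local destination layout are hypothetical names for illustration, not anything this project defines.

```python
import subprocess
from pathlib import Path
from typing import List


def mirror_command(repo_url: str, dest: Path) -> List[str]:
    """Return the git command that brings `dest` up to date with `repo_url`."""
    if (dest / ".git").is_dir():
        # Already cloned: fast-forward the existing checkout.
        return ["git", "-C", str(dest), "pull", "--ff-only"]
    # First run: make a fresh clone.
    return ["git", "clone", repo_url, str(dest)]


def mirror(repo_url: str, dest: Path) -> None:
    """Clone or update a local mirror of one institution's repository."""
    subprocess.run(mirror_command(repo_url, dest), check=True)
```

A scheduled job could call `mirror` once per institutional repository and then re-index whatever changed. A commit-triggered variant would need something extra, such as a webhook receiver, which is not shown here.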

Having some institutional description files might be helpful -- maybe a simple README.txt that we can easily parse.

I created a Stanford repository for our metadata https://github.com/OpenGeoMetadata/edu.stanford.purl using our unique identifier prefix. What do you think? Then maybe use this shared repository as a way to provide examples on ways metadata can be modeled?

@chrissbarnett I think there are some cool opportunities to build awesome stuff off of the GitHub API. For instance, a button in the metadata creator that can automatically create a pull request to the respective repository.
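As a rough sketch of what such a button might do behind the scenes: GitHub's REST API has a "create a pull request" endpoint (`POST /repos/{owner}/{repo}/pulls`). The helper below only builds the request and does not send it; the function name, branch names, and token handling are assumptions for illustration.

```python
import json
import urllib.request


def pull_request(owner: str, repo: str, head: str, base: str,
                 title: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a request that would open a pull request."""
    payload = {"title": title, "head": head, "base": base}
    return urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/pulls",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would actually submit it. The branch named in `head` has to exist already, with the new metadata committed to it, so a real button would need a step before this one.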

@mejackreed Is Reverse-DNS (thank you Wikipedia) common in this context? I've never seen it outside of Java style classes. Seems tidy and good for sorting. Princeton's would be edu.princeton.arks.

This is something that @kimdurante is doing with our ISO metadata. I'm not sure it's the way to go, but it is one namespacing proposal. I know that Berkeley is also using ARKs. @kimdurante can you weigh in? It may make sense to just use "stanford". We would probably want to use whatever is decided upon for GeoBlacklight as an item's static URL.

We use edu.stanford.purl:layer-id as our UUIDs. We should probably add this structure to the top of the file hierarchy (edu/stanford/purl/layer-id or edu/princeton/ark/layer-id) to preserve the unique identifiers.

@drh-stanford do you think it is good enough to have the namespacing happen at the repository level?

opengeometadata/edu.stanford.purl/layer-id/{iso}.xml

or should this also be represented within an institution's repository?

opengeometadata/edu.stanford.purl/edu/stanford/purl/layer-id/{iso}.xml

I think the first one would simplify harvesting.

@drh-stanford re: categories, I was thinking more for use by others. For our use, I have no problem with a flat structure.

@mejackreed, @drh-stanford I don't think you should build too much namespacing into the directory structure. Different institutions will have different ways of namespacing and then you've made harvesting each repository slightly different.

@mejackreed I love the idea of the pull request button. I would imagine that an admin could also have a merge pull request button.

I agree that @mejackreed's first example:
opengeometadata/edu.stanford.purl/layer-id/{iso}.xml
is a simpler structure and will still allow us to serve out multiple formats.

As for metadata file identifiers in ISO, we have chosen to follow these recommendations:
http://www.ngdc.noaa.gov/wiki/index.php?title=Data_Set_Identifiers_and_other_Unique_IDs

A new feature in the forthcoming ISO 19115-1 standard is the inclusion of naming authorities for FileIdentifiers. The structure of this field is similar to the way projection codes are referenced by a namespace (most often: EPSG).

fileIdentifier: edu.stanford.purl:dg850pt1796

I think that this is fine where the top-level is the namespace for the institution -- the goal is that given a UUID like edu.stanford.purl:dg850pt1796 you should be able to directly locate whether that layer exists in the repository.

opengeometadata/edu.stanford.purl/dg850pt1796/iso19139.xml
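To make the "directly locate" idea concrete, here is a minimal sketch of turning a UUID into a folder path. It assumes the namespace is everything before the first colon and that the folder is named after the rest; `layer_dir` is a hypothetical helper name.

```python
from pathlib import PurePosixPath


def layer_dir(uuid: str) -> PurePosixPath:
    """Map 'namespace:layer-id' to 'namespace/layer-id' under the repo root."""
    namespace, sep, layer_id = uuid.partition(":")
    if not sep or not namespace or not layer_id:
        raise ValueError(f"expected 'namespace:layer-id', got {uuid!r}")
    return PurePosixPath(namespace) / layer_id


print(layer_dir("edu.stanford.purl:dg850pt1796") / "iso19139.xml")
# edu.stanford.purl/dg850pt1796/iso19139.xml
```

Checking whether a layer exists is then just a matter of testing that path inside a local clone.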

👍

given the scale of the repos (10,000+ layers, etc.) we should probably provide for optional mapping support from layer UUIDs to the layer folder. For example, in our case:

opengeometadata/edu.stanford.purl/dg850pt1796/iso19139.xml

would be in a sub-directory tree to provide for scale:

opengeometadata/edu.stanford.purl/dg/850/pt/1796/iso19139.xml

To implement this we could use a simple UUID->folder mapping, like so:

layers.json:

  {
    "edu.stanford.purl:dg850pt1796":"edu.stanford.purl/dg/850/pt/1796",
    "org.example:123":"org.example/some/random/scheme/123",
   ...
  }

where layers.json is in the top-level institution directory and is optional (with the default being to look in the institution directory for a folder named after the UUID).

I've put an example of this structure in https://github.com/OpenGeoMetadata/edu.stanford.purl
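The optional lookup with its default could be resolved roughly as follows. This is a sketch under two assumptions: that mapped values are relative to the directory holding the institution folders (as in the example entries), and that the default folder name is the part of the UUID after the colon. `load_layers` and `resolve` are hypothetical names.

```python
import json
from pathlib import Path
from typing import Dict


def load_layers(institution_dir: Path) -> Dict[str, str]:
    """Read the optional layers.json; an absent file means no overrides."""
    path = institution_dir / "layers.json"
    if not path.is_file():
        return {}
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def resolve(uuid: str, layers: Dict[str, str]) -> str:
    """Return the layer folder for `uuid`, honoring layers.json when present."""
    if uuid in layers:
        return layers[uuid]
    # Default: institution folder named after the namespace,
    # layer folder named after the identifier.
    namespace, _, layer_id = uuid.partition(":")
    return f"{namespace}/{layer_id}"
```

With the mapping above loaded, `resolve("edu.stanford.purl:dg850pt1796", layers)` gives the sharded folder, while an institution without a layers.json falls through to the flat default.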

In layers.json does it make sense to also include what formats are available for each layer? Or is this too verbose? I imagine most institutions will provide metadata in whatever format they have standardized on.

@mejackreed My feeling is that putting available formats in layers.json is too verbose. Perhaps a json manifest in each directory? If the filenames are standardized, is it reasonable to expect that a harvester could get that from parsing the documents in the directory itself? In our case, our available formats will be different depending on the type of data. FGDC for datasets. MODS or DC for scanned maps.

Any metadata about the layers can be found in the layer's folder using the ISO or FGDC or MODS or whatever format they have (and the harvester carries the burden of figuring out the formats). layers.json is meant to be as minimal as possible and for organization only. I think we can have a semantic layer through a Rails app that can provide a harvesting meta-interface, but the structural repository should be as simple as possible imho.
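A harvester could work out the formats from standardized filenames alone, along these lines. The ISO and FGDC filenames come from the examples in this thread; `mods.xml` is an assumed name, and `available_formats` is a hypothetical helper.

```python
from pathlib import Path
from typing import Dict

# Filenames taken from the examples in this thread; "mods.xml" is an assumption.
KNOWN_FORMATS = {
    "iso19139.xml": "ISO 19139",
    "iso19110.xml": "ISO 19110",
    "fgdc.xml": "FGDC",
    "mods.xml": "MODS",
}


def available_formats(layer_dir: Path) -> Dict[str, Path]:
    """Map format name -> file path for each recognized file in a layer folder."""
    found = {}
    for entry in sorted(layer_dir.iterdir()):
        label = KNOWN_FORMATS.get(entry.name)
        if label is not None and entry.is_file():
            found[label] = entry
    return found
```

Unrecognized files such as preview.jpg are simply ignored, so no per-directory manifest is needed as long as the filenames stay standardized.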

tmcw commented

> maybe a simple README.txt that we can easily parse.

👍 to this: right now the contents of this repository are difficult for an outsider to understand and quick markdown/whatever readmes with links to standards and websites would help tremendously.

we've added a README for this repo that gives instructions and whatnot: #9

the simple parseable README.txt was to extract properties of an institution, such as their name, email contact, or whatever else we might need. it would sit next to the layers.json -- for example https://github.com/OpenGeoMetadata/edu.stanford.purl/blob/master/README

tmcw commented

If it's parsable, I don't think it should be a readme - that's metadata, and should be in a proper data format, like json or xml. READMEs are for humans, and should be freeform enough to contain details you don't expect they'll need.

I agree with @tmcw that the READMEs should be for humans. I'd favor JSON for institutional metadata. (Why doesn't anyone ever call it a "PARSEME"?)
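For illustration only, institutional metadata in JSON might be parsed like this. The file name `institution.json`, every key, and all the values are placeholders I made up for the sketch; nothing here has been agreed on in this thread.

```python
import json

# Hypothetical institution.json content; the file name and every key are assumptions.
EXAMPLE = """
{
  "name": "Example University Libraries",
  "contact_email": "gis@example.edu",
  "portal_url": "https://geodata.example.edu",
  "logo": "logo.png"
}
"""

REQUIRED_KEYS = ("name", "contact_email")


def parse_institution(text: str) -> dict:
    """Parse institutional metadata and check the keys a harvester relies on."""
    data = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {', '.join(missing)}")
    return data


print(parse_institution(EXAMPLE)["name"])
# Example University Libraries
```

That leaves the README free to stay freeform and human-oriented.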

Regarding the need for subdirectories to accommodate large repos with 10,000+ layers, are there any hard limits or soft recommendations for max files per directory? (based on git, GitHub, or any other systems in use)

the limits are performance-related, not filesystem-related -- i.e., file systems can accommodate a great many files per directory. i'm not sure what the soft limits are for optimum performance, but my guess would be fairly small, like ~1000 files per directory or ~100 KB for the directory inodes.
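To see whether a given repository is anywhere near that guess, a quick check of a local clone could look like this. The default of 1000 is just the rough figure above, not a measured limit, and `crowded_dirs` is a hypothetical helper.

```python
import os
from typing import Dict


def crowded_dirs(root: str, limit: int = 1000) -> Dict[str, int]:
    """Return directories under `root` holding more than `limit` entries."""
    crowded = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git's own bookkeeping; pruning in place stops os.walk descending.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        count = len(dirnames) + len(filenames)
        if count > limit:
            crowded[dirpath] = count
    return crowded
```

An empty result would suggest the flat layout is fine for that institution; a non-empty one points at the folders that might benefit from a sharded tree and a layers.json mapping.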