Proposed directory structure

Question

Closed this issue 2 years ago · 29 comments

mejackreed commented 10 years ago

This is from @drh-stanford

/stanford/druid/{iso19139.xml,iso19110.xml,preview.jpg}
/tufts/???/fgdc.[xml or txt]
/nypl/???/???
etc.

mejackreed commented 10 years ago

👍

Answer 1 · 2014-11-06T21:26:48.000Z

cc @waynegraham what do you think?

Answer 2 · 2014-11-06T21:35:50.000Z

I think it's straightforward enough... with subdirectories as appropriate. Ours will change as we get the Fedora models in place.

Answer 3 · 2014-11-06T21:48:35.000Z

Also see #4 ... maybe just institutional repositories is the way to go

Answer 4 · 2014-11-06T22:29:14.000Z

Just so I understand the structure correctly: is there a directory for each "layer"/cataloged object that contains the different formats and associated resources for that object? Also, might there be an advantage to mirroring parent/child relationships in the directory structure?

Answer 5 · 2014-11-06T22:35:16.000Z

Also, I don't know much about the GitHub API. How easy would it be to have an automated process to pull files directly from multiple git repositories into a database/index, either scheduled or triggered by commit? Seems like it would be straightforward.

Answer 6 · 2014-11-06T22:41:01.000Z

You might also want some institutional "metadata" in the institution's root. Logos, contact info, portal urls, etc.

Answer 7 · 2014-11-06T22:42:45.000Z

Too much categorization built in to the directory structure is probably not a good thing, but are there some basic, universal categories that might be helpful? raster/vector/scanned maps?

Answer 8 · 2014-11-07T00:33:32.000Z

I think the directory structure doesn't need raster/vector distinguishing folders because you can parse the metadata to figure out what the type is. It should be straightforward to build a lightweight rails app to put a semantic view onto the repo.

The directory structure is basically flat with a folder per institution and a sub-folder for each layer.

stanford/layer-id/iso19139.xml,fgdc.xml,...
tufts/layer-id/fgdc.xml,...
...

As for the GitHub API, I think the git protocol is the main one. So, to mirror, you'd just do a git clone or git pull as appropriate.
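A minimal sketch of that clone-or-pull mirroring in Python, assuming `git` is installed and on the PATH. The function names `mirror_command` and `mirror` are hypothetical, and `--ff-only` is my own choice to keep a read-only mirror from creating merge commits.

```python
import subprocess
from pathlib import Path
from typing import List


def mirror_command(repo_url: str, dest: Path) -> List[str]:
    """Return the git command that brings `dest` up to date with `repo_url`."""
    if (dest / ".git").is_dir():
        # A checkout already exists: update it in place, fast-forward only.
        return ["git", "-C", str(dest), "pull", "--ff-only"]
    # No checkout yet: make a fresh clone.
    return ["git", "clone", repo_url, str(dest)]


def mirror(repo_url: str, dest: Path) -> None:
    """Run the clone or pull; raises CalledProcessError if git fails."""
    subprocess.run(mirror_command(repo_url, dest), check=True)


# Usage (not run here, needs network access):
# mirror("https://github.com/OpenGeoMetadata/edu.stanford.purl",
#        Path("mirrors/edu.stanford.purl"))
```

Run on a schedule (cron, for example), this gives the "scheduled" variant; a commit-triggered variant would need a webhook receiver, which is not shown.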

Having some institutional description files might be helpful -- maybe a simple README.txt that we can easily parse.

Answer 9 · 2014-11-07T02:10:23.000Z

I created a Stanford repository for our metadata https://github.com/OpenGeoMetadata/edu.stanford.purl using our unique identifier prefix. What do you think? Then maybe use this shared repository as a way to provide examples on ways metadata can be modeled?

Answer 10 · 2014-11-07T02:16:14.000Z

@chrissbarnett I think there are some cool opportunities to build awesome stuff off of the Github api. For instance a button in the metadata creator that can automatically create a pull request to a respective repository.

Answer 11 · 2014-11-07T04:16:00.000Z

@mejackreed Is Reverse-DNS (thank you Wikipedia) common in this context? I've never seen it outside of Java style classes. Seems tidy and good for sorting. Princeton's would be edu.princeton.arks.

Answer 12 · 2014-11-07T04:21:57.000Z

This is something that @kimdurante is doing with our ISO metadata. I'm not sure it's the way to go, but it is one namespacing proposal. I know that Berkeley is also using ARKs. @kimdurante can you weigh in? It may make sense to just use "stanford". We would probably want to use whatever is decided upon for GeoBlacklight as an item's static URL.

Answer 13 · 2014-11-07T06:48:38.000Z

we use edu.stanford.purl:layer-id as our uuids. We probably should add this structure to the top of the file hierarchy... edu/stanford/purl/layer-id or edu/princeton/ark/layer-id to preserve the unique identifiers.

Answer 14 · 2014-11-07T12:52:47.000Z

@drh-stanford do you think it is good enough to have the name spacing happen at the repository level?

opengeometadata/edu.stanford.purl/layer-id/{iso}.xml

or within an institution's repository should this also be represented

opengeometadata/edu.stanford.purl/edu/stanford/purl/layer-id/{iso}.xml

I think the first one would simplify harvesting.

Answer 15 · 2014-11-07T16:00:48.000Z

@drh-stanford re: categories, I was thinking more for use by others. For our use, I have no problem with a flat structure.

@mejackreed, @drh-stanford I don't think you should build too much namespacing into the directory structure. Different institutions will have different ways of namespacing and then you've made harvesting each repository slightly different.

Answer 16 · 2014-11-07T16:30:14.000Z

@mejackreed I love the idea of the pull request button. I would imagine that an admin could also have a merge pull request button.

Answer 17 · 2014-11-07T16:54:22.000Z

I agree with @mejackreed 's first example:
opengeometadata/edu.stanford.purl/layer-id/{iso}.xml
is a simpler structure and will still allow us to serve out multiple formats.

As for metadata file identifiers in ISO, we have chosen to follow these recommendations:
http://www.ngdc.noaa.gov/wiki/index.php?title=Data_Set_Identifiers_and_other_Unique_IDs

A new feature in the forthcoming ISO 19115-1 standard is the inclusion of naming authorities for FileIdentifiers. The structure of this field is similar to the way projection codes are referenced by a namespace (most often: EPSG).

fileIdentifier: edu.stanford.purl:dg850pt1796

Answer 18 · 2014-11-07T18:05:47.000Z

I think that this is fine where the top-level is the namespace for the institution -- the goal is that given a UUID like edu.stanford.purl:dg850pt1796 you should be able to directly locate whether that layer exists in the repository.

opengeometadata/edu.stanford.purl/dg850pt1796/iso19139.xml
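A minimal sketch of that direct lookup in Python, assuming the flat layout above (one folder per namespace, one sub-folder per layer) inside a local checkout. The function names `layer_dir` and `layer_exists` are hypothetical.

```python
from pathlib import Path


def layer_dir(uuid: str, root: Path = Path(".")) -> Path:
    """Map 'namespace:layer-id' to root/namespace/layer-id."""
    namespace, sep, layer_id = uuid.partition(":")
    if not sep or not namespace or not layer_id:
        raise ValueError(f"expected 'namespace:layer-id', got {uuid!r}")
    return root / namespace / layer_id


def layer_exists(uuid: str, root: Path = Path(".")) -> bool:
    """True when the layer's folder is present under `root`."""
    return layer_dir(uuid, root).is_dir()


# The UUID from this thread maps to its folder without any index lookup.
print(layer_dir("edu.stanford.purl:dg850pt1796").as_posix())
# prints: edu.stanford.purl/dg850pt1796
```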

Answer 19 · 2014-11-10T20:31:09.000Z

given the scale of the repos (10,000+ layers, etc.) we should probably provide for optional mapping support from layer UUIDs to the layer folder. For example, in our case:

opengeometadata/edu.stanford.purl/dg850pt1796/iso19139.xml

would be in a sub-directory tree to provide for scale:

opengeometadata/edu.stanford.purl/dg/850/pt/1796/iso19139.xml

To implement this we could use a simple UUID->folder mapping, like so:

layers.json:

  {
    "edu.stanford.purl:dg850pt1796": "edu.stanford.purl/dg/850/pt/1796",
    "org.example:123": "org.example/some/random/scheme/123",
    ...
  }

where layers.json is in the top-level institution directory and is optional (with the default being to look in the institution directory for a folder named UUID).
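A rough sketch of how a harvester might resolve a UUID under this scheme, in Python. Two things here are my assumptions rather than anything settled in the thread: mapped paths are taken relative to the parent of the institution directory (because the example values start with the institution folder name), and the default folder name is the part of the UUID after the colon. The function name `resolve_layer` is hypothetical.

```python
import json
from pathlib import Path


def resolve_layer(uuid: str, institution_dir: Path) -> Path:
    """Return the folder holding the metadata files for one layer UUID."""
    mapping_file = institution_dir / "layers.json"
    if mapping_file.is_file():
        mapping = json.loads(mapping_file.read_text(encoding="utf-8"))
        if uuid in mapping:
            # Assumption: mapped values begin with the institution folder
            # name, so they are resolved from the directory above it.
            return institution_dir.parent / mapping[uuid]
    # Default when layers.json is absent or has no entry for this UUID.
    # Assumption: the folder is named after the id that follows the colon.
    layer_id = uuid.partition(":")[2] or uuid
    return institution_dir / layer_id
```

Keeping the fallback means institutions with small collections never need to write a layers.json at all.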

Answer 20 · 2014-11-10T22:27:35.000Z

I've put an example of this structure in https://github.com/OpenGeoMetadata/edu.stanford.purl

Answer 21 · 2014-11-11T16:31:28.000Z

In layers.json does it make sense to also include what formats are available for each layer? Or is this too verbose? I imagine most institutions will provide metadata in whatever format they have standardized on.

Answer 22 · 2014-11-11T17:12:33.000Z

@mejackreed My feeling is that putting available formats in layers.json is too verbose. Perhaps a json manifest in each directory? If the filenames are standardized, is it reasonable to expect that a harvester could get that from parsing the documents in the directory itself? In our case, our available formats will be different depending on the type of data. FGDC for datasets. MODS or DC for scanned maps.

Answer 23 · 2014-11-11T18:11:04.000Z

Any metadata about the layers can be found in the layer's folder using the ISO or FGDC or MODS or whatever format they have (and the harvester carries the burden of figuring out the formats). layers.json is meant to be as minimal as possible and for organization only. I think we can have a semantic layer through a rails app that can provide a harvesting meta-interface, but the structural repository should be as simple as possible imho.
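To illustrate how the harvester could carry that burden, here is a small Python sketch that infers the available formats from standardized filenames in a layer folder. The names iso19139.xml, iso19110.xml and fgdc.xml come from this thread; mods.xml is an assumed filename, and `available_formats` is a hypothetical helper.

```python
from pathlib import Path
from typing import List

# Filename -> format label. "mods.xml" is an assumption; the thread only
# names the ISO and FGDC files.
KNOWN_FILES = {
    "iso19139.xml": "ISO 19139",
    "iso19110.xml": "ISO 19110",
    "fgdc.xml": "FGDC",
    "mods.xml": "MODS",
}


def available_formats(layer_dir: Path) -> List[str]:
    """List the metadata formats found in one layer folder, sorted by filename."""
    found = []
    for entry in sorted(layer_dir.iterdir()):
        label = KNOWN_FILES.get(entry.name.lower())
        if label and entry.is_file():
            found.append(label)
    return found
```

Files the table does not recognize, such as preview.jpg, are simply ignored, so no per-directory manifest is needed for this to work.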

Answer 24 · 2014-11-18T17:41:43.000Z

> maybe a simple README.txt that we can easily parse.

👍 to this: right now the contents of this repository are difficult for an outsider to understand and quick markdown/whatever readmes with links to standards and websites would help tremendously.

Answer 25 · 2014-11-18T18:13:11.000Z

we've added a README for this repo that gives instructions and whatnot: #9

the simple parseable README.txt was to extract properties of an institution, such as their name, email contact, or whatever else we might need. it would sit next to the layers.json -- for example https://github.com/OpenGeoMetadata/edu.stanford.purl/blob/master/README

Answer 26 · 2014-11-18T18:15:04.000Z

If it's parsable, I don't think it should be a readme - that's metadata, and should be in a proper data format, like json or xml. READMEs are for humans, and should be freeform enough to contain details you don't expect they'll need.
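As a sketch of what that could look like to a harvester, assuming a hypothetical JSON file that sits next to layers.json: the field names (`name`, `contact_email`, `portal_url`, `logo`) and the example values are made up for illustration, loosely following the "logos, contact info, portal urls" suggestion earlier in the thread.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Institution:
    name: str
    contact_email: Optional[str] = None
    portal_url: Optional[str] = None
    logo: Optional[str] = None  # path relative to the institution directory


def parse_institution(text: str) -> Institution:
    """Parse institutional metadata from JSON text; only `name` is required."""
    data = json.loads(text)
    if "name" not in data:
        raise ValueError("institution metadata needs a 'name' field")
    return Institution(
        name=data["name"],
        contact_email=data.get("contact_email"),
        portal_url=data.get("portal_url"),
        logo=data.get("logo"),
    )


# Illustrative content only; not taken from any real repository.
EXAMPLE = """
{
  "name": "Example University",
  "contact_email": "gis@example.edu",
  "portal_url": "https://geodata.example.edu",
  "logo": "logo.png"
}
"""
```

The README stays freeform for human readers, and anything a machine needs goes in the data file.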

Answer 27 · 2014-11-18T21:04:22.000Z

I agree with @tmcw that the READMEs should be for humans. I'd favor JSON for institutional metadata. (Why doesn't anyone ever call it a "PARSEME"?)

Regarding the need for subdirectories to accommodate large repos with 10,000+ layers, are there any hard limits or soft recommendations for max files per directory? (based on git, github, or any other systems in use)

Answer 28 · 2014-11-18T21:18:33.000Z

the limits are performance-related not filesystem-related -- i.e., file systems can accommodate a great many files per directory. i'm not sure what the soft limits are for optimum performance but my guess would be fairly small like ~1000 files per directory or ~100 KB for the directory inodes.
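One way to see whether a given checkout is anywhere near that guess is to count entries per directory. A small Python sketch follows; the default of 1000 is only the guess above, not a documented git or filesystem limit, and `crowded_dirs` is a hypothetical name.

```python
import os
from pathlib import Path
from typing import Dict


def crowded_dirs(root: Path, limit: int = 1000) -> Dict[str, int]:
    """Return {directory: entry count} for directories with more than `limit` entries."""
    crowded = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git's own bookkeeping; only the metadata tree matters here.
        if ".git" in dirnames:
            dirnames.remove(".git")
        count = len(dirnames) + len(filenames)
        if count > limit:
            crowded[dirpath] = count
    return crowded
```

An empty result would suggest the flat layout is fine for that repository; a non-empty one points at the directories that might benefit from a sub-directory tree and a layers.json mapping.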