OpenGeoMetadata/shared-repository

Dealing with Duplicate Metadata


While the short-term goal may be to expose our multi-institutional metadata holdings to a wider audience, harvesting all of the metadata from one institution into another institution's portal will likely result in duplicate records being indexed within that instance. When we discussed the creation of a metadata repository last year, it was envisioned that it might serve as the 'database of record' for multi-institutional holdings, much like the OCLC bibliographic service.

We already have something of an issue with this, as our geoportal currently contains roughly 200 records from Harvard for a restricted collection. If I submit our metadata for these same data into our repo, we will have two records for each layer, one of which is restricted to our users.

Although such a process could be managed by each individual institution, the idea of manually identifying hundreds of duplicate layers does not seem sustainable or fun.

It might be worth considering a strategy that identifies these aggregated metadata records as duplicates. I am curious about everyone's thoughts on this, as there are probably several ways to deal with it.

There are certain fields that are auto-populated from the data layer, such as geographic extent, spatial object information, and attribute info. Perhaps, by designating such fields as matching keys, we could begin to compare records and evaluate them for duplicates. Other fields, such as free-text ones, might have to undergo some sort of normalization (lowercasing, removing punctuation, etc.) in order to be used effectively; a rough sketch of what that could look like follows below.
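To make the idea concrete, here is a minimal sketch of a matching key built from auto-populated fields plus normalized free text. The field names, normalization rules, and rounding precision are all assumptions for illustration, not an agreed-upon scheme.

```python
import re
import hashlib

def normalize_text(value):
    """Lowercase, strip punctuation, and collapse whitespace in a free-text field."""
    value = value.lower()
    value = re.sub(r"[^\w\s]", "", value)
    return " ".join(value.split())

def matching_key(record):
    """Build a candidate duplicate-matching key from auto-populated fields.

    `record` is assumed to be a dict with keys like 'title', 'west', 'east',
    'south', 'north' (bounding box), and 'geometry_type' -- placeholder names,
    not actual OpenGeoMetadata field names.
    """
    # Round the bounding box so tiny precision differences do not break the match.
    bbox = "|".join(f"{float(record[k]):.3f}" for k in ("west", "east", "south", "north"))
    parts = [normalize_text(record["title"]), bbox, record["geometry_type"].lower()]
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()

# Records that hash to the same key become candidates for duplicate review.
```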

I am sometimes wary that this topic is out of scope. However, the lack of a centralized database, as well as of shared file identifiers, has created a situation where the same record is localized and then disseminated widely with no indication of its source. Addressing it would also help us coordinate cataloging efforts across organizations in a more mechanized way.

Just a thought...

I think that these might be separate, but definitely related, issues. Having duplicate or near-duplicate records is problematic, but it also gives you information about holdings, which can be important, especially for license-restricted layers. In most instances, if a layer is restricted for one institution, it will be for others, I would think. Additionally, restricted records are still useful because they let researchers and librarians know how to purchase/license the data themselves if they need it.

Rather than eliminating duplicates, perhaps a better strategy is to find some way to relate them. Ideally we could do this dynamically if we can put together a good similarity algorithm using the fields that you mention (see the sketch below).
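As a rough illustration of that kind of similarity check, the sketch below scores two records on a mix of fuzzy title matching and bounding-box overlap, and treats them as "related" above a threshold. The field names, weights, and threshold are invented for the example.

```python
from difflib import SequenceMatcher

def bbox_overlap(a, b):
    """Fraction of the smaller bounding box covered by the intersection (0..1)."""
    width = min(a["east"], b["east"]) - max(a["west"], b["west"])
    height = min(a["north"], b["north"]) - max(a["south"], b["south"])
    if width <= 0 or height <= 0:
        return 0.0
    intersection = width * height
    area_a = (a["east"] - a["west"]) * (a["north"] - a["south"])
    area_b = (b["east"] - b["west"]) * (b["north"] - b["south"])
    return intersection / min(area_a, area_b)

def similarity(rec_a, rec_b):
    """Weighted score combining fuzzy title similarity and extent overlap."""
    title_score = SequenceMatcher(None, rec_a["title"].lower(), rec_b["title"].lower()).ratio()
    return 0.6 * title_score + 0.4 * bbox_overlap(rec_a, rec_b)

def are_related(rec_a, rec_b, threshold=0.85):
    """Link (rather than delete) records that score above the threshold."""
    return similarity(rec_a, rec_b) >= threshold
```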

Once people start using metadata from the repository, they can reference the record they used. Is there a place for metadata provenance in ISO 19115/19139 and/or FGDC? We could build that into the terms of use. Also, any tools that we build could automatically insert the source record.

Using a record as a source for another metadata record doesn't guarantee that it's a duplicate though.

Most often, I've seen provenance information included in the Lineage section. This might be more applicable to data provenance, but I've seen references to metadata imports/transforms in there as well; a sketch of inserting such a note is below.
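For what it's worth, here is a minimal sketch, assuming an ISO 19139 record, of how a tool might append a provenance note naming the source record to the lineage statement. The wording of the note and the idea of piggybacking on the lineage statement are assumptions; the element paths themselves are standard 19139.

```python
import xml.etree.ElementTree as ET

GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"
NS = {"gmd": GMD, "gco": GCO}

def add_provenance_note(iso19139_path, source_file_identifier, out_path):
    """Append a note naming the source record to the LI_Lineage statement."""
    ET.register_namespace("gmd", GMD)
    ET.register_namespace("gco", GCO)
    tree = ET.parse(iso19139_path)
    statement = tree.find(".//gmd:LI_Lineage/gmd:statement/gco:CharacterString", NS)
    if statement is None:
        raise ValueError("No lineage statement found; one would need to be created.")
    note = f" Metadata derived from shared-repository record {source_file_identifier}."
    statement.text = (statement.text or "") + note
    tree.write(out_path, xml_declaration=True, encoding="UTF-8")
```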

Interestingly, the new ISO model splits this section out into its own standard (ISO 19157).
We might consider creating one component 19157 XML record for each metadata-providing institution. Since component records can have their own uuids, the uuid of the provider record could be referenced in the attributes of the distributed 19139 XML, similar to the way that uuids are referenced for the 19110 records now (see the sketch below).
This might reduce the burden of filling out lengthy citations in this section for future users, and the component metadata would come in a standardized package.
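To illustrate the kind of reference being described, the snippet below reads the existing 19110-style pattern, where a uuidref attribute on gmd:featureCatalogueCitation points at a separate component record. A provider-record reference could follow the same pattern; the element that would carry it is hypothetical, not part of the current schema, and the sample XML is abbreviated rather than schema-valid.

```python
import xml.etree.ElementTree as ET

NS = {"gmd": "http://www.isotc211.org/2005/gmd"}

# Existing pattern: a 19139 record points at its 19110 feature catalogue by uuid.
SAMPLE = """<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd">
  <gmd:contentInfo>
    <gmd:MD_FeatureCatalogueDescription>
      <gmd:featureCatalogueCitation uuidref="example-19110-record-uuid"/>
    </gmd:MD_FeatureCatalogueDescription>
  </gmd:contentInfo>
</gmd:MD_Metadata>"""

root = ET.fromstring(SAMPLE)
citation = root.find(".//gmd:featureCatalogueCitation", NS)
print(citation.get("uuidref"))  # the referenced component record's uuid

# A provider reference could work the same way: some agreed-upon element
# (hypothetical) carrying uuidref="<uuid of the institution's 19157 record>".
```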

I agree with your point re: relating records vs. eliminating them. I keep going back to the centralized model of there being one source record that all institutions use and attach their holdings information to.
I suppose there would need to be a quality assessment of existing duplicates to determine which one is 'the source record.'

We might also look at expressing provenance through the metadata fileIdentifier, which, if we all use namespace authorities, would provide provenance information for each institution (a rough sketch of such a convention is below). But I'm not sure if that's too simplistic. If later changes were made to the source record, the fileIdentifier would remain the same, but subsequent editors could be cited in the metadata contact information, indicating who did what, and when...
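Purely as an illustration of the namespace-authority idea, here is a sketch of parsing provenance out of a fileIdentifier if we settled on an authority-prefixed convention; the "authority:local-id" format and the example values are hypothetical, not anything we have agreed on.

```python
def split_file_identifier(file_identifier):
    """Split a hypothetical 'authority:local-id' fileIdentifier into its parts.

    Example convention (assumed, not agreed upon):
        'edu.stanford:example-record-id' -> ('edu.stanford', 'example-record-id')
    """
    authority, _, local_id = file_identifier.partition(":")
    if not local_id:
        raise ValueError("fileIdentifier does not follow the authority:local-id convention")
    return authority, local_id

authority, local_id = split_file_identifier("edu.stanford:example-record-id")
print(authority)  # the namespace authority, i.e. the institution that created the record
```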