NCEAS/metadig-checks

resource.URLs.resolvable

gothub opened this issue · 12 comments

Description

Any URL provided in the metadata (including abstract, location description, methods, and related references) resolves.

Priority

  • ESS-DIVE: Required

Issues

  • List of fields to extract URLs from: abstract, location description, methods, related references.
  • Note that this check could take a significant amount of time to execute.

Procedure

Extract URLs from the abstract, location description, methods, and related references, and check that each is in a valid format and resolves. If unresolvable URLs are found, include up to 3 of them in the check output. Also print the total number of URLs found and the total number that are not resolvable.
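As a rough sketch of that procedure, the extraction, format validation, and summary steps might look like the following. This is illustrative only, not the actual metadig-checks implementation: the field contents, the URL regex, and the function names are assumptions, and the network resolution step is left out here (see the metering discussion below).

```python
import re
from urllib.parse import urlparse

# Illustrative sketch (not the real check): find URLs in metadata
# text fields, validate their format, and summarize the results.
URL_PATTERN = re.compile(r'https?://\S+')

def extract_urls(fields):
    """Collect unique URLs from a list of metadata text fields."""
    urls = []
    for text in fields:
        for url in URL_PATTERN.findall(text or ""):
            url = url.rstrip('.,;')  # drop trailing sentence punctuation
            if url not in urls:
                urls.append(url)
    return urls

def is_valid_format(url):
    """A URL is well-formed if it parses with an http(s) scheme and a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

def summarize(urls, unresolvable, max_listed=3):
    """Build the check output: totals plus up to 3 failing URLs."""
    msg = f"{len(urls)} URLs found, {len(unresolvable)} not resolvable."
    if unresolvable:
        msg += " Examples: " + ", ".join(unresolvable[:max_listed])
    return msg

# Hypothetical field contents for demonstration:
fields = ["See https://example.org/data and http://bad.example/x.",
          "Methods described at https://example.org/data"]
urls = extract_urls(fields)
print(summarize(urls, ["http://bad.example/x"]))
```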

The check will meter HTTP HEAD requests if many URLs are found, so as not to overwhelm DataONE MNs or other servers.
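One simple way to meter the requests is to pause between HEAD requests once the URL count passes a threshold. The sketch below assumes made-up threshold and delay values; the actual check's metering strategy and configuration are not specified in this issue.

```python
import time
from urllib.request import Request, urlopen
from urllib.error import URLError

def head_resolves(url, timeout=10):
    """Return True if a HEAD request for the URL succeeds (2xx/3xx)."""
    try:
        req = Request(url, method="HEAD")
        with urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (URLError, ValueError):
        return False

def check_urls(urls, meter_threshold=10, delay_seconds=0.5,
               resolver=head_resolves):
    """Check each URL, sleeping between requests when there are many.

    meter_threshold and delay_seconds are illustrative defaults,
    not values from the real check.
    """
    meter = len(urls) > meter_threshold
    failures = []
    for i, url in enumerate(urls):
        if meter and i > 0:
            time.sleep(delay_seconds)
        if not resolver(url):
            failures.append(url)
    return failures
```

Passing `resolver` as a parameter also makes the throttling logic testable without any network access.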

ESS-DIVE may provide a list of valid domains to check URLs against. URLs in the metadata that are not in this list would not be checked for resolvability. Note that implementing this feature is dependent on the list being provided.
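If such a list is provided, the filtering step could be as simple as a hostname match against the allowed domains. The domains below are placeholders, since no list exists yet:

```python
from urllib.parse import urlparse

def filter_by_domain(urls, allowed_domains):
    """Keep only URLs whose hostname is in, or a subdomain of,
    an allowed domain. allowed_domains is the hypothetical
    ESS-DIVE-provided list."""
    kept = []
    for url in urls:
        host = urlparse(url).hostname or ""
        if any(host == d or host.endswith("." + d) for d in allowed_domains):
            kept.append(url)
    return kept
```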

Requested response for failed check

One or more links provided in the metadata do not resolve correctly.

@gothub I updated this check with the requested information. Is this doable for implementation by March? @vchendrix

@JEDamerow yes, this can be delivered by March.

@JEDamerow @vchendrix I'd like to rename this to resource.URLs.resolvable to avoid confusion with
metadata.identifier.resolvable. The URLs being resolved are not for locating the metadata object, but for the resource in general. How does that sound?

@gothub Can we get some more information about how long this check may take and options so that it does not slow down the overall assessment report? Here are some questions/options that we discussed:

  • Is it possible to run the check on a separate thread so that it doesn't slow down the other checks?
  • How often is the check run? Can we control how often it is run?
  • Can we do this once when the dataset is first submitted, and run subsequent URL checks offline (not integrated with the UI)?

Output in failed assessment report: the number of URLs that were not valid, and up to 5 of the links that were not valid.

@gothub We still have questions above to make sure the check does not slow down assessment reports, etc. But, we will not include any list of accepted URLs at this time. This will just check that the URLs provided resolve.

@JEDamerow In the past, when other checks have been developed, a representative dataset was tested and timed to establish a worst-case processing time.

If such a dataset isn't known, an alternative just for development is to include in the check output the number of unique URLs found and the elapsed processing time for the check. Once these numbers are analyzed, they can be removed from the check for production.
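That development-only instrumentation could be a thin wrapper around the resolution step, reporting the unique URL count and the elapsed time. The function and key names here are assumptions for illustration; `check_fn` stands in for whatever performs the actual resolution:

```python
import time

def timed_check(urls, check_fn):
    """Run the URL check and report counts plus elapsed time.
    Intended as temporary, development-only output."""
    start = time.monotonic()
    failures = check_fn(urls)
    elapsed = time.monotonic() - start
    return {"unique_urls": len(set(urls)),
            "failures": failures,
            "elapsed_seconds": round(elapsed, 3)}
```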

Initial revision saved in commit 9504427

@JEDamerow Note that when no URLs are found in the designated metadata fields, the 'SUCCESS' message may be confusing - "No URLs were found in the metadata.". I'm not sure how to improve the wording here.

What about "Not applicable, because no URLs were found in the abstract, location description, methods, or related references"? Is that too long?

That message is fine. Is it OK for the check to 'PASS', even if they don't have any URLs?