prefix commons curie prefix checker

Prefix Ping

Goal

Check a number of life science registries to see if a string has been claimed as a namespace.

Specifics

Taking a two-pronged (hopefully not three-pronged) approach:

  • Easy
    If a registry implements its service in such a way that the question

    Do you have a page for this prefix?

    can be answered with HTTP status codes, then its base URL is added to the local registry_url.txt.

    Note: A couple of registries return a wrong code (500 server error). We should try to get them fixed.
    Update: We have extended the "short circuit Y/N" approach to return a more comprehensive report, regardless of whether the prefix has already been found, so return codes matter much less now.

  • Okay
    If the data file a site is generated from is available, use that data directly. Currently this takes the form of YAML files from GO and CDL/EBI (db-xrefs.yaml and cdl_ebi_prefixes.yaml), which together cover about 1,000 prefixes (overlap not yet checked).

    Another source could be added via a SPARQL query, but unless there is a very cheap way to tell whether the remote has been updated, it may not be worth it.

  • Screen scraping
    It is expensive and a pain, so we hope to avoid it as much as possible. So far only one source falls here, and I have sent an email to try to rectify that.
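The "Easy" approach can be sketched roughly as follows. This is a minimal illustration, not the code in prefixping.py: the function names are hypothetical, and it assumes that registry_url.txt holds one base URL per line and that a registry's page for a prefix lives at the base URL with the prefix appended. It records the status code for every registry instead of stopping at the first hit, in the spirit of the more comprehensive report.

```python
from urllib import error, request


def load_base_urls(path="registry_url.txt"):
    """Read base URLs, assuming one per line (blank lines skipped)."""
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]


def status_for(url, timeout=10):
    """Return the HTTP status code for a HEAD request, or None if unreachable."""
    req = request.Request(url, method="HEAD")
    try:
        with request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except error.HTTPError as exc:
        return exc.code  # 4xx and 5xx responses still carry a status code
    except OSError:
        return None      # DNS failure, refused connection, timeout, ...


def check_registries(prefix, base_urls, fetch=status_for):
    """Build a per-registry report for a prefix without short-circuiting."""
    report = {}
    for base in base_urls:
        code = fetch(base + prefix)  # assumed URL layout: base URL + prefix
        report[base] = {"status": code, "found": code == 200}
    return report
```

Passing `fetch` as a parameter keeps the network call swappable, which makes it easier to cope with registries that answer with a misleading code such as 500.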

Service

A Python/Flask microservice, prefixping.py, is working.

Start a local Flask server from the command line:

  • export FLASK_APP=prefixping.py
  • export FLASK_DEBUG=1
  • flask run

The server should then be running at http://127.0.0.1:5000/

API call is:

http://<host>/prefix/foo

It returns a JSON blob listing the source registries queried and the result of each query.
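A small client for this call might look like the sketch below. The helper names `prefix_url` and `ping` are hypothetical, and the shape of the returned JSON is not specified here, so the result is simply decoded and handed back as-is.

```python
import json
from urllib import parse, request


def prefix_url(host, prefix):
    """Build the /prefix/<prefix> URL for a given host (host may include a port)."""
    return "http://{}/prefix/{}".format(host, parse.quote(prefix, safe=""))


def ping(host, prefix, timeout=10):
    """Call the service and decode whatever JSON it returns."""
    with request.urlopen(prefix_url(host, prefix), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the local server from the steps above running, `ping("127.0.0.1:5000", "foo")` would request http://127.0.0.1:5000/prefix/foo.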

Filtering

We want to promote sane prefixes, so, as with XML QNames, they must begin with a letter and must not contain a colon. Since CURIEs interchange colon with underscore for resolvability, prefixes should not contain underscores either.

Dots are best left to delimit the version number at the end of a local ID, but there are legacy identifiers (and schemes) that already use them within CURIE prefixes, so they are grudgingly allowed.

As for prefix length: one letter is too short, and a whole line is too long. Two letters is still pretty short, but we have GO: (Gene Ontology). Among the ~600 prefixes I have access to, the average length is 7 or 8 characters and the longest is 33, which is where I am setting the initial size limit.
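The rules so far can be expressed as a small validator. This is a sketch under assumptions: the name `is_sane_prefix` is hypothetical, the minimum length of 2 is inferred from the GO example, and the set of characters permitted after the first letter (ASCII letters, digits, dots, hyphens) is a guess, since only colons and underscores are explicitly excluded above.

```python
import re

MIN_LEN = 2   # inferred: one letter is too short, "GO" must pass
MAX_LEN = 33  # the initial size limit stated above

# Must start with a letter; no colon, no underscore; dots tolerated.
# Allowing digits and hyphens is an assumption made for this sketch.
PREFIX_RE = re.compile(r"[A-Za-z][A-Za-z0-9.\-]*")


def is_sane_prefix(prefix):
    """Return True if the prefix satisfies the length and character rules."""
    if not MIN_LEN <= len(prefix) <= MAX_LEN:
        return False
    return PREFIX_RE.fullmatch(prefix) is not None
```

For example, `is_sane_prefix("GO")` and `is_sane_prefix("kegg.compound")` pass, while `"GO_REF"`, `"GO:"`, and `"1abc"` are rejected.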

Case: mixed case can improve readability and is encouraged, but it cannot be considered when deciding whether a prefix is taken or exists. For sources we have access to or influence over, searching over lowercased prefixes will be most efficient.
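One way to get case-insensitive matching while still preserving the readable mixed-case form is to key a lookup table on the lowercased prefix. The sketch below is illustrative only; `build_index` and `lookup` are hypothetical names, and the list of prefixes stands in for whatever a local source file provides.

```python
def build_index(prefixes):
    """Map lowercased prefix -> first spelling seen in the source."""
    index = {}
    for prefix in prefixes:
        index.setdefault(prefix.lower(), prefix)
    return index


def lookup(index, candidate):
    """Return the registered spelling if the candidate is taken, else None."""
    return index.get(candidate.lower())
```

With `index = build_index(["GO", "UniProt"])`, both `lookup(index, "uniprot")` and `lookup(index, "UNIPROT")` return the registered spelling `"UniProt"`, and an unclaimed string returns `None`.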

TODO: Still need to check/confirm behavior of remote systems we have no access to.