Regex for BioSample doesn't validate example
I'm referring to the regex `^SAM[NED](\w)?\d+$` in https://registry.identifiers.org/registry/biosample, which can't match `SAMEA2397676` due to the second A (`SAME-->A<--2397676`) in the LUI.
I would propose implementing an automated pipeline that validates the examples against the provided regular expressions, as this seems to be a recurring issue.
Hi @athalhammer, you might have noticed from the issue tracker that Identifiers.org isn't really able to respond anymore. I'd suggest checking out the Bioregistry project (https://github.com/biopragmatics/bioregistry and https://bioregistry.io) for something similar that's being actively maintained and encourages community feedback.
With respect to your question, I think this works alright on Identifiers.org with https://identifiers.org/biosample:SAMEA2397676 as their page suggests - the `A` that you're pointing out in your comment seems to get matched by `(\w)?`, which allows one optional word character (here a letter) following the `SAME` and before the `2397676`.
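To make the match concrete, here is a minimal sketch using Python's `re` module. The pattern and the example identifier are the ones quoted in this thread; the helper name `optional_group` is only illustrative.

```python
import re

# Pattern as quoted from the Identifiers.org biosample record.
PATTERN = re.compile(r"^SAM[NED](\w)?\d+$")


def optional_group(lui):
    """Return what the optional (\\w)? group captured, or None if it was empty.

    Raises ValueError if the identifier does not match at all.
    """
    match = PATTERN.match(lui)
    if match is None:
        raise ValueError(f"{lui!r} does not match {PATTERN.pattern}")
    return match.group(1)


# [NED] consumes the "E"; the optional group (\w)? consumes the "A";
# \d+ consumes the remaining digits.
print(optional_group("SAMEA2397676"))  # A
```

Note that `\w` also matches digits and underscores, so the group is not strictly limited to letters.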
This also works fine on the Bioregistry at https://bioregistry.io/biosample:SAMEA2397676
Also FYI, the Bioregistry is 100% open source and open data, so it's able to implement much more detailed CI to make sure that exactly this kind of thing stays consistent. For example, the following code makes sure that all example identifiers match the regular expressions for each record:
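A check of this kind can be sketched as follows. This is not the Bioregistry's actual test code: `RECORDS` and `find_mismatches` are hypothetical stand-ins, and only the biosample pattern and example come from this thread.

```python
import re

# Hypothetical stand-in for registry records: prefix -> (pattern, example).
# Only the biosample entry is taken from this thread.
RECORDS = {
    "biosample": (r"^SAM[NED](\w)?\d+$", "SAMEA2397676"),
}


def find_mismatches(records):
    """Return the prefixes whose example identifier fails its own pattern."""
    return [
        prefix
        for prefix, (pattern, example) in records.items()
        if re.fullmatch(pattern, example) is None
    ]


print(find_mismatches(RECORDS))  # []
```

Run in CI, a non-empty result would flag any record whose example and pattern have drifted apart.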
Thanks @cthoyt, you are completely right! I misinterpreted the optional `\w` character. Thanks also for all the additional pointers!
@athalhammer Please feel free to get in touch on the Bioregistry issue tracker or @ me if you find that something's missing from either service! I am not affiliated with Identifiers.org myself, but I am one of the developers/maintainers of the Bioregistry. We also just put out a preprint last week (https://www.biorxiv.org/content/10.1101/2022.07.08.499378v2) 🚀