Generic utility for querying existing BioThings APIs to retrieve canonical identifier (`_id`) used in specific APIs

Question

zcqian opened this issue 3 years ago · 0 comments

When developing additional data source uploaders (or data plugins), the identifiers used in the source may not match the identifiers used to join documents together.

Previously, each uploader had to implement this functionality separately. For instance, MyChem mostly uses the `datatransform` module here, which queries the MongoDB collections where other data is stored. Some MyDisease plugins queried MyDisease for the primary `_id`, as shown here.

The downside of using `datatransform` is that it performs a lot of queries, its exact behavior is not well documented, and it depends heavily on MongoDB (querying the BioThings APIs instead is not implemented in practice).

On the other hand, querying each service, whether through something bundled within the BioThings SDK or separately, introduces a chicken-and-egg problem: the API must be up before it can be queried. This makes bootstrapping impossible, or it may require running the upload-build-release-install process at least twice to get the most up-to-date data, because each time the identifier is retrieved using data from a previous release.

Either way, until the BioThings SDK is capable of building documents by joining on arbitrary fields (i.e. not limited to joining on `_id`), we should provide a well-documented standard interface for this type of lookup.
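To make the request more concrete, here is a rough sketch of what such a lookup helper could look like when it goes through a BioThings API rather than MongoDB. The names `post_query` and `lookup_canonical_ids` are hypothetical and not part of the BioThings SDK; the sketch assumes the batch form of the `/query` endpoint (POST with `q`, `scopes`, and `fields`) and that each hit echoes the input term in its `query` key. It does not solve the bootstrapping problem described above, since it still needs a running API.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterable, List, Optional


def post_query(api_url: str, params: Dict[str, str]) -> List[dict]:
    """POST form-encoded params to a BioThings-style /query endpoint, return parsed JSON."""
    data = urllib.parse.urlencode(params).encode("utf-8")
    request = urllib.request.Request(api_url, data=data)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def lookup_canonical_ids(
    source_ids: Iterable[str],
    scopes: Iterable[str],
    api_url: str,
    fetch: Callable[[str, Dict[str, str]], List[dict]] = post_query,
) -> Dict[str, Optional[str]]:
    """Map each source identifier to the `_id` used by the target API.

    Hypothetical helper, not an SDK function. Identifiers with no match
    map to None. If an identifier has several hits, the first one
    returned is kept. Identifiers containing commas are not handled.
    """
    source_ids = list(source_ids)
    if not source_ids:
        return {}

    params = {
        "q": ",".join(source_ids),    # batch of terms to look up
        "scopes": ",".join(scopes),   # fields to match the terms against
        "fields": "_id",              # only the canonical identifier is needed
    }
    hits = fetch(api_url, params)

    result: Dict[str, Optional[str]] = {sid: None for sid in source_ids}
    for hit in hits:
        if hit.get("notfound"):
            continue
        term = hit.get("query")
        if term in result and result[term] is None:
            result[term] = hit["_id"]
    return result
```

An uploader would call it with something like `lookup_canonical_ids(ids, ["some.xref.field"], "https://mydisease.info/v1/query")`, where the scope field name is a placeholder. The `fetch` argument is injectable so the same interface could later be backed by a different source (a MongoDB collection, for example) and so it can be tested without network access.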