Missing lots of triples caused by overly selective query
fpriyatna opened this issue · 6 comments
Background: There exist records of drug interactions for MED-RT datasource as shown by this query: SELECT * FROM MRREL WHERE SAB = 'MED-RT' AND RELA = 'has_contraindicated_drug';
Issue: The main table that stores the UMLS concepts, MRCONSO, does not store a MED-RT code for the concepts returned by the Background query.
Cause: Some UMLS properties do not appear in the triples because, although those properties exist in the relationship table (MRREL), the associated concepts (CUI column) are not stored in the concept table (MRCONSO) under the corresponding datasource (SAB column).
- Original/Old Query: query that retrieves the list of CUIs from the concept (MRCONSO) table: SELECT DISTINCT CUI FROM MRCONSO WHERE SAB = 'MED-RT' ; → returns 3392 rows
- Proposed Patch/New Query: query that retrieves the list of CUIs from the relation (MRREL) table: SELECT DISTINCT CUI FROM (SELECT CUI1 AS CUI FROM MRREL WHERE SAB = 'MED-RT' UNION SELECT CUI2 AS CUI FROM MRREL WHERE SAB = 'MED-RT') V1; → returns 21941 rows
The above example applies to other datasources as well, not just MED-RT.
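To make the gap concrete, here is a minimal sketch in Python using an in-memory SQLite database. Only the two queries come from this issue; the CUIs, the `OTHER_SAB` source name, and the reduced column set are toy assumptions for illustration, not real UMLS content.

```python
import sqlite3

# Toy stand-ins for MRCONSO and MRREL (assumption: only the columns
# used by the two queries are modelled; all rows are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MRCONSO (CUI TEXT, SAB TEXT, STR TEXT);
CREATE TABLE MRREL (CUI1 TEXT, CUI2 TEXT, SAB TEXT, RELA TEXT);
INSERT INTO MRCONSO VALUES
  ('C9000001', 'MED-RT',    'toy MED-RT concept'),
  ('C9000002', 'OTHER_SAB', 'toy drug A'),
  ('C9000003', 'OTHER_SAB', 'toy drug B');
INSERT INTO MRREL VALUES
  ('C9000001', 'C9000002', 'MED-RT', 'has_contraindicated_drug'),
  ('C9000001', 'C9000003', 'MED-RT', 'has_contraindicated_drug');
""")

# Old query: CUIs that MRCONSO lists under the MED-RT source.
old_query = "SELECT DISTINCT CUI FROM MRCONSO WHERE SAB = 'MED-RT'"

# Proposed query: CUIs on either side of a MED-RT relation in MRREL.
new_query = """
SELECT DISTINCT CUI FROM (
  SELECT CUI1 AS CUI FROM MRREL WHERE SAB = 'MED-RT'
  UNION
  SELECT CUI2 AS CUI FROM MRREL WHERE SAB = 'MED-RT'
) V1
"""

old_cuis = {row[0] for row in conn.execute(old_query)}
new_cuis = {row[0] for row in conn.execute(new_query)}

# CUIs that take part in MED-RT relations but are skipped by the old query.
missing = sorted(new_cuis - old_cuis)
print(missing)
```

On the toy data, the two drug CUIs appear in MED-RT relations but are skipped by the old query, which mirrors the 3392 versus 21941 difference reported above.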
Hi @fpriyatna
Thanks for your analysis.
I am skeptical about the "legitimacy" of these CUIs, the ones found in MRREL and involved in a relation, compared to the ones in MRCONSO. Typically, what information would you have for these CUIs if they are not in MRCONSO: the label (STR), the language (LAT), the code (CODE), or a definition in MRDEF?
UMLS2RDF is made to provide such information about a class.
Can you provide a few examples of CUIs found in MRREL but not present in MRCONSO, along with their label, language, and other info?
Thank you for your quick reply @jonquet !
For example, I need to find some drug interaction information, which I can get from the MRREL table with this query: SELECT * FROM MRREL WHERE RELA = 'has_contraindicated_drug' ;
Notice that in this case, all of the rows come from the MED-RT datasource. Now let's take one of the results with cui C0000294. Apparently, in the MRCONSO table, there is no information for that cui from the MED-RT datasource. As a result, the triples generated by umls2rdf contain no information about the drug, because the relevant query that is executed by umls2rdf is something like SELECT ... FROM MRCONSO WHERE SAB='MED-RT' AND ...
However, once we know the cui of the drug, we don't really need to get the concepts from the same datasource. Thus, in order to get the drug information, we can do something like SELECT * FROM MRCONSO INNER JOIN (SELECT CUI1 AS CUI FROM MRREL WHERE SAB='MED-RT' UNION SELECT CUI2 AS CUI FROM MRREL WHERE SAB='MED-RT') V1 ON MRCONSO.CUI = V1.CUI
and get the drug information in the triples.
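As a rough sketch of that join, here it is in Python against an in-memory SQLite database. The helper name `concepts_linked_from`, the `OTHER_SAB` source, and every row are hypothetical and only serve the illustration; the subquery aliases both columns as CUI so that the `V1.CUI` join column exists.

```python
import sqlite3

def concepts_linked_from(conn, sab):
    """Return MRCONSO rows, from any source, for CUIs used in `sab` relations."""
    query = """
    SELECT MRCONSO.CUI, MRCONSO.SAB, MRCONSO.STR
    FROM MRCONSO
    INNER JOIN (
      SELECT CUI1 AS CUI FROM MRREL WHERE SAB = ?
      UNION
      SELECT CUI2 AS CUI FROM MRREL WHERE SAB = ?
    ) V1 ON MRCONSO.CUI = V1.CUI
    ORDER BY MRCONSO.CUI, MRCONSO.SAB
    """
    return conn.execute(query, (sab, sab)).fetchall()

# Toy data (assumption): one MED-RT concept related to a drug whose
# label is only available under another source.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MRCONSO (CUI TEXT, SAB TEXT, STR TEXT);
CREATE TABLE MRREL (CUI1 TEXT, CUI2 TEXT, SAB TEXT, RELA TEXT);
INSERT INTO MRCONSO VALUES
  ('C9000001', 'MED-RT',    'toy class'),
  ('C9000002', 'OTHER_SAB', 'toy drug');
INSERT INTO MRREL VALUES
  ('C9000001', 'C9000002', 'MED-RT', 'has_contraindicated_drug');
""")

rows = concepts_linked_from(conn, "MED-RT")
for cui, sab, label in rows:
    print(cui, sab, label)
```

The drug row comes back with its label even though MRCONSO stores it under a different SAB than the relation.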
What do you think?
Now let's take one of the results with cui C0000294. Apparently, in the MRCONSO table, there is no information for that cui from the MED-RT datasource.
Indeed C0000294 is not in MED-RT.
Therefore, umls2rdf, which is software that extracts an "ontology form" of a specific UMLS SAB in RDFS format, will not include C0000294 in an export of MED-RT.
=> This is the normal behaviour I believe.
Another way to say this: in the original MED-RT, there is no definition of the relation you have found for this drug in the MRREL table, therefore it is normal that it is not exported.
But UMLS is a meta-thesaurus. It provides relationships between concepts in different source systems.
If umls2rdf only outputs the relationships that exist between concepts in the same source system then the tool should maybe be called "umls_source_systems_2rdf."
I don't think anyone would be upset if they used this tool and exported MED-RT as RDF and got triples that refer to concepts in other source systems. In fact, I think many users are looking for that.
At the very least I think the referenced PR should be enabled by a configuration item.
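A minimal sketch of what such a switch could look like, in Python with SQLite. The flag name `include_related_cuis` and both helper functions are hypothetical, not existing umls2rdf options; the default keeps the current per-source behaviour, and the data is made up.

```python
import sqlite3

def cui_query(include_related_cuis):
    """Pick the CUI query; the flag name is hypothetical, not a umls2rdf option."""
    if include_related_cuis:
        # Proposed behaviour: any CUI taking part in a relation of the source.
        return ("SELECT CUI1 AS CUI FROM MRREL WHERE SAB = :sab "
                "UNION SELECT CUI2 AS CUI FROM MRREL WHERE SAB = :sab")
    # Current behaviour: only CUIs that MRCONSO lists under the source.
    return "SELECT DISTINCT CUI FROM MRCONSO WHERE SAB = :sab"

def list_cuis(conn, sab, include_related_cuis=False):
    rows = conn.execute(cui_query(include_related_cuis), {"sab": sab})
    return sorted(row[0] for row in rows)

# Toy data (assumption): the second CUI only exists under another source.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MRCONSO (CUI TEXT, SAB TEXT);
CREATE TABLE MRREL (CUI1 TEXT, CUI2 TEXT, SAB TEXT);
INSERT INTO MRCONSO VALUES ('C9000001', 'MED-RT'), ('C9000002', 'OTHER_SAB');
INSERT INTO MRREL VALUES ('C9000001', 'C9000002', 'MED-RT');
""")

strict = list_cuis(conn, "MED-RT")
extended = list_cuis(conn, "MED-RT", include_related_cuis=True)
print(strict, extended)
```

With the flag off, an export stays limited to what the original source contains; turning it on opts in to the cross-source concepts.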
I understand your point @justin2004. Maybe indeed the system should have been called differently ;)
The idea that NCBO had when developing this tool was to consider UMLS as a "unified source" of different biomedical terminologies that could be loaded one by one into the NCBO BioPortal, a project that has taken a "per ontology" approach, not a "meta" approach as the UMLS did. This was debated several times years ago ... @mamusen can attest.
Therefore, it's super important that, even though the source information is taken from the UMLS Metathesaurus, only what was in the original source is included in the output of umls2rdf. I do agree that some knowledge, coming from the tremendous amount of work done by the Metathesaurus, is lost.
To avoid confusion, whenever an RDF source file exists outside of the UMLS Metathesaurus (for a source that is also in the UMLS), BioPortal will usually load that file rather than use umls2rdf to extract it from UMLS. This is the case in BioPortal for NCIT or GO.
Also, just to clarify: the aforementioned concepts are in UMLS, but they are in the relation table, not in the concept table.