Docetaxel ChEMBL term getting mapped to Docetaxel Trihydrate PubChem
The CHEMBL curie CHEMBL.COMPOUND:CHEMBL3545252 gets mapped to PUBCHEM.COMPOUND:148123 (which is Docetaxel Trihydrate).
If you put Docetaxel into name_resolver, you get back PUBCHEM.COMPOUND:148124, which is the term for Docetaxel, and seems like a better place to resolve to. Similarly, if you run NodeNormalizer on DrugCentral:939, it also resolves to PUBCHEM.COMPOUND:148124.
I think an easy solution for this is to have PUBCHEM.COMPOUND:148123 (Docetaxel Trihydrate) resolve to PUBCHEM.COMPOUND:148124 (Docetaxel).
This is also the state on NodeNorm Dev. The good news is that if you turn on drug conflation, these two cliques are merged. The bad news is that we currently use a simple rule to choose which PUBCHEM.COMPOUND to use to represent a drug conflated clique, which is to choose the one with the smallest CURIE suffix. Since 148123 < 148124, Docetaxel Trihydrate is used to represent this entire clique.
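To make the current behaviour concrete, here is a minimal sketch of the smallest-suffix rule. It is not the actual NodeNorm code: the function names are hypothetical, and it assumes every PUBCHEM.COMPOUND suffix is a plain integer.

```python
def curie_suffix(curie: str) -> int:
    """Return the numeric part after the last colon of a CURIE."""
    return int(curie.rsplit(":", 1)[1])


def pick_leader_by_smallest_suffix(curies):
    """Choose the PUBCHEM.COMPOUND CURIE with the smallest numeric suffix.

    Hypothetical helper illustrating the rule described in this comment.
    """
    pubchem = [c for c in curies if c.startswith("PUBCHEM.COMPOUND:")]
    return min(pubchem, key=curie_suffix)


# The two PubChem identifiers in the drug-conflated clique from this issue.
clique = ["PUBCHEM.COMPOUND:148124", "PUBCHEM.COMPOUND:148123"]
leader = pick_leader_by_smallest_suffix(clique)
print(leader)  # PUBCHEM.COMPOUND:148123, i.e. Docetaxel Trihydrate
```

The rule never looks at labels or chemistry, which is why the trihydrate wins here purely on its identifier.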
So, I think there are two big questions here:
- Why is CHEMBL.COMPOUND:CHEMBL3545252 being merged into PUBCHEM.COMPOUND:148123 rather than PUBCHEM.COMPOUND:148124, when the former is the trihydrate form and the latter is the parent compound (Docetaxel)? This is easy to answer: we clique chemicals using their InChI keys, and CHEMBL.COMPOUND:CHEMBL3545252 and PUBCHEM.COMPOUND:148123 have the same InChI key, while PUBCHEM.COMPOUND:148124 has a different one. There might be a better way to do this.
- Can we choose PUBCHEM.COMPOUND:148124 as the clique leader? One way of doing this might be to choose the identifier with the shortest label, but that would need more thought as well.
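Both points can be sketched in a few lines. This is only an illustration, not the real pipeline: the InChI keys are placeholders ("KEY-A", "KEY-B") because the actual values are not quoted in this issue, and the function names are hypothetical.

```python
from collections import defaultdict

# (CURIE, InChI key) pairs. The keys are placeholders, not real InChI keys;
# they only encode which identifiers share a key according to this issue.
records = [
    ("CHEMBL.COMPOUND:CHEMBL3545252", "KEY-A"),
    ("PUBCHEM.COMPOUND:148123", "KEY-A"),
    ("PUBCHEM.COMPOUND:148124", "KEY-B"),
]

# Labels as stated in this issue.
labels = {
    "PUBCHEM.COMPOUND:148123": "Docetaxel Trihydrate",
    "PUBCHEM.COMPOUND:148124": "Docetaxel",
}


def clique_by_inchikey(records):
    """Group CURIEs that share an InChI key into one clique."""
    groups = defaultdict(list)
    for curie, key in records:
        groups[key].append(curie)
    return dict(groups)


def pick_leader_by_shortest_label(curies, labels):
    """Pick the CURIE with the shortest label; ties fall back to CURIE order."""
    return min(curies, key=lambda c: (len(labels[c]), c))


cliques = clique_by_inchikey(records)
leader = pick_leader_by_shortest_label(
    ["PUBCHEM.COMPOUND:148123", "PUBCHEM.COMPOUND:148124"], labels
)
print(cliques["KEY-A"])  # the ChEMBL term lands with the trihydrate
print(leader)            # PUBCHEM.COMPOUND:148124
```

Shortest-label works for this pair, but it is a heuristic: a short synonym or abbreviation on the wrong form would flip the result.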
Could we use the fact that PUBCHEM.COMPOUND:148123 lists PUBCHEM.COMPOUND:148124 as its parent compound? (That is, Docetaxel Trihydrate states that its parent is Docetaxel.)
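A rough sketch of what that could look like, assuming we had a child-to-parent map loaded from PubChem. The `PARENT_OF` dict and the function name are hypothetical; how the parent relationship would actually be ingested is not covered here.

```python
# Hypothetical child -> parent map. The single entry reflects what this issue
# reports: the trihydrate lists Docetaxel as its parent compound.
PARENT_OF = {
    "PUBCHEM.COMPOUND:148123": "PUBCHEM.COMPOUND:148124",
}


def pick_leader_preferring_parent(curies, parent_of):
    """Prefer clique members whose parent is not itself in the clique.

    Falls back to the existing smallest-suffix rule among the candidates,
    assuming numeric CURIE suffixes.
    """
    members = set(curies)
    # A member is a "root" if it has no parent, or its parent is outside the clique.
    roots = [c for c in curies if parent_of.get(c) not in members]
    candidates = roots or list(curies)
    return min(candidates, key=lambda c: int(c.rsplit(":", 1)[1]))


clique = ["PUBCHEM.COMPOUND:148123", "PUBCHEM.COMPOUND:148124"]
leader = pick_leader_preferring_parent(clique, PARENT_OF)
print(leader)  # PUBCHEM.COMPOUND:148124 (Docetaxel)
```

Keeping smallest-suffix as the tie-breaker means cliques without any parent information would behave exactly as they do today.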