petermr/CEVOpen

Lookup unknown compounds in wikidata

petermr opened this issue · 21 comments

compounds in oil186 not in dictionary

https://github.com/petermr/CEVOpen/blob/master/searches/oil186/__tables/notFound.txt
is a list of "new" compounds in oil186 papers.

Create a 2-column table of these names and their Wikidata identifiers, where found.

Do not try to look up the explicit chemical names, such as:

tricyclo [4.4.0.0(2,7)] dec-3-ene-3-methanol, 1-methyl-8-(1-methylethyl)-

Sir, please go through the file for wikidata id - notFoundCompWIKIDATAPubChem.tsv.

Column description.

  • compound - not found compounds.
  • WIKIDATA_query_id - WIKIDATA query string with wikidata id.
  • notes - notes for found entries.

Sir, please review the sheet for WIKIDATA lookup - notFoundCompWIKIDATAPubChem.tsv

Please suggest a template for the PubChem lookup of compounds. Which columns should be included in the sheet?

We have two available services -

Identifier exchange services will provide CIDs OR InChIs OR InChIKeys OR SMILES OR Synonyms.
Download services will use retrieved CIDs and provide IUPAC name AND Synonyms AND InChIKey etc.

The PubChem lookup will be a two-step process: first retrieve CIDs from the identifier exchange services, then use the retrieved CIDs with the download services to get additional information about each compound.
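The two steps can be sketched in Python as below. This is only an illustration under stated assumptions: it uses PubChem's PUG REST interface rather than the Identifier Exchange and Download web services named above, and the helper names (`cid_url`, `property_url`, `parse_cids`, `parse_properties`) are hypothetical. The functions only build URLs and read the JSON payloads; fetching and error handling are left to the caller.

```python
from urllib.parse import quote

# Assumption: PUG REST is used here in place of the web-based services.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_url(name: str) -> str:
    """Step 1: URL that resolves a compound name to PubChem CIDs."""
    return f"{PUG_REST}/compound/name/{quote(name, safe='')}/cids/JSON"


def property_url(cids, properties=("IUPACName", "InChIKey")) -> str:
    """Step 2: URL that fetches properties for already-resolved CIDs."""
    cid_list = ",".join(str(c) for c in cids)
    return f"{PUG_REST}/compound/cid/{cid_list}/property/{','.join(properties)}/JSON"


def parse_cids(payload: dict) -> list:
    """Extract the CID list from a step-1 JSON payload (empty if absent)."""
    return payload.get("IdentifierList", {}).get("CID", [])


def parse_properties(payload: dict) -> dict:
    """Index the step-2 property rows by CID."""
    rows = payload.get("PropertyTable", {}).get("Properties", [])
    return {row["CID"]: row for row in rows}
```

For a sheet, the CIDs from step 1 would fill the CID column, and the property rows from step 2 would fill whichever extra columns are agreed on.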

Sir, revised sheet for WIKIDATA identifier addition - notFoundCompWIKIDATAPubChem.tsv.

Sir, revised sheet for WIKIDATA and PubChem lookup- notFoundCompWIKIDATAPubChem.tsv

I have cleaned compound names to support PubChem lookup.

**Replace Greek letters with their spelled-out names.** e.g. α -> alpha, β -> beta, γ -> gamma, δ -> delta, etc.

Compound Example - 

α-cedrene -> alpha-cedrene  (PubChem CID - 6431015) 
γ-eudesmol -> gamma-eudesmol (PubChem CID - 6432005)
β-gurjunene -> beta-gurjunene (PubChem CID - 6450812) 


**Capitalize E/Z isomer descriptors.** e.g. (e,e) -> (E,E); (2e, 6z) -> (2E,6Z), etc.

Compound Example -

(z)-α-santalol -> (Z)-alpha-santalol (PubChem CID - 11085337)
(e)-2-isopropyl-5-methylphenyl 2-methylbut-2-enoate -> (E)-2-Isopropyl-5-methylphenyl 2-methylbut-2-enoate (PubChem CID - 91698167)
(e)-β-ocimene -> (E)-beta-ocimene (PubChem CID - 5281553)

**Normalize dash characters to a plain hyphen.** (–) -> (-)

Compound Example - 

(−)-spathulenol -> (-)-spathulenol

**Trimming extra white spaces**
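The four cleaning rules above could be combined in one function, roughly as follows. This is a minimal sketch, not the script actually used for the sheet: the name `clean_name` is hypothetical, the Greek map covers only the four letters listed, and the stereodescriptor pattern only handles parenthesised E/Z groups with optional locants.

```python
import re

GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta"}
DASHES = {"\u2013": "-", "\u2212": "-"}  # en dash and minus sign -> hyphen

# Parenthesised E/Z descriptors such as (e), (e,e) or (2e, 6z).
STEREO = re.compile(r"\(\s*\d*[ezEZ](?:\s*,\s*\d*[ezEZ])*\s*\)")


def clean_name(name: str) -> str:
    """Apply the Greek-letter, dash, E/Z and whitespace rules in order."""
    for src, dst in {**GREEK, **DASHES}.items():
        name = name.replace(src, dst)
    # Upper-case the descriptor and drop any spaces inside the parentheses.
    name = STEREO.sub(lambda m: re.sub(r"\s+", "", m.group(0)).upper(), name)
    # Trim the ends and collapse runs of white space.
    return " ".join(name.split())


print(clean_name("(z)-α-santalol"))
print(clean_name("(−)-spathulenol"))
```

Keeping each rule as a separate table or pattern makes it easy to add more Greek letters or dash variants as they turn up in the extracted names.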

Extracted count of records - 465.

Yes sir. The next step is to normalize the extracted compound names.

Please suggest a way to automate the identification of compound synonyms.

OK sir.

Sir, please give an example of a dictionary entry containing a PubChem CID (instead of a Wikidata entry). Is it like `<entry term="thymol" pubchem_cid="6989" />`?

Sir, please go through the updated sheet for WIKIDATA and PubChem lookup -notFoundCompWIKIDATAPubChem.tsv

Added column for cleaned compound names - Cleaned_cmpnames

| Compound | Cleaned_cmpnames |
|---|---|
| butanoic acid | Butanoic acid |
| β-linalool | beta-Linalool |
| p-cymene | p-Cymene |
| n-hexanol | n-Hexanol |
| hexan-1-ol | Hexan-1-ol |
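The capitalization seen in the Cleaned_cmpnames column appears to skip leading prefixes such as beta-, p- and n- and capitalize the first letter of the parent name. The sketch below reproduces that pattern for names whose Greek letters are already spelled out; the rule itself is inferred from the five rows shown, and both the function name and the prefix set are assumptions.

```python
# Assumed set of leading prefixes that stay lower-case.
PREFIXES = {"alpha", "beta", "gamma", "delta", "p", "n", "o", "m"}


def capitalize_parent(name: str) -> str:
    """Capitalize the first hyphen-separated token that is not a prefix."""
    tokens = name.split("-")
    for i, tok in enumerate(tokens):
        # Skip a known prefix, unless it is the last token of the name.
        if tok.lower() in PREFIXES and i < len(tokens) - 1:
            continue
        # Leave tokens that start with a digit or symbol unchanged.
        if tok[:1].isalpha():
            tokens[i] = tok[0].upper() + tok[1:]
        break
    return "-".join(tokens)


for raw in ["butanoic acid", "beta-linalool", "p-cymene", "n-hexanol", "hexan-1-ol"]:
    print(raw, "->", capitalize_parent(raw))
```

Names that start with a locant, such as 1,3-di-p-coumaroylglycerol, are left unchanged by this sketch.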

Hello all,

I've been absent from the project for a while, sorry. I looked at the table and I can quickly make a script that:

  • Looks up the PubChem CIDs in Wikidata via SPARQL and retrieves the associated QIDs

  • If QID does not exist on Wikidata, tag the entry in the WIKIDATA_id column as NOT FOUND.

This would only cover the already curated PubChem CIDs. Would that be useful?
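A rough sketch of such a script is shown below. It assumes the Wikidata property P662 (PubChem CID) and the public endpoint https://query.wikidata.org/sparql, which returns the standard SPARQL JSON results format when asked for JSON. The function names are hypothetical, and the HTTP request itself is left out so the sketch stays self-contained.

```python
def build_query(cids) -> str:
    """SPARQL query returning the Wikidata item for each PubChem CID."""
    values = " ".join(f'"{cid}"' for cid in cids)
    return (
        "SELECT ?item ?cid WHERE { "
        f"VALUES ?cid {{ {values} }} "
        "?item wdt:P662 ?cid . }"
    )


def qids_by_cid(cids, sparql_json: dict) -> dict:
    """Map every requested CID to its QID, or to NOT FOUND."""
    found = {}
    for binding in sparql_json["results"]["bindings"]:
        # Item values are entity URIs; the QID is the last path segment.
        qid = binding["item"]["value"].rsplit("/", 1)[-1]
        # Keep the first QID if several items share one CID.
        found.setdefault(binding["cid"]["value"], qid)
    return {str(cid): found.get(str(cid), "NOT FOUND") for cid in cids}


print(build_query(["325", "1549992"]))
```

The dictionary returned by `qids_by_cid` can be written straight into the WIKIDATA_id column, with NOT FOUND marking the CIDs that have no Wikidata item yet.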

Hi,
I have committed the sheet compound_multiset.tsv. Please go through it and make any necessary changes.

I would like to have a good conversation about framing the automated SPARQL query.

Sir,
Please go through the copy of compound_multiset.tsv

28 entries could not be retrieved from PubChem.

For example -


1,3-di-p-coumaroylglycerol  | 2 |  not found  |  not found  | not found  |  not found |  

2-acetyl-1,3-di-caffeoylglycerol  | 2 |  not found |  not found |  not found |  not found |  

2-acetyl-1,3-di-p-coumaroylglycerol  | 2 |  not found |  not found |  not found |  not found |  

2-acetyl-p-3-coumaroyl-1-feruloylglycerol  | 2 |  not found  | not found |  not found |  not found |  

2-acetylo-3-caffeoyl-1-feruloylglycerol  | 2 |  not found  | not found  |  not found  |  not found |  



p-coumaric acid benzyl ester  | 2 |  (e)-p-coumaric acid  |  Q99374  |  not found  |  not found

92 entries could not be retrieved from Wikidata.

For example -


p-cymen-7-ol  | 2 |  not found  |  not found  |  325  |  4-isopropylbenzyl alcohol |  


kaurene  | 2 |  not found  |  not found  |  520687  |  kaurene

khusinol  | 2 |  not found |  not found |  91746535  | khusinol

During batch retrieval of compound CIDs from the PubChem identifier exchange services, around 100 compounds (mostly the last 100 entries) were left without a CID. Those were retrieved manually.

For example -

p-cymenene
trans-gamma-bisabolene
trans-geraniol
trans-linalool oxide (furanoid)
trans-muurola-3,5-diene
trans-pinocamphone
trans-sabinol

Many of the Wikidata lookups matched on a compound synonym rather than on the compound name itself.

For example -


calarene |  2  |  beta-Gurjunene  |  Q27154913  |  28481  |  beta-gurjunene


bisabolol  | 2 |  levomenol  | Q179896  |  1549992  |  bisabolol

Some Wikidata lookups returned a specific isomer rather than the compound as named.

For example -


beta-chamigrene  | 2 |  (-)-beta-chamigrene  | Q27108622  |  442353  |  (-)-beta-chamigrene

beta-citronellol  | 2 |  (+/-)-.beta.-citronellol  |  Q27122080  |  8842  | citronellol


(-)-caryophyllene oxide  | 2 |  .beta.-caryophyllene oxide  | Q27136294  | 1742210  | caryophyllene oxide


selina-3,7(11)-diene  | 3 |  .alpha.-selinene  | Q7448480  | 10726905  | 7-epi-alpha-selinene

Hello,

I have made a pull request (#69) with the Wikidata matches for PubChemIDs in the compound_multisetCopy.tsv table.

I added 20 new QIDs.

Two notes:
1 - Some PubChem CIDs are duplicated in the table. Ex: cis-calamenene (6429077) and calamenene (also 6429077). trans-calamenene is 6429022. I have not changed anything regarding the duplications. They are a consequence of synonymy, as said above.

2 - I was going to auto add the missing compounds to Wikidata, but I did not want to be too hasty. There is a nice Wikiproject focused on chemical IDs (https://www.wikidata.org/wiki/Wikidata:WikiProject_Chemistry/ChemID), I am planning to go through their docs as soon as possible.
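For note 1, the duplicated CIDs can be listed by grouping compound names on their CID. This is a small sketch using the three calamenene rows mentioned above; reading the real table from compound_multisetCopy.tsv is left out, and `duplicated_cids` is a hypothetical helper name.

```python
from collections import defaultdict


def duplicated_cids(rows):
    """Return {cid: [names]} for every CID shared by more than one name."""
    names_by_cid = defaultdict(list)
    for name, cid in rows:
        names_by_cid[cid].append(name)
    return {cid: names for cid, names in names_by_cid.items() if len(names) > 1}


rows = [
    ("cis-calamenene", "6429077"),
    ("calamenene", "6429077"),
    ("trans-calamenene", "6429022"),
]
print(duplicated_cids(rows))
```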

The code is also added to a folder. As this is my first contribution in a while, I do not know, is there something I should have done differently?