Lookup unknown compounds in wikidata
petermr opened this issue · 21 comments
compounds in oil186 not in dictionary
https://github.com/petermr/CEVOpen/blob/master/searches/oil186/__tables/notFound.txt
is a list of "new" compounds in oil186 papers.
create a 2-column table of these names and the wikidata identifiers where found.
(do not try to look up the explicit chemical names such as
tricyclo [4.4.0.0(2,7)] dec-3-ene-3-methanol, 1-methyl-8-(1-methylethyl)-)
Sir, please go through the file for wikidata id - notFoundCompWIKIDATAPubChem.tsv.
Column description:
- compound - not found compounds.
- WIKIDATA_query_id - WIKIDATA query string with wikidata id.
- notes - notes for found entries.
Sir, please review the sheet for WIKIDATA lookup - notFoundCompWIKIDATAPubChem.tsv
Please form a template for PubChem lookup of compounds. Which columns should be included in the sheet?
We have two available services:
- Identifier exchange services will provide CIDs OR InChIs OR InChIKeys OR SMILES OR Synonyms.
- Download services will use the retrieved CIDs and provide IUPAC name AND Synonyms AND InChIKey etc.
PubChem lookup will be a two-step process: first retrieve CIDs from the identifier exchange services, then look up each compound by its retrieved CID through the download services to get additional information about the compound.
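The two-step lookup described above can be sketched in code. This is only an illustration and makes one substitution: instead of the identifier exchange and download web services named in the comment, it uses PubChem's PUG REST interface, which offers the same name-to-CID and CID-to-properties steps over plain URLs. The function names (`cid_url`, `property_url`, `fetch_json`, `lookup`) are hypothetical and not part of the project.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base URL of PubChem PUG REST (used here instead of the web-form services).
PUG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_url(name):
    """Step 1: URL that resolves a compound name to PubChem CIDs."""
    return f"{PUG}/compound/name/{quote(name, safe='')}/cids/JSON"


def property_url(cids, props=("IUPACName", "InChIKey")):
    """Step 2: URL that fetches properties for already-retrieved CIDs."""
    cid_list = ",".join(str(c) for c in cids)
    return f"{PUG}/compound/cid/{cid_list}/property/{','.join(props)}/JSON"


def fetch_json(url):
    """Perform the HTTP request; raises urllib.error.HTTPError when
    PubChem has no match for the name (HTTP 404)."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def lookup(name):
    """Run both steps for one compound name and return its property rows."""
    cids = fetch_json(cid_url(name))["IdentifierList"]["CID"]
    return fetch_json(property_url(cids))["PropertyTable"]["Properties"]
```

Calling `lookup("thymol")` would perform both requests; names that PubChem cannot resolve raise an error and could be recorded as "not found" in the sheet.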
Sir, revised sheet for WIKIDATA identifier addition - notFoundCompWIKIDATAPubChem.tsv.
Sir, revised sheet for WIKIDATA and PubChem lookup- notFoundCompWIKIDATAPubChem.tsv
I have cleaned compound names to support PubChem lookup.
**Replace Greek letters with their spelled-out names.** e.g. α -> alpha, β -> beta, γ -> gamma, δ -> delta etc.
Compound Example -
α-cedrene -> alpha-cedrene (PubChem CID - 6431015)
γ-eudesmol -> gamma-eudesmol (PubChem CID - 6432005)
β-gurjunene -> beta-gurjunene (PubChem CID - 6450812)
**Isomeric notations made into capital letters.** e.g (e,e) -> (E,E) ; (2e, 6z) -> (2E,6Z) etc.
Compound Example -
(z)-α-santalol -> (Z)-alpha-santalol (PubChem CID - 11085337)
(e)-2-isopropyl-5-methylphenyl 2-methylbut-2-enoate -> (E)-2-Isopropyl-5-methylphenyl 2-methylbut-2-enoate (PubChem CID - 91698167)
(e)-β-ocimene -> (E)-beta-ocimene (PubChem CID - 5281553)
**Proper hyphen notation.** (–) -> (-)
Compound Example -
(−)-spathulenol -> (-)-spathulenol
**Trimming extra white spaces**
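The four cleaning rules above could be combined into one small function. The following is a minimal sketch, not the script actually used for the sheet: the name `clean_name`, the Greek-letter table (limited to the four letters listed) and the set of dash characters treated as hyphens are all assumptions.

```python
import re

# Greek letters named in the cleaning rules; extend as needed.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta"}

# Dash-like characters (Unicode hyphens, en dash, minus sign) mapped to "-".
DASHES = "\u2010\u2011\u2012\u2013\u2212"

# Parenthesised E/Z descriptors such as (e), (e,e) or (2e, 6z).
ISOMER = re.compile(r"\(\s*\d*[ezEZ](?:\s*,\s*\d*[ezEZ])*\s*\)")


def clean_name(name):
    """Normalise a compound name for PubChem lookup."""
    # 1. Spell out Greek letters.
    for greek, latin in GREEK.items():
        name = name.replace(greek, latin)
    # 2. Replace dash variants with a plain ASCII hyphen.
    name = name.translate({ord(ch): "-" for ch in DASHES})
    # 3. Upper-case E/Z descriptors and drop spaces inside them.
    name = ISOMER.sub(lambda m: re.sub(r"\s+", "", m.group(0)).upper(), name)
    # 4. Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", name).strip()


print(clean_name("(z)-α-santalol"))   # (Z)-alpha-santalol
print(clean_name("(−)-spathulenol"))  # (-)-spathulenol
```

The descriptor pattern only touches parenthesised groups made up entirely of E/Z labels, so other parenthesised parts of a name are left unchanged.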
Extracted count of records - 465.
Yes sir. The next course of action is to normalize the extracted compound names.
Please suggest how to automate the identification of compound synonyms.
OK sir.
Sir, please give an example of a dictionary entry containing a PubChem CID (instead of a Wikidata entry). Is it like `<entry term="thymol" pubchem_cid="6989" />`?
Sir, please go through the updated sheet for WIKIDATA and PubChem lookup - notFoundCompWIKIDATAPubChem.tsv
Added column for cleaned compound names - Cleaned_cmpnames

| Compound | Cleaned_cmpnames |
| --- | --- |
| butanoic acid | Butanoic acid |
| β-linalool | beta-Linalool |
| p-cymene | p-Cymene |
| n-hexanol | n-Hexanol |
| hexan-1-ol | Hexan-1-ol |
Hello all,
I've been absent from the project for a while, sorry. I looked at the table and I can quickly make a script that:
- Looks up the PubChem CIDs in Wikidata via SPARQL and retrieves the associated QIDs.
- If a QID does not exist on Wikidata, tags the entry in the WIKIDATA_id column as NOT FOUND.

This would only cover the already curated PubChem IDs. Would that be useful?
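A script along those lines might look like the sketch below. It is an illustration under stated assumptions, not the code from the later pull request: it relies on Wikidata property P662 (PubChem CID) and the public endpoint `https://query.wikidata.org/sparql`, while the helper names (`build_query`, `run_query`, `map_cids`) and the User-Agent string are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(cids):
    """Build one SPARQL query matching many PubChem CIDs via P662."""
    values = " ".join(f'"{c}"' for c in cids)
    return (
        "SELECT ?cid ?item WHERE { "
        f"VALUES ?cid {{ {values} }} "
        "?item wdt:P662 ?cid . }"
    )


def run_query(query):
    """Send the query to the Wikidata endpoint and return parsed JSON."""
    url = ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
    # Hypothetical User-Agent; Wikimedia asks clients to identify themselves.
    req = Request(url, headers={"User-Agent": "cid-to-qid-sketch/0.1 (example)"})
    with urlopen(req, timeout=60) as resp:
        return json.load(resp)


def map_cids(cids, result):
    """Map every CID to its QID, or to 'NOT FOUND' when Wikidata has none."""
    found = {}
    for binding in result["results"]["bindings"]:
        qid = binding["item"]["value"].rsplit("/", 1)[-1]
        # Keep the first QID if several items share one CID.
        found.setdefault(binding["cid"]["value"], qid)
    return {str(c): found.get(str(c), "NOT FOUND") for c in cids}
```

Usage would be `map_cids(cids, run_query(build_query(cids)))`, with the resulting values written into the WIKIDATA_id column. For a long CID list, sending the query in batches may be safer than one very large request.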
Hi,
I have committed the sheet compound_multiset.tsv. Please go through it and make any necessary changes.
I would like to have a good discussion about framing the automated SPARQL query.
Sir,
Please go through the copy of compound_multiset.tsv.
28 entries were not retrieved from PubChem.
For example -
1,3-di-p-coumaroylglycerol | 2 | not found | not found | not found | not found |
2-acetyl-1,3-di-caffeoylglycerol | 2 | not found | not found | not found | not found |
2-acetyl-1,3-di-p-coumaroylglycerol | 2 | not found | not found | not found | not found |
2-acetyl-p-3-coumaroyl-1-feruloylglycerol | 2 | not found | not found | not found | not found |
2-acetylo-3-caffeoyl-1-feruloylglycerol | 2 | not found | not found | not found | not found |
p-coumaric acid benzyl ester | 2 | (e)-p-coumaric acid | Q99374 | not found | not found
92 entries were not retrieved from WIKIDATA.
For example -
p-cymen-7-ol | 2 | not found | not found | 325 | 4-isopropylbenzyl alcohol |
kaurene | 2 | not found | not found | 520687 | kaurene
khusinol | 2 | not found | not found | 91746535 | khusinol
During batch retrieval of compound CIDs from the PubChem identifier exchange service, around 100 compound CIDs were left unretrieved (mostly the last 100 entries). Those were retrieved manually.
For example -
p-cymenene
trans-gamma-bisabolene
trans-geraniol
trans-linalool oxide (furanoid)
trans-muurola-3,5-diene
trans-pinocamphone
trans-sabinol
Many of the WIKIDATA lookups matched on compound synonyms (not on the direct compound name).
For example -
calarene | 2 | beta-Gurjunene | Q27154913 | 28481 | beta-gurjunene
bisabolol | 2 | levomenol | Q179896 | 1549992 | bisabolol
Some WIKIDATA lookups returned specific isomers.
For example -
beta-chamigrene | 2 | (-)-beta-chamigrene | Q27108622 | 442353 | (-)-beta-chamigrene
beta-citronellol | 2 | (+/-)-.beta.-citronellol | Q27122080 | 8842 | citronellol
(-)-caryophyllene oxide | 2 | .beta.-caryophyllene oxide | Q27136294 | 1742210 | caryophyllene oxide
selina-3,7(11)-diene | 3 | .alpha.-selinene | Q7448480 | 10726905 | 7-epi-alpha-selinene
Hello,
I have made a pull request (#69) with the Wikidata matches for PubChemIDs in the compound_multisetCopy.tsv table.
I added 20 new QIDs.
Two notes:
1 - Some PubChem CIDs are duplicated in the table. Ex: cis-calamenene (6429077) and calamenene (also 6429077). trans-calamenene is 6429022. I have not changed anything regarding the duplications. They are a consequence of synonymy, as said above.
2 - I was going to automatically add the missing compounds to Wikidata, but I did not want to be too hasty. There is a nice WikiProject focused on chemical IDs (https://www.wikidata.org/wiki/Wikidata:WikiProject_Chemistry/ChemID), and I am planning to go through their docs as soon as possible.
The code is also added to a folder. As this is my first contribution in a while, is there anything I should have done differently?
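The duplicated CIDs mentioned in note 1 can be listed automatically. The sketch below groups compound names by CID and reports every CID that appears more than once. The column names `compound` and `pubchem_cid` are assumptions (the real headers of the TSV may differ), and the sample rows are the calamenene entries quoted above.

```python
import csv
import io
from collections import defaultdict


def duplicated_cids(rows, name_col="compound", cid_col="pubchem_cid"):
    """Return {cid: [names]} for every CID shared by two or more rows."""
    by_cid = defaultdict(list)
    for row in rows:
        cid = row[cid_col].strip()
        # Skip rows with no CID or with the 'not found' placeholder.
        if cid and cid.lower() != "not found":
            by_cid[cid].append(row[name_col])
    return {cid: names for cid, names in by_cid.items() if len(names) > 1}


# Small in-memory sample standing in for the real TSV file.
sample = (
    "compound\tpubchem_cid\n"
    "cis-calamenene\t6429077\n"
    "calamenene\t6429077\n"
    "trans-calamenene\t6429022\n"
)
rows = csv.DictReader(io.StringIO(sample), delimiter="\t")
dupes = duplicated_cids(rows)
print(dupes)  # {'6429077': ['cis-calamenene', 'calamenene']}
```

For the real table, `io.StringIO(sample)` would be replaced by an open file handle on the TSV.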