doi.org-10.1101-2020.04.03.023846-with-smiles.csv is generated by the populate-smiles.py script. The script adds the SMILES, ChEMBL IDs, ChEMBL compound names, and ChEMBL max_phase (the highest clinical trial phase achieved) for the drug repurposing screen reported in https://www.biorxiv.org/content/10.1101/2020.04.03.023846v1. The supplemental information from the original paper only provides an Excel file with compound names and an image of each structure, so it is not suitable as a computational dataset.

populate-smiles.py uses the compound name to query ChEMBL. If the name can't be found, it tries harder, falling back to the NIH resolver server if it has to. A postgres instance of ChEMBL is recommended; the sqlite version of ChEMBL is extremely slow in comparison. If multiple forms are found in ChEMBL, the script checks that they are just separate salt forms or isomers and merges them together, so multiple ChEMBL records may be linked to a single compound.

This was able to resolve 1504 of the 1520 compounds listed in the file. 14 of the 16 missing compounds could indeed be resolved by the NIH resolver, but still couldn't be found in ChEMBL. These can be found in unknown-to-chembl.csv. Loosening the match criteria may find a way to get these to match, but I would rather get this data out. They also mostly look like awful non-drug-like compounds anyway.

Note, there are two cases where the name wasn't unique in the original Excel file:

- Loracarbef: different stereoisomers are drawn in the Excel file, but ChEMBL only has a single stereoisomer available (in two salt forms), so this is potentially a ChEMBL problem?
- Allopurinol: a different tautomer is drawn. Is this actually stable?

The failures were manually reviewed once, and names that could be corrected with a manual Google search were added to the RENAME_TABLE in the code so that the script searches for the correct name in ChEMBL.
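As a rough illustration of the name lookup described above, here is a minimal sketch. It is not the code in populate-smiles.py. It assumes a toy in-memory database that borrows a few table and column names from the ChEMBL schema (molecule_dictionary, molecule_synonyms), a made-up RENAME_TABLE entry, and a simple "preferred name first, then synonyms" order. The NIH resolver fallback and the salt-form merge are left out.

```python
import sqlite3

# Hypothetical entry; the real RENAME_TABLE in populate-smiles.py differs.
RENAME_TABLE = {"Acetaminophene": "ACETAMINOPHEN"}


def build_toy_chembl():
    """Create a tiny in-memory stand-in with a few ChEMBL-style columns."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE molecule_dictionary (
            molregno INTEGER PRIMARY KEY,
            chembl_id TEXT,
            pref_name TEXT,
            max_phase INTEGER);
        CREATE TABLE molecule_synonyms (molregno INTEGER, synonyms TEXT);
        INSERT INTO molecule_dictionary VALUES (1, 'CHEMBL25', 'ASPIRIN', 4);
        INSERT INTO molecule_dictionary VALUES (2, 'CHEMBL112', 'ACETAMINOPHEN', 4);
        INSERT INTO molecule_synonyms VALUES (1, 'Acetylsalicylic acid');
        INSERT INTO molecule_synonyms VALUES (2, 'Paracetamol');
        """
    )
    return conn


def lookup(conn, name):
    """Return (chembl_id, pref_name, max_phase) rows for a compound name."""
    # Apply manual corrections first, then try the preferred name.
    name = RENAME_TABLE.get(name, name)
    rows = conn.execute(
        "SELECT chembl_id, pref_name, max_phase FROM molecule_dictionary "
        "WHERE UPPER(pref_name) = UPPER(?)",
        (name,),
    ).fetchall()
    if not rows:
        # Fall back to a case-insensitive synonym match.
        rows = conn.execute(
            "SELECT DISTINCT d.chembl_id, d.pref_name, d.max_phase "
            "FROM molecule_synonyms s "
            "JOIN molecule_dictionary d ON d.molregno = s.molregno "
            "WHERE UPPER(s.synonyms) = UPPER(?)",
            (name,),
        ).fetchall()
    return rows


conn = build_toy_chembl()
for query in ("aspirin", "Paracetamol", "Acetaminophene", "not a drug"):
    print(query, lookup(conn, query))
```

An empty result here corresponds to the cases where the real script would move on to its more permissive strategies.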
Note, 24 of the lines in the resulting doi.org-10.1101-2020.04.03.023846-with-smiles.csv don't have a ChEMBL compound name, just a ChEMBL identifier. These all have max_phase 0, meaning ChEMBL doesn't recognize them as approved or investigational drugs.

Note, ChEMBL assigns max_phase 0 to a fair number of the compounds, meaning that according to ChEMBL they have not entered clinical trials yet. Prestwick, however, describes the set as "A unique collection of 1520 off-patent small molecules, 99% approved drugs (FDA, EMA and other agencies)", so I would appreciate an explanation of the discrepancy between ChEMBL and Prestwick. The breakdown of ChEMBL max_phases in this dataset is the following:

    CHEMBL_MAX_PHASE
    0       332
    0,3       4   (multiple ChEMBL records identified from the compound name, different isomers or salt forms)
    0,4      33
    1         7
    2        19
    3        61
    4      1048

The update-with-inhibition-data.py script can be used to cross-reference computational predictions back to this experimental data. Run it in the following way:

    $ python update-with-inhibition-data.py computational-predictions.csv 'My_Awesome_Score'

This will use the SMILES columns in each of the CSV files to match the molecules against each other. The match is done in a chemically reasonable way using the InChI string, which provides a useful default standardization scheme: it normalizes tautomers and protonation states, details that usually have to be specified for computational predictions but are irrelevant in the experimental data. If an exact InChI match isn't found, a fallback search is performed using only the non-stereo layers of the InChI, so that stereochemistry differences are disregarded as well. A lot of 3D prediction tools need stereochemistry defined, and this information is retained through the SMILES strings; however, the original experimental data may have been measured against a racemate.

The tool will output a .png file of the scatter plot between the 'Inhibition index' and whatever column is specified on the command line as the computational score.
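The two-stage InChI match can be sketched as follows. This is an illustration, not the code in update-with-inhibition-data.py. It assumes the InChI strings have already been generated from the SMILES columns by a cheminformatics toolkit, and it treats the b, t, m and s layers of a standard InChI as the stereo layers. The function names and the example values are hypothetical.

```python
# Layer prefixes that carry stereochemistry in a standard InChI:
# /b (double bond), /t (tetrahedral), /m (inversion flag), /s (stereo type).
STEREO_PREFIXES = ("b", "t", "m", "s")


def strip_stereo_layers(inchi):
    """Return the InChI with its stereo layers removed."""
    head, *layers = inchi.split("/")
    kept = [layer for layer in layers if not layer.startswith(STEREO_PREFIXES)]
    return "/".join([head] + kept)


def match_predictions(experimental, predicted):
    """Match predicted molecules to experimental ones by InChI.

    Both arguments map an InChI string to a value. Each result is
    (predicted_inchi, score, experimental_value, kind), where kind is
    "exact", "non-stereo", or None when nothing matched.
    """
    flat_index = {}
    for inchi, value in experimental.items():
        # Simplification: keep the first entry if several stereoisomers collide.
        flat_index.setdefault(strip_stereo_layers(inchi), value)

    results = []
    for inchi, score in predicted.items():
        if inchi in experimental:
            results.append((inchi, score, experimental[inchi], "exact"))
            continue
        flat = strip_stereo_layers(inchi)
        if flat in flat_index:
            results.append((inchi, score, flat_index[flat], "non-stereo"))
        else:
            results.append((inchi, score, None, None))
    return results


L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
GLYCINE = "InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)"

experimental = {L_ALANINE: 0.8}  # made-up inhibition value
predicted = {L_ALANINE: -7.1, D_ALANINE: -6.5, GLYCINE: -3.0}  # made-up scores
for row in match_predictions(experimental, predicted):
    print(row)
```

In this toy run the L-alanine prediction matches exactly, the D-alanine prediction only matches once the stereo layers are dropped, and glycine is left unmatched.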
isayev/covid-in-vitro-screening-data
Experimental screen of COVID inhibitors with SMILES strings and ChEMBL ids
Python