covid-in-vitro-screening-data: A Python repository from isayev

doi.org-10.1101-2020.04.03.023846-with-smiles.csv is generated by the
populate-smiles.py script. The script adds the SMILES, ChEMBL IDs,
ChEMBL compound names, and ChEMBL max_phase (the highest clinical
trial phase achieved) for the drug repurposing screen reported in
https://www.biorxiv.org/content/10.1101/2020.04.03.023846v1

The supplemental information from the original paper only provides an
Excel file with compound names and an image of each structure, so it
is not suitable as a computational dataset.

populate-smiles.py uses the compound name to query ChEMBL. If the name
can't be found, it falls back to other lookups, and uses the NIH
resolver server as a last resort. A PostgreSQL instance of ChEMBL is
recommended; the SQLite version of ChEMBL is extremely slow in
comparison.
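
The lookup-with-fallback idea can be sketched as a chain of resolvers
tried in order. This is only an illustration, not the code in
populate-smiles.py: the resolver functions, the TOY_CHEMBL table and
the stubbed external lookup are hypothetical stand-ins for the real
ChEMBL queries and the NIH resolver call.

```python
# Hypothetical in-memory stand-in for a ChEMBL name -> ID lookup.
TOY_CHEMBL = {"ASPIRIN": "CHEMBL25"}


def exact_lookup(name):
    """Look the name up exactly as given."""
    return TOY_CHEMBL.get(name)


def normalized_lookup(name):
    """Retry after trimming whitespace and upper-casing the name."""
    return TOY_CHEMBL.get(name.strip().upper())


def external_lookup(name):
    """Placeholder for a last-resort external service; always misses here."""
    return None


# Resolvers are tried in this order; the first hit wins.
RESOLVERS = [
    ("exact", exact_lookup),
    ("normalized", normalized_lookup),
    ("external", external_lookup),
]


def resolve_name(name, resolvers=RESOLVERS):
    """Return (label, result) for the first resolver that finds the name."""
    for label, resolver in resolvers:
        result = resolver(name)
        if result:
            return label, result
    return None, None


print(resolve_name(" aspirin "))
```

Returning the label of the resolver that succeeded makes it easy to
review afterwards which compounds needed a looser match.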

If multiple forms are found in ChEMBL, the script checks that they are
just separate salt forms or isomers and merges them together, so
multiple ChEMBL records may be linked to a single compound.

This was able to resolve 1504 of the 1520 compounds listed in the file.

14 of the 16 missing compounds could in fact be resolved by the NIH
resolver, but still couldn't be found in ChEMBL. These are listed in
unknown-to-chembl.csv. Loosening the match criteria may find a way to
get these to match, but I would rather get this data out. They also
mostly look like awful non-drug-like compounds anyway.

Note, there are two cases where the name wasn't unique in the original Excel file:

- Loracarbef: different stereoisomers are drawn in the Excel file, but
  ChEMBL only has a single stereoisomer available, in two salt forms,
  so potentially a ChEMBL problem?

- Allopurinol: a different tautomer is drawn. Is this actually stable?

The failures were manually reviewed once, and things that could be
captured by a manual Google search were updated in the RENAME_TABLE in
the code to get it to search for the correct thing in ChEMBL. 

Note, 24 of the lines in the resulting
doi.org-10.1101-2020.04.03.023846-with-smiles.csv don't have a ChEMBL
compound name, just a ChEMBL identifier. These all have max_phase 0,
meaning ChEMBL doesn't recognize these as approved or investigational
drugs.

Note, ChEMBL assigns max_phase 0 to a fair number of the compounds,
meaning they have not entered clinical trials yet. Prestwick describes
the set as "A unique collection of 1520 off-patent small molecules, 99%
approved drugs (FDA, EMA and other agencies)", so I would appreciate an
explanation of the discrepancy between ChEMBL and Prestwick.

The breakdown of ChEMBL max_phases in this dataset is the following:

CHEMBL_MAX_PHASE   COUNT
0                    332
0,3                    4
0,4                   33
1                      7
2                     19
3                     61
4                   1048

A comma-separated value such as 0,3 means multiple ChEMBL records were
identified from the compound name (different isomers or salt forms).
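
A breakdown like the one above can be recomputed from the CSV with the
standard library alone. This is a minimal sketch that assumes the
column is named CHEMBL_MAX_PHASE, as in the table header; check the
header of the actual file first. The in-memory sample rows are made up
for illustration.

```python
import csv
import io
from collections import Counter


def count_max_phases(csv_file, column="CHEMBL_MAX_PHASE"):
    """Count how often each value of `column` occurs in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# Made-up sample; merged records such as "0,4" are quoted in CSV.
sample = io.StringIO('NAME,CHEMBL_MAX_PHASE\nA,4\nB,"0,4"\nC,4\n')
counts = count_max_phases(sample)
for phase, n in sorted(counts.items()):
    print(phase, n)

# For the real file, something like:
# with open("doi.org-10.1101-2020.04.03.023846-with-smiles.csv", newline="") as fh:
#     print(count_max_phases(fh))
```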



The update-with-inhibition-data.py script can be used to
cross-reference computational predictions back to this experimental
data. Run it in the following way:

$ python update-with-inhibition-data.py computational-predictions.csv 'My_Awesome_Score'

This uses the SMILES columns in each of the CSV files to match the
molecules against each other. The match is done in a chemically
reasonable way using the InChI string, which provides a useful default
standardization scheme for normalizing tautomers and protonation
states; these usually have to be specified for computational
predictions but are irrelevant in the experimental data. If an exact
InChI match isn't found, a fallback search is performed using only the
non-stereo layers of the InChI, so that stereochemistry differences
are disregarded as well. A lot of 3D prediction tools need
stereochemistry defined, and this information is retained in the
SMILES strings; however, the original experimental data may have been
measured against a racemate. The tool writes a .png file of the scatter
plot between the 'Inhibition index' and whatever column is specified
on the command line as the computational score.
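
The two-stage match (exact InChI first, then non-stereo layers only)
can be sketched with plain string handling. This is an illustration
under assumptions, not the script's actual code: it assumes the InChI
strings have already been generated from the SMILES (for example with
a cheminformatics toolkit), and it approximates "non-stereo layers" by
dropping every /b, /t, /m and /s layer. The function names and sample
identifiers are hypothetical.

```python
# InChI layers that carry stereochemistry: double bond (b),
# tetrahedral (t, m) and stereo type (s).
STEREO_LAYER_PREFIXES = ("b", "t", "m", "s")


def strip_stereo_layers(inchi):
    """Return the InChI with its stereo layers removed."""
    parts = inchi.split("/")
    # parts[0] is the version prefix and parts[1] the formula; keep both.
    head, layers = parts[:2], parts[2:]
    kept = [p for p in layers if not p.startswith(STEREO_LAYER_PREFIXES)]
    return "/".join(head + kept)


def match_inchis(predicted, experimental):
    """Match two {identifier: InChI} dicts, exact first, then non-stereo.

    Returns {predicted id: (experimental id, "exact" or "non-stereo")}.
    """
    exact = {inchi: key for key, inchi in experimental.items()}
    flat = {strip_stereo_layers(i): key for key, i in experimental.items()}
    matches = {}
    for key, inchi in predicted.items():
        if inchi in exact:
            matches[key] = (exact[inchi], "exact")
        elif strip_stereo_layers(inchi) in flat:
            matches[key] = (flat[strip_stereo_layers(inchi)], "non-stereo")
    return matches


experimental = {
    # Alanine without stereo layers, as for a racemate.
    "exp-ala": "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)",
    "exp-gly": "InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)",
}
predicted = {
    # Alanine with a defined stereocentre, as a 3D tool would need.
    "pred-ala": "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1",
    "pred-gly": "InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)",
}
matches = match_inchis(predicted, experimental)
print(matches)
```

Recording whether a pair matched exactly or only after dropping
stereochemistry keeps the looser matches easy to inspect separately.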