compound-annotator

Credits: Lewis Mervin for the original source code.

Setup

Install Python
Install Poetry
Install Poetry Environment: poetry install

For Linux, see

python-poetry/poetry#1917 (comment) if installing six fails
https://stackoverflow.com/a/75435100 if you get "does not contain any element" warning when running poetry install

Run

Create annotation file

On a VM with >40G disk space, download ChEMBL SQLite database (4.2G compressed, 23G uncompressed)

wget https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_31/chembl_31_sqlite.tar.gz
tar -xvzf chembl_31_sqlite.tar.gz
tree chembl_31
# chembl_31
# └── chembl_31_sqlite
#     ├── INSTALL_sqlite
#     └── chembl_31.db

Run a SQL query to extract ChEMBL annotation

sqlite3 -header -csv chembl_31/chembl_31_sqlite/chembl_31.db < sql/extract_chembl_annotation.sql | gzip > data/chembl_annotation.csv.gz
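For readers who prefer to stay inside the Poetry environment, the same export can be sketched with Python's standard library. This is a minimal sketch, not the project's own tooling: the function name export_query_to_csv_gz is hypothetical, and it assumes the SQL file holds a single SELECT statement, because sqlite3's execute() runs one statement at a time.

```python
import csv
import gzip
import sqlite3


def export_query_to_csv_gz(db_path, sql, out_path):
    """Run one SELECT statement and write its result, with a header row, to a gzipped CSV.

    Hypothetical helper (not part of this repository). Returns the number of data rows.
    """
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(sql)
        # Column names of the result set, used as the CSV header
        header = [col[0] for col in cursor.description]
        n_rows = 0
        with gzip.open(out_path, "wt", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            # Stream rows one at a time so the full result never sits in memory
            for row in cursor:
                writer.writerow(row)
                n_rows += 1
    finally:
        conn.close()
    return n_rows
```

Assuming the paths used above, a call could look like export_query_to_csv_gz("chembl_31/chembl_31_sqlite/chembl_31.db", open("sql/extract_chembl_annotation.sql").read(), "data/chembl_annotation.csv.gz").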

View the first 5 lines of the annotation file (the header plus 4 rows)

python csv2md.py <(gzcat data/chembl_annotation.csv.gz|head -n 5)

assay_chembl_id	target_chembl_id	assay_type	molecule_chembl_id	pchembl_value	confidence_score	standard_inchi_key
1714633	CHEMBL3987582	B	CHEMBL4107559	6.07	7	UVVXRMZCPKQLAO-OAHLLOKOSA-N
1714649	CHEMBL3987582	B	CHEMBL4107559	5.86	7	UVVXRMZCPKQLAO-OAHLLOKOSA-N
1714633	CHEMBL3987582	B	CHEMBL4108338	6.15	7	OZBMIGDQBBMIRA-CQSZACIVSA-N
1714649	CHEMBL3987582	B	CHEMBL4108338	5.84	7	OZBMIGDQBBMIRA-CQSZACIVSA-N

Count the number of lines in the annotation file (the count includes the header line)

gzcat data/chembl_annotation.csv.gz | wc -l
# 1185184

Count the number of unique values of each column in the annotation file

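# Print the number of distinct values in each named column of a gzipped CSV.
function count_unique_values() {
    local data_file=$1
    local colnames=$2   # space-separated column names
    for colname in ${colnames}; do
        echo -n "${colname}:"
        # Drop the header, then count distinct values; tr squeezes the padding that wc adds on macOS
        gzcat "${data_file}" | csvcut -c "${colname}" | tail -n +2 | sort -u | wc -l | tr -s " "
    done
}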

data_file=data/chembl_annotation.csv.gz
colnames="assay_chembl_id target_chembl_id assay_type molecule_chembl_id standard_inchi_key pref_name"
count_unique_values ${data_file} "${colnames}"

assay_chembl_id: 99298
target_chembl_id: 3076
assay_type: 2
molecule_chembl_id: 556272
standard_inchi_key: 56272
pref_name: 6536
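The shell function above decompresses the file once per column. A single-pass alternative can be sketched in Python; this is an illustration under assumptions, not code from this repository. It reuses the name count_unique_values for readability and assumes every requested column exists in the header.

```python
import csv
import gzip


def count_unique_values(data_file, colnames):
    """Return {column: number of distinct values} for a gzipped CSV with a header row.

    Illustrative sketch: reads the file once and keeps one set of seen values per column.
    """
    seen = {name: set() for name in colnames}
    with gzip.open(data_file, "rt", newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            for name in colnames:
                seen[name].add(row[name])
    return {name: len(values) for name, values in seen.items()}
```

The result can be printed in the same "column: count" layout with a loop such as for name, n in count_unique_values(path, names).items(): print(f"{name}: {n}"). As in the shell version, an empty cell counts as one distinct value.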

Create filtered annotation file

Filter the annotation file to only include rows with standard_inchi_key that are present in the compound.csv.gz file

wget https://raw.githubusercontent.com/jump-cellpainting/datasets/0682dd2d52e4d68208ab4af3a0bd114ca557cb0e/metadata/compound.csv.gz
mv compound.csv.gz data/

gzcat data/compound.csv.gz | csvcut -c Metadata_InChIKey| tail -n +2 | sort | uniq > data/compound_inchi_key.txt
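The extraction of unique keys can also be sketched in Python for environments without csvkit. The helper name write_unique_column_values is hypothetical. Python's sorted() orders strings by code point, which may differ from the locale-dependent order of sort, so the two outputs can list the same keys in a different order.

```python
import csv
import gzip


def write_unique_column_values(csv_gz_path, column, out_path):
    """Write the sorted distinct values of one column, one per line.

    Hypothetical helper (not part of this repository). Returns the number of values written.
    Empty cells are kept as a single empty line, as sort | uniq would do.
    """
    with gzip.open(csv_gz_path, "rt", newline="", encoding="utf-8") as handle:
        values = {row[column] for row in csv.DictReader(handle)}
    with open(out_path, "w", encoding="utf-8") as out:
        for value in sorted(values):
            out.write(value + "\n")
    return len(values)
```

With the files above, the call would be write_unique_column_values("data/compound.csv.gz", "Metadata_InChIKey", "data/compound_inchi_key.txt").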

Now find rows in data/chembl_annotation.csv.gz whose standard_inchi_key is present in data/compound_inchi_key.txt

csvgrep -c standard_inchi_key -f data/compound_inchi_key.txt <(gzcat data/chembl_annotation.csv.gz) | gzip > data/chembl_annotation_filtered.csv.gz
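The filter is an exact-match lookup: a row is kept when its standard_inchi_key equals one of the lines in the key file. A minimal Python sketch of that logic follows, assuming the whole key file fits in memory as a set. The function name filter_rows_by_keys is hypothetical.

```python
import csv
import gzip


def filter_rows_by_keys(in_path, out_path, column, keys_path):
    """Copy the rows of a gzipped CSV whose `column` value is listed in keys_path.

    Hypothetical helper (not part of this repository). Returns the number of rows kept.
    """
    # One key per line; a set gives constant-time membership tests
    with open(keys_path, encoding="utf-8") as handle:
        keys = {line.rstrip("\r\n") for line in handle}

    kept = 0
    with gzip.open(in_path, "rt", newline="", encoding="utf-8") as src, \
            gzip.open(out_path, "wt", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
        writer.writeheader()
        for row in reader:
            if row[column] in keys:
                writer.writerow(row)
                kept += 1
    return kept
```

The return value counts data rows only, so it should be one less than the wc -l figure for the same output file, which also counts the header.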

Count the number of lines in the filtered annotation file (the count includes the header line)

gzcat data/chembl_annotation_filtered.csv.gz | wc -l
# 44018

Count the number of unique values of each column in the filtered annotation file

data_file=data/chembl_annotation_filtered.csv.gz
colnames="assay_chembl_id target_chembl_id assay_type molecule_chembl_id standard_inchi_key pref_name"
count_unique_values ${data_file} "${colnames}"

assay_chembl_id: 18856
target_chembl_id: 1744
assay_type: 2
molecule_chembl_id: 4718
standard_inchi_key: 4718
pref_name: 1340

Here are all the commands in one place to create chembl_annotation_filtered.csv.gz from chembl_annotation.csv.gz and compound.csv.gz:

commit=0682dd2d52e4d68208ab4af3a0bd114ca557cb0e

wget https://raw.githubusercontent.com/jump-cellpainting/datasets/${commit}/metadata/compound.csv.gz

mv compound.csv.gz data/

gzcat data/compound.csv.gz | csvcut -c Metadata_InChIKey| tail -n +2 | sort | uniq > data/compound_inchi_key.txt

csvgrep -c standard_inchi_key -f data/compound_inchi_key.txt <(gzcat data/chembl_annotation.csv.gz) | gzip > data/chembl_annotation_filtered.csv.gz

Create mapping between `standard_inchi_key` and `chembl_id`

Run a SQL query to get the mapping between standard_inchi_key and chembl_id

sqlite3 -header -csv chembl_31/chembl_31_sqlite/chembl_31.db < sql/extract_chembl_inchikey_mapping.sql | gzip > data/inchikey_chembl.csv.gz

View the first 5 lines of the inchikey_chembl.csv.gz file (the header plus 4 rows)

python csv2md.py <(gzcat data/inchikey_chembl.csv.gz|head -n 5)

molecule_chembl_id	standard_inchi_key	pref_name
CHEMBL592894	AAAJHRMBUHXWLD-UHFFFAOYSA-N
CHEMBL268868	AAALVYBICLMAMA-UHFFFAOYSA-N	DAPH
CHEMBL1734241	AAAZRMGPBSWFDK-UHFFFAOYSA-N
CHEMBL3449946	AABSTWCOLWSFRA-UHFFFAOYSA-N

Count the number of lines in the inchikey_chembl.csv.gz file (the count includes the header line)

gzcat data/inchikey_chembl.csv.gz | wc -l
# 2304876

Now find rows in data/inchikey_chembl.csv.gz whose standard_inchi_key is present in data/compound_inchi_key.txt

csvgrep -c standard_inchi_key -f data/compound_inchi_key.txt <(gzcat data/inchikey_chembl.csv.gz) | gzip > data/inchikey_chembl_filtered.csv.gz

Count the number of unique values of each column in inchikey_chembl_filtered.csv.gz

data_file=data/inchikey_chembl_filtered.csv.gz
colnames="molecule_chembl_id standard_inchi_key pref_name"
count_unique_values ${data_file} "${colnames}"

molecule_chembl_id: 30072
standard_inchi_key: 30072
pref_name: 2508

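Count the number of unique InChIKeys in compound_inchi_key.txt, for comparison with the counts above

wc -l data/compound_inchi_key.txt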
# 116753
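Taken together, the counts reported in this README give a rough sense of coverage: about a quarter of the compound InChIKeys map to a ChEMBL ID, and about 4% have at least one annotation row. The arithmetic is sketched below. The helper coverage is hypothetical, and the figures are approximate because the key file may contain an empty line for compounds without an InChIKey.

```python
def coverage(matched, total):
    """Return the fraction of `total` items that have a match."""
    if total <= 0:
        raise ValueError("total must be positive")
    return matched / total


# Counts copied from the command outputs in this README
n_compound_keys = 116753  # lines in data/compound_inchi_key.txt
n_mapped_keys = 30072     # unique standard_inchi_key in inchikey_chembl_filtered.csv.gz
n_annotated_keys = 4718   # unique standard_inchi_key in chembl_annotation_filtered.csv.gz

print(f"with a ChEMBL ID: {coverage(n_mapped_keys, n_compound_keys):.1%}")
print(f"with an annotation: {coverage(n_annotated_keys, n_compound_keys):.1%}")
```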
