Build an SQLite Database from Pubchem

This library can be used to build an SQLite database (DB) containing all compounds in PubChem. A simple json-file is used to specify the layout of the SQLite DB allowing to controll which PubChem SD-tags should be included.

Features:

Construct a local copy of the PubChem compound database from publicly available sdf-files.
Local database is SQLite database
Database layout and the PubChem information to extract can be specified.

Installation

The library libraries only requirement is Python (tested on 3.5 - 3.9), no external libraries need to be installed. Simply run:

git clone https://github.com/bachi55/local_pubchem_db
cd local_pubchem_db
pip install .

Usage

(1) Download PubChem SDF-files

PubChem allows to download a snapshot of their current database (~75GB). You can use the following command to retrieve the sdf-files:

wget -e robots=off --recursive --no-parent https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/
mkdir -p pubchem/sdf
mv ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/*.sdf.gz pubchem/sdf
rm -r ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF

(2) Specifiy the DB Layout

You can specify the DB layout using a simple json-file (see also this example):

{
    "columns": {
        "InChI": {
            "SD_TAG": ["PUBCHEM_IUPAC_INCHI"],
            "DTYPE": "varchar",
            "NOT_NULL": true,
            "PRIMARY_KEY": true
        },
        "InChIKey": {
            "SD_TAG": ["PUBCHEM_IUPAC_INCHIKEY"],
            "DTYPE": "varchar",
            "NOT_NULL": true
        },
        "InChIKey_1": {
            "SD_TAG": ["PUBCHEM_IUPAC_INCHIKEY"],
            "DTYPE": "varchar",
            "NOT_NULL": true,
            "CREATE_LIKE": "lambda __x: __x.split('-')[0]"
        },     
        "xlogp3": {
            "SD_TAG": ["PUBCHEM_XLOGP3", "PUBCHEM_XLOGP3_AA"],
            "DTYPE": "real",
            "NOT_NULL": false
        },
        "cid": {
            "SD_TAG": ["PUBCHEM_COMPOUND_CID"],
            "DTYPE": "integer",
            "NOT_NULL": true
        }
    }
}

The json-file contains consists of nested dictionaries. The one containing the DB layout is accessed via columns. This dictionary contains a key for each column name in the compounds table and its properites:

Key	Description	Optional
SD_TAG	List of SD-Tags correponding to the column. For example `[PUBCHEM_IUPAC_INCHI]` refering to the InChI of a compound, or, `["PUBCHEM_XLOGP3", "PUBCHEM_XLOGP3_AA"]` to its predicted XLogP3 value (here more than one tag can occure in the sdf-files).	No
DTYPE	Type of the information behind the sd-tag. Must be a valid SQLite type.	No
NOT_NULL	If true, the column cannot contain null entries. If a compound does not have the requested property and the corresponding column should not be null, it will not be added to tha DB.	Yes
PRIMARY_KEY	If true, the column is used as primary key for the compounds table. Currently, only one column can be specified as primary key.	Yes
WITH_INDEX	If true, an index is created for the column. This allows faster queries with constraints on this column.	Yes
CREATE_LIKE	Value of the SD_TAG can be transformed using a Python lambda function defined as string. For example `"lambda __x: __x.split('-')[0]"` applied to the InChIKey tag can be used to only return the first InChIKey part. The lambda function is applied after the value has been converted into the target DTYPE.	Yes

(3) Build the Database

The DB is build using:

cd pubchem
mkdir db
# Create your db-layout
touch db/db_layout.json
# Edit it ... 
# vim db/db_layout.json
python /path/to/local_pubchem_db/build_pubchem_db.py pubchem --gzip --db_layout_fn=db/db_layout.json

Please note: PubChem contains many compound and the resulting SQLite file can be very large. Especially when the indices are created sqlite might require twice the memory of the DB for some time. Ensure your drive has enough memory. If you face the problem, that your systems temp-directory runs out of memory, when creating the indices, have a look here. You can temporarily change sqlite's temp-directory by setting SQLITE_TMPDIR:

SQLITE_TMPDIR=/my/large/disk/temp python /path/to/local_pubchem_db/build_pubchem_db.py pubchem --gzip --db_layout_fn=db/db_layout.json

Version History

0.3:

Add support for data-transformations of SD_TAG values using lambda functions

0.2:

Add flexibility for the DB-layout using a json-file
Add tests
Publish code as package
Change license to MIT

0.1:

Initial Version