This library can be used to build an SQLite database (DB) containing all compounds in PubChem. A simple json-file is used to specify the layout of the SQLite DB allowing to controll which PubChem SD-tags should be included.
- Construct a local copy of the PubChem compound database from publicly available sdf-files.
- Local database is SQLite database
- Database layout and the PubChem information to extract can be specified.
The library libraries only requirement is Python (tested on 3.5 - 3.9), no external libraries need to be installed. Simply run:
git clone https://github.com/bachi55/local_pubchem_db
cd local_pubchem_db
pip install .
PubChem allows to download a snapshot of their current database (~75GB). You can use the following command to retrieve the sdf-files:
wget -e robots=off --recursive --no-parent https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/
mkdir -p pubchem/sdf
mv ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/*.sdf.gz pubchem/sdf
rm -r ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF
You can specify the DB layout using a simple json-file (see also this example):
{
"columns": {
"InChI": {
"SD_TAG": ["PUBCHEM_IUPAC_INCHI"],
"DTYPE": "varchar",
"NOT_NULL": true,
"PRIMARY_KEY": true
},
"InChIKey": {
"SD_TAG": ["PUBCHEM_IUPAC_INCHIKEY"],
"DTYPE": "varchar",
"NOT_NULL": true
},
"InChIKey_1": {
"SD_TAG": ["PUBCHEM_IUPAC_INCHIKEY"],
"DTYPE": "varchar",
"NOT_NULL": true,
"CREATE_LIKE": "lambda __x: __x.split('-')[0]"
},
"xlogp3": {
"SD_TAG": ["PUBCHEM_XLOGP3", "PUBCHEM_XLOGP3_AA"],
"DTYPE": "real",
"NOT_NULL": false
},
"cid": {
"SD_TAG": ["PUBCHEM_COMPOUND_CID"],
"DTYPE": "integer",
"NOT_NULL": true
}
}
}
The json-file contains consists of nested dictionaries. The one containing the DB layout is accessed via columns
. This dictionary contains a key for each column name in the compounds table and its properites:
Key | Description | Optional |
---|---|---|
SD_TAG | List of SD-Tags correponding to the column. For example [PUBCHEM_IUPAC_INCHI] refering to the InChI of a compound, or, ["PUBCHEM_XLOGP3", "PUBCHEM_XLOGP3_AA"] to its predicted XLogP3 value (here more than one tag can occure in the sdf-files). |
No |
DTYPE | Type of the information behind the sd-tag. Must be a valid SQLite type. | No |
NOT_NULL | If true, the column cannot contain null entries. If a compound does not have the requested property and the corresponding column should not be null, it will not be added to tha DB. | Yes |
PRIMARY_KEY | If true, the column is used as primary key for the compounds table. Currently, only one column can be specified as primary key. | Yes |
WITH_INDEX | If true, an index is created for the column. This allows faster queries with constraints on this column. | Yes |
CREATE_LIKE | Value of the SD_TAG can be transformed using a Python lambda function defined as string. For example "lambda __x: __x.split('-')[0]" applied to the InChIKey tag can be used to only return the first InChIKey part. The lambda function is applied after the value has been converted into the target DTYPE. |
Yes |
The DB is build using:
cd pubchem
mkdir db
# Create your db-layout
touch db/db_layout.json
# Edit it ...
# vim db/db_layout.json
python /path/to/local_pubchem_db/build_pubchem_db.py pubchem --gzip --db_layout_fn=db/db_layout.json
Please note: PubChem contains many compound and the resulting SQLite file can be very large. Especially when the indices are created sqlite might require twice the memory of the DB for some time. Ensure your drive has enough memory. If you face the problem, that your systems temp-directory runs out of memory, when creating the indices, have a look here. You can temporarily change sqlite's temp-directory by setting SQLITE_TMPDIR
:
SQLITE_TMPDIR=/my/large/disk/temp python /path/to/local_pubchem_db/build_pubchem_db.py pubchem --gzip --db_layout_fn=db/db_layout.json
- Add support for data-transformations of SD_TAG values using lambda functions
- Add flexibility for the DB-layout using a json-file
- Add tests
- Publish code as package
- Change license to MIT
- Initial Version