
ML models to convert molecules to ESI mass spectra and maybe back again



Machine learning models, trained on GNPS data, that convert molecules to ESI mass spectra (and maybe back again in a future version). The following models are currently available:


| model | description | hidden unit dim | num layers |
|-------|-------------|-----------------|------------|
| mlp | MLP with residual blocks trained on 1024-bit Morgan fingerprints | 1024 | 6 |
| gcn | Simple GCN with DeepChem-like node features | 1024 | 3 |
| egnn | Equivariant GNN trained on (RDKit-optimized) 3D structures | 1024 | 2 |
| bert | MLP trained on representations from the (smaller) ChemBERTa SMILES model | 1024 | 6 |


You need to have torch and torch_geometric installed. I don't provide these as part of the dependencies, since installing torch_geometric depends heavily on your CUDA and torch setup. To install torch_geometric from scratch, follow their documentation; for example, you can install it with pip using their prebuilt wheels (adjust the torch version and CPU/CUDA tag in the URL to match your setup):

pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.9.0+cpu.html
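Since the wheel URL above encodes the torch version and the CPU/CUDA tag, it can help to derive it from the installed torch build instead of typing it by hand. The following is a minimal sketch, not part of molxspec: the helper names `pyg_wheel_url` and `pip_command` are hypothetical, and falling back to `cpu` when the version string has no `+tag` suffix is an assumption. Wheels may not be published for every torch version, so check that the resulting URL exists.

```python
from typing import List

# URL pattern taken from the pip command above.
PYG_WHEEL_INDEX = "https://data.pyg.org/whl/torch-{version}.html"

PYG_PACKAGES = [
    "torch-scatter",
    "torch-sparse",
    "torch-cluster",
    "torch-spline-conv",
    "torch-geometric",
]


def pyg_wheel_url(torch_version: str, default_tag: str = "cpu") -> str:
    """Build the wheel index URL for a version string such as '1.9.0+cu111'."""
    base, sep, tag = torch_version.partition("+")
    if not sep:
        # Assumption: a version without a '+tag' suffix is treated as a CPU build.
        tag = default_tag
    return PYG_WHEEL_INDEX.format(version=f"{base}+{tag}")


def pip_command(torch_version: str) -> List[str]:
    """Return the pip invocation as an argument list for the given torch version."""
    return ["pip", "install", *PYG_PACKAGES, "-f", pyg_wheel_url(torch_version)]
```

With torch already installed, `import torch; print(" ".join(pip_command(torch.__version__)))` prints a command you can inspect before running it.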

Once torch and torch_geometric are installed, you can install molxspec:

pip install https://github.com/dimenwarper/molxspec/releases/download/v.0.1.0/molxspec-0.1.0-py3-none-any.whl


You can predict spectra from the command line:

mol2spec --model [mlp | gcn | egnn | bert] input_smiles.txt output.txt

Here, input_smiles.txt is a file containing one molecule SMILES string per line. For the egnn model, the 3D structure of each molecule is computed and optimized automatically using RDKit. The first run downloads the pretrained models automatically, which can take some time, but this only happens once.
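If your molecules start out as a Python list, the input file and the command line can be prepared as in the sketch below. The helper names `write_smiles_file` and `mol2spec_command` are hypothetical and not part of molxspec; only the model names and the argument order come from the command shown above.

```python
from pathlib import Path
from typing import Iterable, List

# Model names accepted by the --model flag, as listed above.
MODELS = ("mlp", "gcn", "egnn", "bert")


def write_smiles_file(smiles: Iterable[str], path: str) -> int:
    """Write one SMILES string per line, skipping blank entries; return the count."""
    cleaned = [s.strip() for s in smiles if s.strip()]
    Path(path).write_text("".join(s + "\n" for s in cleaned))
    return len(cleaned)


def mol2spec_command(model: str, input_path: str, output_path: str) -> List[str]:
    """Build the mol2spec invocation as an argument list."""
    if model not in MODELS:
        raise ValueError(f"unknown model {model!r}; expected one of {MODELS}")
    return ["mol2spec", "--model", model, str(input_path), str(output_path)]
```

The resulting list can be passed to `subprocess.run(..., check=True)` once molxspec is installed, for example `subprocess.run(mol2spec_command("mlp", "input_smiles.txt", "output.txt"), check=True)`.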

You can also predict spectra programmatically:

from molxspec import mol2spec

list_of_smiles = ["CCO", "c1ccccc1"]  # example input: ethanol and benzene
dict_of_smiles_and_spectra = mol2spec.predict(list_of_smiles)
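To reuse the same one-SMILES-per-line file with the programmatic interface, something like the following sketch could work. `read_smiles` and `predict_from_file` are hypothetical helpers, not part of molxspec; the only library call used is `mol2spec.predict` with a list of SMILES, as shown above. The layout of each returned spectrum is not documented here, so the sketch makes no assumption about it.

```python
from pathlib import Path
from typing import List


def read_smiles(path: str) -> List[str]:
    """Read one SMILES string per line, ignoring blank lines."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def predict_from_file(path: str):
    """Predict spectra for every SMILES string in the file."""
    # Imported lazily so read_smiles can be used without molxspec installed.
    from molxspec import mol2spec

    return mol2spec.predict(read_smiles(path))
```

For example, `for smiles, spectrum in predict_from_file("input_smiles.txt").items(): print(smiles, type(spectrum).__name__)` lists each molecule with the type of its predicted spectrum.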