/selfies

Robust representation of semantically constrained graphs, in particular for molecules in chemistry

Primary LanguagePythonApache License 2.0Apache-2.0

SELFIES

SELFIES (SELF-referencIng Embedded Strings) is a general-purpose, sequence-based, robust representation of semantically constrained graphs. It is based on a Chomsky type-2 grammar, augmented with two self-referencing functions. A main objective is to use SELFIES as direct input into machine learning models, in particular in generative models, for the generation of graphs with high semantical and syntactical validity.

See the paper at arXiv: https://arxiv.org/abs/1905.13741

The code presented here is a concrete application of SELFIES in chemistry, for the robust representation of molecule. We show the encoding and decoding of three molecules from various databases, and the generation of a new, random molecule with high semantical and syntactical validity.

Installation

You can install SELFIES via

pip install selfies

Examples

Several examples can be seen in examples/selfies_example.py. Here is a simple encoding and decoding:

from selfies import encoder, decoder
    
test_molecule1='CN1C(=O)C2=C(c3cc4c(s3)-c3sc(-c5ncc(C#N)s5)cc3C43OCCO3)N(C)C(=O)C2=C1c1cc2c(s1)-c1sc(-c3ncc(C#N)s3)cc1C21OCCO1' # non-fullerene acceptors for organic solar cells
selfies1=encoder(test_molecule1)
smiles1=decoder(selfies1)

print('test_molecule1: '+test_molecule1+'\n')
print('selfies1: '+selfies1+'\n')
print('smiles1: '+smiles1+'\n')
print('equal: '+str(test_molecule1==smiles1)+'\n\n\n')
  • an example of SELFIES in a generative model can be seen in the directory 'VariationalAutoEncoder_with_SELFIES'. There, SMILES datasets are automatically translated into SELFIES, and used for training of a variational autoencoder (VAE).

Python version

fully tested with Python 3.7.1 on

supported:

  • Python 3.7.2, 3.7.1, 3.6.8, 3.6.7, 2.7.15