Encoding new unseen molecules
manavsingh415 opened this issue · 2 comments
Hi. When trying to create 512-dimensional vector representations of some new molecules (which the encoder may not have seen during training), I get the following error:
Traceback (most recent call last):
File "encode.py", line 56, in
encode(**args)
File "encode.py", line 35, in encode
latent = model.transform(model.vectorize(mols_in))
File "/content/latent-gan/ddc_pub/ddc_v3.py", line 1042, in vectorize
return self.smilesvec1.transform(mols_test)
File "/content/latent-gan/molvecgen/vectorizers.py", line 145, in transform
one_hot[i,j+offset,charidx] = 1
IndexError: index -201 is out of bounds for axis 1 with size 138
I am using the pretrained ChEMBL encoder. Any ideas about how to resolve this? Thanks.
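For what it's worth, the negative index suggests the input SMILES is longer than the fixed one-hot width the pretrained vectorizer was built with (138 along axis 1 here), so the left-padding offset in transform() goes negative. Below is a minimal diagnostic sketch, not code from the repo: it assumes model is the loaded ddc_v3 model from encode.py, mols_in is the list of RDKit mols passed to it, and the vectorizer exposes its padded width as maxlength; the raw character count is only a rough proxy for the tokenized length.

from rdkit import Chem

# fixed one-hot width of the pretrained encoder (138 in the traceback above)
max_len = model.smilesvec1.maxlength
for mol in mols_in:
    smi = Chem.MolToSmiles(mol, canonical=True)
    if len(smi) > max_len:
        # these molecules will not fit the pretrained vectorizer
        print(f"Too long for the pretrained encoder ({len(smi)} > {max_len}): {smi}")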
Did you find a solution to this?
Because they explicitly mention in the README that the token length limit is 128, I decided to use SmilesVectorizer from molvecgen. I removed all SMILES for which the token vector has a length larger than the limit. Suppose your data frame is called data in the example below.
from rdkit import Chem
from tqdm import tqdm
from molvecgen.vectorizers import SmilesVectorizer

remove = []
TOKEN_LENGTH_LIMIT = 128

# data is your pandas DataFrame with a SMILES column
for index, row in tqdm(data.iterrows(), total=len(data)):
    mol = Chem.MolFromSmiles(row.SMILES)
    # fit a vectorizer on this single molecule to get its (padded) token length
    sm_en = SmilesVectorizer(canonical=True, augment=False)
    sm_en.fit([mol], extra_chars=["\\"])
    if sm_en.maxlength > TOKEN_LENGTH_LIMIT:
        remove.append(index)

print(f"There are {len(remove)} SMILES with a token length larger than {TOKEN_LENGTH_LIMIT}")

data.drop(remove, inplace=True)
data.to_csv("preprocessed.csv", index=False, header=False)
And now it worked.
The other option, if too many molecules get discarded because their token length is larger than 128, is to retrain the autoencoder so it can handle longer SMILES.
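To decide between the two, it can help to check how long your SMILES actually get. A quick sketch under the same assumptions as above (data is a DataFrame with a SMILES column; character count is only a rough proxy for the padded token length the vectorizer uses):

lengths = data.SMILES.str.len()
print(f"max SMILES length: {lengths.max()}, "
      f"molecules above 128 characters: {(lengths > 128).sum()} of {len(data)}")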
Good luck.