Encoding new unseen molecules
manavsingh415 opened this issue · 2 comments
Hi. When trying to create 512-dimensional vector representations of some new molecules (which the encoder may not have seen during training), I get the following error:
Traceback (most recent call last):
File "encode.py", line 56, in
encode(**args)
File "encode.py", line 35, in encode
latent = model.transform(model.vectorize(mols_in))
File "/content/latent-gan/ddc_pub/ddc_v3.py", line 1042, in vectorize
return self.smilesvec1.transform(mols_test)
File "/content/latent-gan/molvecgen/vectorizers.py", line 145, in transform
one_hot[i,j+offset,charidx] = 1
IndexError: index -201 is out of bounds for axis 1 with size 138
I am using the pretrained ChEMBL encoder. Any ideas about how to resolve this? Thanks.
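For what it's worth, the negative index suggests the input SMILES is longer than the fixed one-hot width the pretrained vectorizer was built with (138 along axis 1 here), so the left-padding offset in transform() goes negative. Below is a minimal diagnostic sketch, not code from the repo: it assumes model is the loaded ddc_v3 model from encode.py, mols_in is the list of RDKit mols passed to it, and the vectorizer exposes its padded width as maxlength; the raw character count is only a rough proxy for the tokenized length.

from rdkit import Chem

# fixed one-hot width of the pretrained encoder (138 in the traceback above)
max_len = model.smilesvec1.maxlength
for mol in mols_in:
    smi = Chem.MolToSmiles(mol, canonical=True)
    if len(smi) > max_len:
        # these molecules will not fit the pretrained vectorizer
        print(f"Too long for the pretrained encoder ({len(smi)} > {max_len}): {smi}")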
Did you find a solution to this?
Because they explicitly mention in the README that the token length limit is 128, I decided to use SmilesVectorizer from molvecgen. I removed all SMILES for which the token vector has a length larger than the limit. Suppose your data frame is called data in the example below.
from rdkit import Chem
from tqdm import tqdm
from molvecgen.vectorizers import SmilesVectorizer

remove = []
TOKEN_LENGTH_LIMIT = 128

# data is your pandas DataFrame with a SMILES column
for index, row in tqdm(data.iterrows(), total=len(data)):
    mol = Chem.MolFromSmiles(row.SMILES)
    # fit a vectorizer on this single molecule to get its (padded) token length
    sm_en = SmilesVectorizer(canonical=True, augment=False)
    sm_en.fit([mol], extra_chars=["\\"])
    if sm_en.maxlength > TOKEN_LENGTH_LIMIT:
        remove.append(index)

print(f"There are {len(remove)} SMILES with a token length larger than {TOKEN_LENGTH_LIMIT}")

data.drop(remove, inplace=True)
data.to_csv("preprocessed.csv", index=False, header=False)
And now it worked.
The other option, if too many molecules get discarded because their token length is larger than 128, is to retrain the autoencoder so it can handle longer SMILES.
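To decide between the two, it can help to check how long your SMILES actually get. A quick sketch under the same assumptions as above (data is a DataFrame with a SMILES column; character count is only a rough proxy for the padded token length the vectorizer uses):

lengths = data.SMILES.str.len()
print(f"max SMILES length: {lengths.max()}, "
      f"molecules above 128 characters: {(lengths > 128).sum()} of {len(data)}")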
Good luck.