possible issue with random sampling

Question

ga01 opened this issue 6 years ago · 4 comments

Hi guys,

I have encountered the following "weird" behaviour when I sample the latent space near a SMILES: the output molecules change only a little with the specified noise level. My installation seems to be okay; it reproduces the examples (I'm using a CPU-based installation), so I wonder whether I am missing something. I provide some examples below, but the same holds for many other molecules. (For the cases here I take only 100 samples, but for "production" work I take tens of thousands, and the pattern remains.)

Noise 200:
$ python get_vae_smiles.py "CSCC(=O)NNC(=O)c1c(C)oc(C)c1C" 2>/dev/null
Using standarized functions? True
Standarization: estimating mu and std values ...done!
Input : CSCC(=O)NNC(=O)c1c(C)oc(C)c1C
Reconstruction : CSCC(=O)N(C(=O)c1c(C)oc(C)c1C
Z representation : (1, 196) with norm 10.705
Searching molecules randomly sampled from 200.00 std (z-distance) from the point
Found 10 unique mols, out of 30
SMILES
0 CSCC(=O)NNC(=O)c1c(C)oc(C)c1C
1 CSC(C=O)NNC(=O)c1c(C)oc(C)c1C
2 COCC(=O)NC(C=O)c1c(C)oc(C)c1C
3 CSCC(=O)NCC(=O)c1c(C)oc(C)c1C
4 COCC(=O)NCC(=O)c1c(C)oc(C)c1C
5 CSC(C=O)NCC(=O)c1c(C)oc(C)c1C
6 COCC(=O)NCC(=O)c1c(C)oc(C)c1Cl
7 CSC(C=O)NCC(=O)c1c(F)oc(C)c1C
8 COCC(=O)NC(=O)c1cc(O)nc(C)c1C
9 C#COC(=N)NC(=O)c1ccccc(Cl)cc1Cl
Name: smiles, dtype: object

Noise 2:
Searching molecules randomly sampled from 2.00 std (z-distance) from the point
Found 13 unique mols, out of 75
SMILES
0 CSCC(=O)NNC(=O)c1c(C)oc(C)c1C
1 CSC(C=O)NNC(=O)c1c(C)oc(C)c1C
2 CSCC(=O)NC(C=O)c1c(C)oc(C)c1C
3 COCC(=O)NC(C=O)c1c(C)oc(C)c1C
4 CSCC(=O)NCC(=O)c1c(C)oc(C)c1C
5 COC(C=O)NNC(=O)c1c(C)oc(C)c1C
6 COCC(=O)NCC(=O)c1c(C)oc(C)c1C
7 CSCC(=O)NCC(=O)c1c(O)oc(C)c1C
8 CSC(C=O)NCC(=O)c1c(C)oc(C)c1C
9 CSC(C=O)NCC(=O)c1c(F)oc(C)c1C
10 COC(C=O)NCC(=O)c1c(C)oc(C)c1C
11 CSCC(=O)N/C(=O)c1c(C)oc(C)c1C
12 ClCC(=O)NCC(=O)c1c(C)oc(C)c1C
Name: smiles, dtype: object

Noise 50:
Searching molecules randomly sampled from 50.00 std (z-distance) from the point
Found 14 unique mols, out of 65
SMILES
0 CSCC(=O)NNC(=O)c1c(C)oc(C)c1C
1 COCC(=O)NNC(=O)c1c(C)oc(C)c1C
2 CSC(C=O)NNC(=O)c1c(C)oc(C)c1C
3 COCC(=O)NC(C=O)c1c(C)oc(C)c1C
4 CSCC(=O)NCC(=O)c1c(C)oc(C)c1C
5 CSC(C=O)NC(C=O)c1c(C)oc(C)c1C
6 COCC(=O)NCC(=O)c1c(C)oc(C)c1C
7 CSC(C=O)NCC(=O)c1c(C)oc(C)c1C
8 CSC(C=O)NCC(=O)c1c(F)oc(C)c1C
9 COC(C=O)NCC(=O)c1c(C)oc(C)c1C
10 CSCC(=O)N/C(=O)c1c(C)oc(C)c1C
11 ClC(C=O)NCC(=O)c1c(C)oc(C)c1C
12 ClCC(=O)NCC(=O)c1c(C)oc(C)c1C
13 ClCC(=O)NC(C=O)c1c(C)oc(C)c1C
Name: smiles, dtype: object

So it seems that for large Z distances the SMILES are not much different from those for small distances. What is the distribution of the random sampling? I would expect this if the random sampling is not uniform and is heavily biased towards the coordinates of the input SMILES, so that the specified noise level affects only the periphery, and most molecules of the output still originate from the close neighbourhood of the SMILES.

I would greatly appreciate any help with this issue.

Best wishes,
Gyorgy Abrusan

Answer 1 · 2018-12-19T02:57:52.000Z

The average distance between molecules is ~20. The molecules are distributed as you said, close to the SMILES of the training set molecules.

My guess is that setting the noise to a very large value makes it difficult to find valid SMILES. Bumping @beangoben to see if he has any ideas.

Answer 2 · 2018-12-19T19:19:41.000Z

hi Abrusan,

I think the problem might be that the noise level is too high. Searching for molecules from random vectors that are 50-200 STD away (z-distance wise) means searching at a huge distance. Each dimension is assumed to be Gaussian distributed, so the actual probability mass outside of 2-4 STD should be quite small (https://arxiv.org/abs/1609.04468). If I had to guess, I would think the RNN is just decoding whatever it can make sense of from the first molecule in the batch.

I think you will find more subtle differences and a larger variety if you sample with 0.5 (local neighborhood) or 1.0-2.0 (random molecules).
I am also not sure whether this is some effect of the decoder as coded. Another repo for generative molecular models that you can test out, and that has a PyTorch implementation, is https://github.com/molecularsets/moses
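To put rough numbers on the Gaussian argument above, here is a minimal sketch using only the Python standard library. It assumes each latent dimension follows a standard normal, which is the assumption stated in this answer and not something checked against the trained model. The helper names `two_sided_tail` and `mean_norm` are made up for illustration; the dimension 196 is taken from the "Z representation" line in the log.

```python
import math
import random


def two_sided_tail(k):
    """Probability that a standard normal value lies more than k std from 0."""
    return math.erfc(k / math.sqrt(2.0))


def mean_norm(dim, n_samples=2000, seed=0):
    """Monte Carlo estimate of the average Euclidean norm of a dim-dimensional
    standard normal vector (assumed prior, not the trained model)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim)))
    return total / n_samples


if __name__ == "__main__":
    for k in (2, 4):
        print(f"P(|x| > {k} std) = {two_sided_tail(k):.2e}")
    print(f"average norm of a 196-d standard normal vector: {mean_norm(196):.2f}")
```

Under that assumption, about 4.6% of the mass of a single dimension lies outside 2 std and well under 0.01% lies outside 4 std, while a typical 196-dimensional sample has a norm close to sqrt(196) = 14. A point 50-200 units away would then sit far outside the region where the assumed prior has any appreciable mass.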

Answer 3 · 2018-12-20T21:36:00.000Z

Hi guys,

Thanks for the comments.

I have to admit I still think something critical is missing. I used several noise levels: 3, 6, 12, 25, 50, 100 (but also tried 0.1, and even 200; [noise=N, df = vae.z_to_smiles(z_1,decode_attempts=100,noise_norm=noise)]). My aim was to have a gradient from noise levels that sample the close neighborhood of a SMILES to noise levels that effectively pick SMILES randomly from the entire latent space. To my surprise, dramatic increases in the specified noise level lead to rather modest increases in the diversity of the returned molecules - I have not reached a noise level that effectively returns SMILES that are structurally unrelated to the input. (In other words - what noise level should I use to sample the entire latent space, essentially randomly?)

So it seems that in practice the relationship between noise level and the structural diversity of the returned SMILES is rather nontrivial, which is surprising, given the perturb_z (vae_utils.py) function (but I am not a Python programmer). It would be great if you could clarify this.

Best wishes,
Gyorgy

Answer 4 · 2018-12-22T08:19:02.000Z

Hi guys,
just one more comment. I wonder whether the explanation is an error in my assumption that by increasing the noise level I can reach a level that results in SMILES randomly picked from the latent space. The perturb_z function basically adds and amplifies Gaussian noise to the Z vector. However, if this distorts Z in ways that are qualitatively different from valid SMILES (i.e. the differences between valid Z vectors are not normally distributed), then only a small fraction of the perturbed vectors - the ones closest to the input vector - will produce valid SMILES. In other words, by adding and amplifying Gaussian noise it is not possible to reach a noise level that produces valid SMILES that sample the entire latent space. This is good - it makes the VAE robust - but it makes my original goal impossible.
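The effect described above can be illustrated with a small sketch. This is not the repository's perturb_z; `perturb` below is a hypothetical stand-in that moves a vector by exactly `noise_norm` along a random Gaussian direction, and `z` is a random placeholder rescaled to the norm 10.705 reported in the log, not a real encoding of a molecule.

```python
import math
import random


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def perturb(z, noise_norm, rng):
    """Hypothetical perturbation (NOT the library's perturb_z): shift z by
    exactly noise_norm along a random direction drawn from a Gaussian."""
    direction = [rng.gauss(0.0, 1.0) for _ in z]
    length = norm(direction)
    return [zi + noise_norm * d / length for zi, d in zip(z, direction)]


rng = random.Random(0)
dim = 196  # latent size from the "Z representation" line in the log

# Placeholder latent vector rescaled to norm 10.705 (value taken from the log).
raw = [rng.gauss(0.0, 1.0) for _ in range(dim)]
scale = 10.705 / norm(raw)
z = [scale * x for x in raw]

if __name__ == "__main__":
    for noise in (0.5, 2.0, 50.0, 200.0):
        z_new = perturb(z, noise, rng)
        print(f"noise {noise:6.1f}: norm of perturbed vector = {norm(z_new):7.2f}")
```

In this sketch the perturbed vector's norm is dominated by the noise once the noise exceeds the norm of z, so large noise levels push the point ever farther from the origin instead of spreading samples over the region where encoded molecules sit. Whether the real decoder then fails or falls back to something close to the input is a question about the trained model that this sketch cannot answer.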

Best wishes,
Gyorgy