How to randomly generate molecules?
zhouhao-learning opened this issue · 15 comments
Hello, I am very happy to see your research. I want to use the trained generative model to randomly generate molecules, without setting any conditions, but I don't know how to do this. Could you give me some guidance? Can the model generate molecules randomly instead of conditionally?
Best!
Hi there,
If you look at one of the Jupyter notebooks available, for example the LogP one, you will notice that before biasing the generator, it first generates an unbiased distribution, which is later used for comparison.
I believe what you want is exactly this unbiased distribution.
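Roughly, the sampling loop looks like this. This is a minimal sketch assuming the setup from the demo notebook (`gen_data`, `my_generator`, and the pretrained checkpoint), and that `evaluate` returns one sampled SMILES framed by the `<` and `>` tokens, as the notebook's helper functions assume:

```python
# Sketch: sample unbiased molecules from the pretrained generator.
# Assumes my_generator and gen_data were built as in the demo notebook.
my_generator.load_model('./checkpoints/generator/checkpoint_biggest_rnn')

unbiased_smiles = []
for _ in range(1000):
    sm = my_generator.evaluate(gen_data, predict_len=120)
    unbiased_smiles.append(sm[1:-1])  # strip the '<' start and '>' end tokens
```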
@gmseabra
Hello, I don't quite follow. As far as I know, your generation is based on a reinforcement learning strategy: for example, the rewards you set are for properties such as a high melting point or the presence of particular rings. My idea is to let the generative model generate new molecules randomly, without setting any bias and without any reward rules.
Please guide me again, thank you!
Best
@zhouhao-learning : this is exactly what @gmseabra means! If you just use the generator it will produce unbiased new molecules without property optimization.
@isayev
So what are the rules for reinforcement learning? The paper mentions that the new molecules are generated by the generator under reinforcement learning conditions. Now I don't want to use any rules. Is my understanding wrong? Please give me more guidance. Thank you!
@zhouhao-learning Please elaborate in depth on what exactly you mean. By 'don't want to use any rules', do you mean all rules of chemistry, or a specific property optimization? If the former, those are random SMILES strings that do not correspond to valid chemical structures. Otherwise, just use the pretrained baseline generator and it will do exactly what you are asking for.
@isayev
Hello, I mean that I want to use my own SMILES data to train a generative model for producing new molecules, but I don't want the model to generate molecules biased by reinforcement learning rules. As I understand the paper, the molecules generated there were produced under reinforcement learning rules. Now I want to drop the reinforcement learning step entirely and just generate. Do you understand what I mean? Thank you!
Sorry, I misunderstood you. But to me, "use my SMILES data" does not mean "randomly generate molecules". However, this could be done quite easily.
- You could load our pretrained model and continue training with your data. Open the Jupyter notebook "LogP_optimization_demo.ipynb", uncomment cells 16-19, and point cell 9 at your data to train. You will probably need to play with the learning rate, etc. (see the sketch after this list).
- You could train your model fully from scratch, without loading our weights. This is not recommended unless you have a very big dataset; training a good model might take 1-2 weeks on a single GPU.
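A rough sketch of option 1 (the SMILES path below is a placeholder for your own file; the checkpoint path and the `GeneratorData`/`load_model`/`fit` calls are the same ones used later in this thread):

```python
# Option 1, sketched: start from the published weights, then continue
# training on your own SMILES file ('<PATH_TO_YOUR_SMILES>' is a placeholder).
my_gen_data = GeneratorData(training_data_path='<PATH_TO_YOUR_SMILES>',
                            delimiter='\t', cols_to_read=[0],
                            keep_header=True, tokens=tokens)
my_generator.load_model('./checkpoints/generator/checkpoint_biggest_rnn')
# The learning rate is fixed when the StackAugmentedRNN is constructed;
# a smaller value than for training from scratch usually makes sense here.
losses = my_generator.fit(my_gen_data, 1000)
```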
@isayev
Thank you for your patience. Sorry, I may not have described it in enough detail. The method you describe is transfer learning, and I can use transfer learning to train the generative model. What I actually want is to use my SMILES data to train a generative model that is then used to generate new molecules. My question is about the generation step: how can I use the trained model to generate new molecules directly, instead of generating them under reinforcement learning rules? Is there an implemented way to do this?
Thank you!
Yes, option number 2 above. Load your SMILES in cell 9, and train your own generator.
@isayev
So can I use the generator to generate new molecules directly, ignoring the reinforcement learning rules? If so, thank you very much; I will try the feasibility of this method!
@isayev
I suddenly thought of another problem. My generative model should be used together with the predictive model, as in a GAN. If no predictive model continuously feeds back into the generative model, then my generative model will only learn the syntax of SMILES without capturing the characteristics of my training data, and the new molecules it generates will not share characteristics with my training data. Am I right?
NO. If you continually feed the model your own molecules (assuming they have some common characteristic, that is, they are not just random molecules), eventually the model will learn to generate molecules that will have structural characteristics similar to the molecules you are feeding it. So, it will eventually capture structural characteristics of your training data.
Let's look again at the LogP_optimization_demo.ipynb file. If you just load the pre-trained generator, as is done in cell 20, you get a generator that was trained only to generate valid SMILES. No property selection of any kind was used in training it: it was trained on the molecules from ChEMBL, a huge set of known, already-synthesized molecules, purely to learn how to generate valid molecules. No optimizations, no properties. So, if all you want is to generate random molecules, you can just use this generator.
Now, if you want, you can train the generator with the molecules you have in your database. But pay attention to the following: the generator needs a large dataset to learn the "rules of chemistry" by itself and generate valid molecules. Currently it learns those rules from the ChEMBL data, which is very large (>1.5 million molecules). It is very unlikely that your database is larger than ChEMBL, so the new generator will probably not work as well: it will tend to generate molecules that look like the ones in your dataset, without learning the chemistry rules as thoroughly.
But it can be done if that's what you want. To train the generator with your SMILES, just change the path in cell 9 to point to your SMILES file, then uncomment cell 16 to train the generator.
Finally, I just want to point out that another (probably better) option is to retrain the generator with your molecules: create a new gen_data (see cells 9-11) with your molecules, then re-train the generator on this new gen_data for some number of steps. For example:
```python
new_gen_data_path = '<PATH_TO_YOUR_SMILES>'
new_gen_data = GeneratorData(training_data_path=new_gen_data_path, delimiter='\t',
                             cols_to_read=[0], keep_header=True, tokens=tokens)
losses = my_generator.fit(new_gen_data, 1000)
```
That should bias the generator towards molecules that (structurally) look more like the ones in your dataset but, most importantly, without forgetting the chemical rules it already learned from ChEMBL.
That seems to be closer to what you want, from what I understand.
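As a quick sanity check after this retraining, you can sample some molecules and count how many are still valid. A sketch (RDKit here is an extra dependency I'm using for parsing, not something the repo requires; `evaluate` is assumed to return one SMILES framed by '<' and '>', as in the notebook's helpers):

```python
# Sanity check (sketch): how many sampled SMILES parse as valid molecules?
from rdkit import Chem

samples = [my_generator.evaluate(new_gen_data, predict_len=120)[1:-1]
           for _ in range(500)]
valid = [sm for sm in samples if Chem.MolFromSmiles(sm) is not None]
print(f'{len(valid)}/{len(samples)} sampled SMILES are valid')
```

If the valid fraction drops sharply compared to the pretrained generator, you have probably fine-tuned too aggressively and should use fewer steps or a smaller learning rate.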
@gmseabra
Ok, thank you very much for your patience. This may be what I want; I will try it. Thank you very much!
Best!
Glad to help. Enjoy!
@gmseabra
Sorry, I have another question. When I train with my SMILES data, should I load your trained model first and then train? Will the generative model then generate new molecules similar to my SMILES? Like this:
```python
# Imports and use_cuda as in the demo notebook (the release/ utilities
# are assumed to be on the path).
import torch
from data import GeneratorData
from stackRNN import StackAugmentedRNN

use_cuda = torch.cuda.is_available()

gen_data_path = './data/Mer_gene_data.csv'
tokens = ['<', '>', '#', '%', ')', '(', '+', '-', '/', '.', '1', '0', '3', '2', '5', '4', '7',
          '6', '9', '8', '=', 'A', '@', 'C', 'B', 'F', 'I', 'H', 'O', 'N', 'P', 'S', '[', ']',
          '\\', 'c', 'e', 'i', 'l', 'o', 'n', 'p', 's', 'r', '\n']

gen_data = GeneratorData(training_data_path=gen_data_path, delimiter='\t',
                         cols_to_read=[0], keep_header=True, tokens=tokens)

hidden_size = 1500
stack_width = 1500
stack_depth = 200
layer_type = 'GRU'
lr = 0.0001
# optimizer_instance = torch.optim.Adadelta
optimizer_instance = torch.optim.SGD

my_generator = StackAugmentedRNN(input_size=gen_data.n_characters, hidden_size=hidden_size,
                                 output_size=gen_data.n_characters, layer_type=layer_type,
                                 n_layers=1, is_bidirectional=False, has_stack=True,
                                 stack_width=stack_width, stack_depth=stack_depth,
                                 use_cuda=use_cuda,
                                 optimizer_instance=optimizer_instance, lr=lr)

# Load the pretrained weights, then fine-tune on the new data.
model_path = './checkpoints/generator/checkpoint_biggest_rnn'
my_generator.load_model(model_path)
losses = my_generator.fit(gen_data, 10000)
```