Exploring Target Datasets: Understanding the Role of Confidence in Reaction Selection

Question

Closed this issue a year ago · 9 comments

Hi @eunjaeshim! Could you explain the purpose of the variable confidence_selected_rxns in the code, and why confidence (conf) is calculated in the explore_target_in_batches function? From reviewing the code, the calculated confidence values do not appear to be used or returned anywhere the function is called.

Answer 1 · 2024-01-23T15:16:36.000Z

Hello,
Looking at it again, it seems to be part of code used for a side analysis that was deemed unnecessary to publish.
The reaction selection criteria (UBC, greedy, etc.) should take care of reaction selection based on confidence.
Hope this helped, please let me know if you have other questions.

Answer 2 · 2024-01-23T15:29:47.000Z

Thank you for your response. So, if I understand correctly, conf only affects rxns_select and has no impact on other parts of the code, is that right? If so, I also have a question about the confidence strategy: why does it lean towards selecting reactions with low predicted yields for learning?

Answer 3 · 2024-01-23T15:53:30.000Z

Please check the second paragraph in the Dataset section under Computational Details in our paper.

Answer 4 · 2024-01-23T16:04:33.000Z

All right. Thanks for your patience.

Answer 5 · 2024-01-23T16:26:37.000Z

Sorry to bother you again. I have worked through the 'greedy' part of the algorithm following the description in the paper, and my understanding is as follows:

First, predict the probability p that the yield label of each target sample is 1.
Compute conf as 1-p.
Then sort the predicted probabilities from smallest to largest and take the indices of the first num_rxns as the selected reactions, while the others are left unselected.

idx_rxn_to_run = np.argsort(pred_proba)[:num_rxns]
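
The steps above can be sketched as follows. This is only an illustration of my reading, not the repository's actual code: the function name select_lowest_proba and the example values are hypothetical, and pred_proba is assumed to be a 1-D array of predicted probabilities that the label is 1.

```python
import numpy as np

def select_lowest_proba(pred_proba, num_rxns):
    """Hypothetical helper: pick the num_rxns entries with the smallest P(label == 1)."""
    pred_proba = np.asarray(pred_proba, dtype=float)
    # Step 2: conf is defined as 1 - p.
    conf = 1.0 - pred_proba
    # Step 3: ascending sort on p, keep the first num_rxns indices.
    idx_rxn_to_run = np.argsort(pred_proba)[:num_rxns]
    return idx_rxn_to_run, conf

# Made-up probabilities for five candidate reactions.
pred_proba = np.array([0.9, 0.2, 0.6, 0.1, 0.75])
idx_rxn_to_run, conf = select_lowest_proba(pred_proba, num_rxns=2)
print(idx_rxn_to_run)
```

With no ties, sorting p in ascending order gives the same ranking as sorting conf in descending order, so the selection picks the entries with the highest conf.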

Since my knowledge of chemistry is rather limited, I don't understand why reactions with a low predicted probability of having label 1 would be chosen for training.

Answer 6 · 2024-01-23T16:31:13.000Z

This is because of the rather unusual problem setting, which results from a limitation in the dataset structure. Normally, you would take the num_rxns candidates with the highest predicted probability.
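
For comparison, a minimal sketch of that more conventional selection, assuming pred_proba is a 1-D array of predicted probabilities of a positive label. The function name select_highest_proba and the example values are hypothetical and not part of the published code.

```python
import numpy as np

def select_highest_proba(pred_proba, num_rxns):
    """Hypothetical helper: pick the num_rxns entries with the largest P(label == 1)."""
    pred_proba = np.asarray(pred_proba, dtype=float)
    # argsort is ascending, so reverse it before taking the top num_rxns.
    return np.argsort(pred_proba)[::-1][:num_rxns]

# Made-up probabilities for five candidate reactions.
pred_proba = np.array([0.9, 0.2, 0.6, 0.1, 0.75])
idx_top = select_highest_proba(pred_proba, num_rxns=2)
print(idx_top)
```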

Answer 7 · 2024-01-23T16:37:52.000Z

Okay, thanks for the explanation. I am not using this code right now; I am learning how algorithms can be used to predict reaction conditions for synthesis. Your article and code are very readable and the dataset is complete, but I got very confused at this point while trying to understand the reasoning behind the code.

Answer 8 · 2024-01-23T16:47:31.000Z

No problem, and thank you for the encouraging words - this was my first computational work.
Yes, that exact bit is confusing, but for your future use case, selecting the entries with the highest probability of a positive label should work. A follow-up experimental evaluation paper (without this confusion) is coming, so please stay tuned.

Answer 9 · 2024-01-23T16:49:52.000Z

Best wishes, and I look forward to your new work.