A Project Report on “CHEMICAL REACTION YIELD PREDICTION BY DEEP LEARNING MODEL” Submitted in Partial Fulfillment of the Requirements for the Award of the Degree of Bachelor of Technology in Chemical Engineering
Dr. Sudeep Yadav Associate Professor & Head of Department of Chemical Engineering
Devansh Kasaudhan (2004351010)
Department of Chemical Engineering, Bundelkhand Institute of Engineering & Technology, Jhansi-284128, U.P., India
Session: 2023-2024
BUNDELKHAND INSTITUTE OF ENGINEERING & TECHNOLOGY, JHANSI
I hereby declare that the work presented in this project entitled "CHEMICAL REACTION YIELD PREDICTION BY DEEP LEARNING MODEL", in partial fulfilment of the requirements for the award of the degree of Bachelor of Technology in Chemical Engineering, submitted in the Department of Chemical Engineering, Bundelkhand Institute of Engineering & Technology, Jhansi (U.P.), India (Affiliated to A.P.J. Abdul Kalam Technical University, Lucknow), is my own work, carried out under the guidance of Dr. Sudeep Yadav, Associate Professor and Head of the Department of Chemical Engineering, BIET Jhansi.
Devansh Kasaudhan (2004351010)
Date: 15-05-2024
This is to certify that the above statement made by the candidate is correct to the best of our knowledge.
Dr. Sudeep Yadav
Associate Professor & Head of Department of Chemical Engineering B.I.E.T Jhansi (U.P.), India
I, Devansh Kasaudhan, hereby declare that the project report entitled "CHEMICAL REACTION YIELD PREDICTION BY DEEP LEARNING MODEL", prepared under the guidance of Dr. Sudeep Yadav, is submitted in fulfilment of the requirements for the MAJOR PROJECT. This is a bonafide work carried out by me, and the results embodied in this project report have not been reproduced or copied from any source. The results embodied in this project report have not been submitted to any other university or institution for the award of any other degree or diploma.
Date: 15th May, 2024
Place: Department of Chemical Engineering, BIET, Jhansi
I feel great pleasure in expressing my deep sense of gratitude and heartiest respect to my project supervisor Dr. Sudeep Yadav, Associate Professor, Department of Chemical Engineering, and to Er. Shashank Gupta, Assistant Professor, Department of Computer Science and Engineering, for their persevering guidance and inspiration throughout the preparation of this project report.
I take this opportunity to express my thanks and respect to my parents and sister, and also to my friends and classmates, whose never-ending affection and encouragement have brought me to this level of my academic career.
Thanking You
Devansh Kasaudhan (2004351010)
In this project, BERT (Bidirectional Encoder Representations from Transformers) is applied to develop predictive models for chemical reaction yields using High Throughput Reaction (HTR) datasets. A central concern is evaluating the efficacy of the SMILES (Simplified Molecular Input Line Entry System) nomenclature within the BERT framework. Experiments were conducted on three distinct datasets: the Buchwald-Hartwig HTE dataset, comprising Pd-catalyzed Buchwald–Hartwig C–N cross-coupling reactions; the Suzuki-Miyaura HTE dataset, consisting of Pd-catalyzed C–C cross-coupling reactions; and the USPTO dataset, sourced from the United States Patent and Trademark Office (USPTO). The first and second models achieved R2 values of 0.9485 and 0.8310 respectively, with the first model demonstrating a higher level of explanatory power.
In the future, researchers could explore the effect of other advanced NLP models based on GPT (Generative Pre-trained Transformer); their enhanced predictive capabilities in chemical reaction yield estimation could offer valuable insights.
Table No. | Table Name |
---|---|
1 | Platform Details |
2 | Original Training Data Buchwald-Hartwig |
3 | Preprocessed Training Data Buchwald-Hartwig |
4 | Example Smiles Tokens |
5 | Training Parameters: Buchwald-Hartwig Training |
6 | Training Parameters: Suzuki-Miyaura Training |
7 | Training Parameters: USPTO Training |
8 | Buchwald-Hartwig Test Predictions |
9 | Suzuki-Miyaura Test Predictions |
Figure No. | Figure Name |
---|---|
1 | Datasets Overview |
2 | Pipeline and process overview |
3 | Frequency vs. Yield, Buchwald-Hartwig Dataset |
4 | Number of Reactions for Each Base |
5 | Frequency vs. Yield, Suzuki-Miyaura Dataset |
6 | Frequency vs. Yield, USPTO Dataset |
7 | Molecule to SMILES |
Sr. No. | Title |
---|---|
1. | Certificate |
2. | Declaration by the Candidates |
3. | Acknowledgement |
4. | Abstract |
5. | List of Tables |
6. | List of Figures |
Chapter 1
- Introduction
Chapter 2
- Literature Review
Chapter 3
- Problem Statement and Methods
- Research Gap
- Objectives and Proposed Solution
- Methodology
- Pipeline and Process Overview
Chapter 4
- Dataset Overview
- Buchwald-Hartwig Amination Overview
- Suzuki-Miyaura Reaction Dataset
- USPTO Dataset
Chapter 5
- Training Overview
- Training Method
- Test 1 on Buchwald-Hartwig Amination Dataset
- Test 2 on Suzuki-Miyaura Reaction Dataset
- Test 3 on USPTO Dataset
Chapter 6
- Results and Discussion
Chapter 7
- Conclusion and Future Scope
- Problems
- Conclusion
- Future Scope
References
The present growth in artificial intelligence is giving an extra edge to related fields. Most fields are adopting AI techniques to improve and optimize processes. Software industries directly implement AI models for customer support and convenience, while fields with larger domains to cover use AI to reduce research time. Non-IT domains like chemical processing are likewise adopting AI models for process optimization and control. According to Stanford University’s Artificial Intelligence Index Report 2024, “Fifteen important machine learning models came from academia in 2023, 51 generated by industry. Additionally, 21 noteworthy models - a new high - came from industry-academia collaborations in 2023.” [8] This is a considerable improvement in base-model creation, but chemical-processing models remain considerably underrepresented among these large models. Researchers therefore try to convert chemical-domain problems into natural language tasks to take advantage of these recent research developments. The ability to predict the outcomes of chemical reactions accurately can significantly accelerate research and development in various industries, leading to faster innovation cycles and more efficient discovery processes.
In this project, BERT (Bidirectional Encoder Representations from Transformers) is applied to develop predictive models for chemical reaction yields using High Throughput Reaction (HTR) datasets. Experiments were conducted on three distinct datasets: the Buchwald-Hartwig HTE dataset, the Suzuki-Miyaura HTE dataset, and the USPTO dataset. The first and second models achieved R2 values of 0.9485 and 0.8310 respectively, with the first demonstrating a higher level of explanatory power. Future research could explore advanced NLP models based on GPT (Generative Pre-trained Transformer) for enhanced predictive capabilities in chemical reaction yield estimation.
The use of advanced AI models like BERT for predicting chemical reaction yields holds great promise for the chemical industry. These models can significantly speed up research and development by providing quick and accurate predictions of reaction outcomes. This leads to faster innovation cycles and reduces costs by minimizing the need for extensive experimental trials. By optimizing reaction conditions, AI can improve process efficiency and sustainability, and even uncover new reactions and synthetic pathways. Additionally, integrating AI with automated lab equipment can streamline research further, while also promoting knowledge sharing and collaboration across different domains. Overall, AI-driven predictive models have the potential to transform chemical research and manufacturing, enhancing efficiency, fostering innovation, and reducing environmental impact.
- AI Adoption:
- Rapid AI growth enhances fields, including chemical processing for optimization and control.
- AI Model Development:
- Significant contributions from academia and industry, with increased collaborations.
- BERT in Reaction Prediction:
- BERT used to predict chemical reaction yields with High Throughput Reaction datasets.
- High R2 values (0.9485 and 0.8310) achieved on Buchwald-Hartwig HTE, Suzuki-Miyaura HTE, and USPTO datasets.
- Future Research:
- Explore advanced NLP models based on GPT for better yield prediction.
- Industry Benefits:
- AI accelerates R&D, reduces costs, and optimizes reactions.
- Potential for discovering new reactions and improving efficiency, innovation, and sustainability in chemical manufacturing.
S. No. | Authors | Description of Paper | My Findings |
---|---|---|---|
1 | Perera, D. (2018) [1] | Development of a computerized flow-based synthesis platform for pharmaceutical research. Integration of nanomole-scale reaction screening with micromole-scale synthesis. Validation through Suzuki-Miyaura coupling reactions at elevated temperatures. Capability to generate over 1,500 reactions per 24 hours. Direct production of micromole quantities of desired material through multiple injections. Replication of optimal conditions in traditional flow and batch modes for larger-scale synthesis. | |
2 | Ahneman, D. T. (2018) [2] | Machine learning (ML) is increasingly essential in scientific research across various fields. ML can predict synthetic reaction outcomes using high-throughput experimental data. Scripts can calculate atomic, molecular, and vibrational descriptors for reaction components. These descriptors serve as input for ML models to predict reaction yields. A random forest algorithm outperforms linear regression in predictive accuracy. The random forest model works well with limited data and can generalize to new, unseen data. This approach can streamline the adoption of new synthetic methods in chemistry. | |
3 | Devlin, J. (2019) [3] | BERT (Bidirectional Encoder Representations from Transformers) is designed to pre-train deep bidirectional representations. It trains on both left and right context. BERT can be fine-tuned with one extra output layer for various tasks without major architecture changes. It has achieved state-of-the-art results on 11 NLP tasks. Improvements include a 7.7% increase in GLUE score and a 4.6% increase in MultiNLI accuracy. BERT also improved SQuAD v1.1 and v2.0 Test F1 scores by 1.5 and 5.1 points, respectively. | |
4 | Schwaller, P. (2020) [4] | Data augmentation can significantly enhance the accuracy of chemical reaction yield predictions, even with limited data sets. Utilizing natural language processing (NLP) techniques, such as reaction transformer models, offers a novel approach to predicting yields using text-based representations of chemical reactions. Models trained on augmented data sets that include as little as 2.5% of the total data can surpass the performance of traditional models that rely on molecular fingerprints or physics-based descriptors. | |
5 | Weininger, D. (1988) [5] | SMILES is a linear notation for encoding molecular structures using ASCII strings. It was introduced in the 1980s and has been modified and extended since then, with OpenSMILES being developed in 2007. SMILES strings can be imported by most molecule editors for conversion into two-dimensional diagrams or three-dimensional models. The system is designed for high-speed machine processing, making it suitable for various chemical informatics applications. | |
6 | Wolf, T. (2019) [6] | Transformers are crucial in modern NLP, providing tools for handling a broad range of language processing tasks. The open-source library offers access to large-scale pretrained models, which can be further customized and deployed. It supports interoperability between frameworks like PyTorch, TensorFlow, and JAX, enhancing flexibility for users. The library is backed by a community that contributes to its continuous development and the addition of new models. | |
7 | Landrum, G. (2019) [7] | Molecules can be converted to text using functions for SMILES, Kekule SMILES, and MDL Mol blocks, with options to include names and stereochemistry. Conformer generation: RDKit can generate 2D and 3D conformers using methods like ETKDG, and supports multiple-conformer generation and optimization. Chemical reactions: RDKit supports applying chemical reactions to molecules using a SMARTS-based language or MDL rxn files. | |
8 | Artificial Intelligence Index Report 2024 (2024) [8] | AI has excelled in tasks such as image classification, visual reasoning, and English understanding but still faces challenges in areas like contest-level math and visual commonsense reasoning. The industry sector produced a significant number of machine learning models in 2023, indicating its dominance in AI research and development. The cost of training advanced AI models like OpenAI's GPT-4 and Google's Gemini Ultra has soared, reflecting the increasing expenses associated with cutting-edge AI development. The United States has emerged as a front-runner in AI innovation, contributing the majority of notable AI models compared to other regions such as the EU and China. | |
- Limited AI Models in Chemical Processing: Few AI models are tailored specifically for chemical processing.
- Advanced NLP Model Integration: Potential of GPT-based NLP models for predicting chemical reaction yields needs exploration.
- Scalability and Generalization Challenges: Models struggle with scalability and generalization across diverse chemical reactions and datasets.
- Automated Experimental Validation: Integrating AI models with automated lab equipment for validation is underexplored.
- To develop a predictive model using BERT (Bidirectional Encoder Representations from Transformers) on High Throughput Reaction (HTR) Datasets.
- To assess the effectiveness of the SMILES (Simplified Molecular Input Line Entry System) nomenclature within the BERT framework.
Schwaller et al. proposed employing deep learning models to predict the yield of chemical reactions from text-based representations of the reactions. This approach demonstrates excellent results on two high-throughput experiment reaction sets and on one dataset whose yield distribution depends on reaction mass scale:
- Publicly available sources of chemical reactions and yields, such as the USPTO database and HTE datasets.
- Cleaning and normalizing the data, converting reactions into text-based SMILES representations.
- Splitting the data into training, validation, and test sets, ensuring balanced distributions of yield values (a preprocessing sketch follows this list).
- Utilizing a deep-learning-based model, particularly the BERT model.
- Developing a loss function that accounts for both accuracy and uncertainty, and assessing performance using metrics such as Mean Absolute Error, Root Mean Squared Error, and R2 score; hyperparameter tuning can be done through grid search and Bayesian optimization methods.
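As one concrete illustration of the preprocessing and splitting steps, the sketch below canonicalizes reaction SMILES with RDKit and performs a random split. The file name, column names, and the 70/15/15 ratio are assumptions for illustration, not values taken from the project scripts.

```python
# A minimal preprocessing sketch, assuming a CSV with an "rxn" reaction-SMILES
# column and an "Output" yield column (column names follow Table 3); the file
# name and split ratio are illustrative.
import pandas as pd
from rdkit import Chem
from sklearn.model_selection import train_test_split

def canonicalize_rxn(rxn_smiles: str) -> str:
    """Canonicalize every molecule in a 'reactants>>product' reaction SMILES."""
    reactants, product = rxn_smiles.split(">>")
    canon = lambda smi: Chem.MolToSmiles(Chem.MolFromSmiles(smi))
    return ".".join(canon(m) for m in reactants.split(".")) + ">>" + canon(product)

df = pd.read_csv("buchwald_hartwig.csv")        # hypothetical file name
df["rxn"] = df["rxn"].apply(canonicalize_rxn)   # normalize the text representation

# 70/15/15 train/validation/test split with a fixed seed for reproducibility
train_df, temp_df = train_test_split(df, test_size=0.30, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.50, random_state=42)
```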
Fig. 1 Datasets Overview
Figure 2 describes a BERT fine-tuning pipeline specifically designed for predicting chemical reaction yields. The flowchart also includes a chemical structure in SMILES notation under “Reaction SMILES,” which represents the input data format for the model. The process aims to accurately predict the yields of chemical reactions using the BERT architecture, which has been extended and fine-tuned for this specific task.
Fig-2 Pipeline and process overview
- Models Stage:
- Reaction BERT: Utilizes “rxnfp” to encode chemical reactions.
- Regression Extension: Employs “simpletransformers” to adapt BERT for yield prediction.
- Data Split/Hyperparameters Stage:
- Train/Validation: The dataset is split for training and validation.
- Hyperparameter Selection: Optimal hyperparameters are chosen.
- Final Model Training: The model is trained with the selected hyperparameters.
- Evaluation Stage:
- Final Model Predictions: The fine-tuned model predicts yields on test data.
- Accuracy: The model, referred to as Yield-BERT, reportedly achieves 96.6% accuracy in predicting reaction yields for specific types of reactions.
The Buchwald-Hartwig amination is a potent chemical reaction used for synthesizing carbon-nitrogen bonds. It involves the palladium-catalyzed coupling reactions of amines with aryl halides, resulting in the formation of carbon-nitrogen bonds (C-N bonds) in organic molecules. Key components include aryl halides, amines, catalysts (typically palladium-based), and bases. The mechanism proceeds through a series of steps involving oxidative addition and reductive elimination. [1]
Fig-3 Frequency vs. Yield, Buchwald-Hartwig Dataset
Machine learning significantly enhances the prediction of reaction outcomes in complex chemical processes. Doyle et al. employed machine learning to estimate C-N cross-coupling reaction outcomes, crucial in pharmaceutical synthesis. Our model, trained on 4,140 reactions, integrates diverse factors like aryl halides and ligands, accelerating the development of synthetic methods by accurately predicting yields. [2]
Fig 4: Number of reactions for each Base
In this horizontal bar graph, the x-axis ranges from 0 to 20,000 and represents the frequency of reactions for the corresponding bases. According to this, there are almost 20,000 reactions for each of the bases. [1]
Substrates generally include aryl or vinyl halides as electrophiles and aryl or vinyl boron compounds as nucleophiles. A catalyst such as PdCl₂(PPh₃)₂ or Pd(OAc)₂ is used, along with a strong base such as potassium carbonate or cesium carbonate to neutralize acidic byproducts. To avoid side reactions, an aprotic solvent such as DMF, DMSO, or toluene is used. The reaction occurs at moderate temperatures, usually between 60°C and 100°C. [3]
Fig-5 Frequency vs. Yield, Suzuki-Miyaura Dataset
Machine learning significantly enhances the prediction of reaction outcomes in complex chemical processes. Reactions obtained by text-mining from United States patents published between 1976 and September 2016 are available. The largest shared reaction dataset was text-extracted from US patents by Lowe. However, this dataset may not be ideal for a yield prediction task due to noisy yield data in patents.
Fig-6 Frequency vs. Yield, USPTO Dataset
SMILES, which stands for Simplified Molecular-Input Line-Entry System, is a shorthand notation used to represent chemical structures using brief ASCII strings. These SMILES strings can be readily imported by most molecular editors, enabling conversion back into two-dimensional drawings or three-dimensional models of the molecules. Here is an example illustrating the SMILES system. [6]
CC(C)C(C=C(C(C)C)C=C1C(C)C)=C1C2=C(P([C@@]3(C[C@@H]4C5)C[C@H](C4)C[C@H]5C3)[C@]6(C7)C[C@@H](C[C@@H]7C8)C[C@@H]8C6)C(OC)=CC=C2OC
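To make the notation concrete, the short RDKit snippet below parses a SMILES string into a molecule object and writes it back out in RDKit's canonical form; the aspirin SMILES is just an example input, not a reaction from the datasets.

```python
# Round-tripping a structure through SMILES with RDKit (example input: aspirin).
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # SMILES -> molecule object
canonical = Chem.MolToSmiles(mol)                   # molecule -> canonical SMILES
print(canonical)  # one canonical string regardless of how the input was written
```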
The use of SMILES notation allows for a standardized way of representing chemical structures, which is crucial for machine learning models that require consistent input formats. By setting parameters such as the learning rate and dropout rate, the script ensures that the model can learn effectively without overfitting.
Loading a pre-trained BERT model is a standard NLP (natural language processing) approach that leverages transfer learning to reduce the computational resources needed, potentially improving the model's performance on chemical data. The “SmilesClassificationModel” class is tailored for handling the unique aspects of chemical data represented in SMILES format. By setting it up for regression with a single output label, the model focuses on predicting a specific property: the yield of a chemical reaction.
Training the model on one dataset and validating it on another helps in assessing the model's generalization capabilities. Saving the training output to a directory allows for easy access and analysis of the model's performance over time. The inclusion of metrics like training loss and R2 score provides insight into the model's predictive accuracy and how well it fits the data.
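A minimal sketch of this setup-and-train step is given below. The import path shown is one possibility (Schwaller et al.'s rxn_yields package); in the project notebooks the class may instead be defined inline. The checkpoint path is a placeholder, and the args values echo Table 5.

```python
# Sketch: fine-tune a pretrained reaction BERT for yield regression.
from sklearn.metrics import r2_score
# SmilesClassificationModel is the project's wrapper (a ClassificationModel
# subclass, as described above); the import path depends on the notebook layout.
from rxn_yields.core import SmilesClassificationModel

pretrained_path = "path/to/pretrained_rxn_bert"     # hypothetical checkpoint location
model = SmilesClassificationModel(
    "bert", pretrained_path,
    num_labels=1,                                   # single output: predicted yield
    args={"regression": True, "num_train_epochs": 10,
          "evaluate_during_training": True, "manual_seed": 42},
    use_cuda=True,
)

# train_df / val_df carry "text" (reaction SMILES) and "labels" (yield) columns;
# extra keyword metrics such as r2 are computed during evaluation.
model.train_model(train_df, eval_df=val_df, output_dir="outputs", r2=r2_score)
```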
Downloading and zipping the trained model's outputs is a practical step for model deployment or sharing; it ensures that the model can be easily transported and used in different environments or platforms. The final step of loading the trained model and using it to predict the outcomes of chemical reactions showcases the practical application of the script. Printing the predictions and raw outputs to the console offers immediate feedback on the model's behaviour on new data.
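Reloading the fine-tuned checkpoint and predicting on new reactions might look like the sketch below; the paths and the simplified example reaction are illustrative.

```python
# Sketch: load the fine-tuned model and predict yields for new reaction SMILES.
trained = SmilesClassificationModel(
    "bert", "outputs/best_model",                   # directory written during training
    num_labels=1, args={"regression": True},
)

rxns = ["Clc1ccccn1.Cc1ccc(N)cc1>>Cc1ccc(Nc2ccccn2)cc1"]  # simplified example
predictions, raw_outputs = trained.predict(rxns)
print(predictions)                                  # predicted yield(s) in percent
```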
The script embodies a comprehensive workflow for deep learning in chemical reaction prediction, from model setup and training to validation and performance tracking. It reflects a deep integration of deep-learning techniques with domain-specific knowledge, highlighting the potential of AI to revolutionize fields like chemistry and drug discovery.
All scripts are written in Jupyter Notebook format and executed on the Google Colab platform. Details of the training platform are given in Table 1 below:
Table 1: Platform Details
Platform | Google Colab |
---|---|
Training Machine | NVIDIA T4 GPU |
Programming Language | Python |
Framework | PyTorch |
Utility Library | simpletransformers, RDKit |
Model | BertForSequenceClassification |
Important Classes | SmilesClassificationModel, SmilesTokenizer |
For the base model, we utilized the “BertForSequenceClassification” large language model. We employed PyTorch as the deep learning framework. Additionally, we incorporated utility libraries such as “simpletransformers” and “RDKit” for fine-tuning the BERT model.
All scripts use two main classes, “SmilesTokenizer” and “SmilesClassificationModel”. SmilesTokenizer, used for tokenizing the reaction input, is derived from the “BertTokenizer” of the “transformers” library, and “SmilesClassificationModel” is the model class, derived from the “ClassificationModel” class of the “simpletransformers” classification library. A tokenizer sketch is given below.
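As an illustration, the sketch below tokenizes a simplified reaction SMILES with SmilesTokenizer. The import path assumes the rxnfp package named in the pipeline overview, and the vocabulary file path is a placeholder for the roughly 519-token vocabulary in the project repository.

```python
# Sketch: SMILES-aware tokenization of a reaction string.
from rxnfp.tokenization import SmilesTokenizer  # BertTokenizer subclass

tokenizer = SmilesTokenizer("vocab.txt")        # placeholder vocabulary path
tokens = tokenizer.tokenize(
    "ClC1=NC=CC=C1.Nc1ccc(C)cc1>>Cc1ccc(Nc2ccccn2)cc1"
)
print(tokens)  # multi-character tokens such as 'Cl' and '[C@@H]' stay intact
```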
Training is executed on T4 GPUs on the Google Colab platform (Python 3 Google Compute Engine backend (GPU)).
Table 2: Original Training Data Buchwald-Hartwig
Ligand | Additive | Base | Aryl halide | Output |
---|---|---|---|---|
CC(C)C(C=C(C(C)C)C=C1C(C)C)=C1C2=C(P([C@@]3(C[... | CC1=CC(C)=NO1 | CN(C)P(N(C)C)(N(C)C)=NP(N(C)C)(N(C)C)=NCC | ClC1=NC=CC=C1 | 70.410458 |
CC(C)C(C=C(C(C)C)C=C1C(C)C)=C1C2=C(P([C@@]3(C[... | O=C(OC)C1=CC=NO1 | CN(C)P(N(C)C)(N(C)C)=NP(N(C)C)(N(C)C)=NCC | BrC1=NC=CC=C1 | 11.064457 |
CC(C)C(C=C(C(C)C)C=C1C(C)C)=C1C2=C(P(C3CCCCC3)... | O=C(OC)C1=CC=NO1 | CN(C)P(N(C)C)(N(C)C)=NP(N(C)C)(N(C)C)=NCC | IC1=CC=C(CC)C=C1 | 10.223550 |
CC(C)C(C=C(C(C)C)C=C1C(C)C)=C1C2=C(P(C(C)(C)C)... | CCOC(C1=CON=C1)=O | CN1CCCN2C1=NCCC2 | ClC1=CC=C(C(F)(F)F)C=C1 | 20.083383 |
CC(C)C(C=C(C(C)C)C=C1C(C)C)=C1C2=C(P([C@@]3(C[... | CC1=CC(C)=NO1 | CN1CCCN2C1=NCCC2 | ClC1=CC=C(OC)C=C1 | 0.492663 |
Source: Ahneman et al. (2018)

The dataset contains information about ligands, additives, bases, aryl halides, and the corresponding reaction yield. Each row represents a combination of these components along with an associated yield value.
- Ligand: Chemical structure or identifier of the ligand used in the reaction.
- Additive: Chemical structure or identifier of the additive used in the reaction.
- Base: Chemical structure or identifier of the base used in the reaction.
- Aryl halide: Chemical structure or identifier of the aryl halide used in the reaction.
- Output: Chemical reaction Yield in percentage.
Table 3: Preprocessed Training Data Buchwald-Hartwig
Ligand | Additive | Base | Aryl halide | Output | rxn |
---|---|---|---|---|---|
CC(C)C(C=C(C(C)C)C=C1C(C)C)=C1C2=C(P([C@@]3(C[... | CC1=CC(C)=NO1 | CN(C)P(N(C)C)(N(C)C)=NP(N(C)C)(N(C)C)=NCC | ClC1=NC=CC=C1 | 70.410458 | Clc1ccccn1.Cc1ccc(N)cc1.O=S(=O)(O[Pd]1c2ccccc2... |
CC(C)C(C=C(C(C)C)C=C1C(C)C)=C1C2=C(P([C@@]3(C[... | O=C(OC)C1=CC=NO1 | CN(C)P(N(C)C)(N(C)C)=NP(N(C)C)(N(C)C)=NCC | BrC1=NC=CC=C1 | 11.064457 | Brc1ccccn1.Cc1ccc(N)cc1.O=S(=O)(O[Pd]1c2ccccc2... |
CC(C)C(C=C(C(C)C)C=C1C(C)C)=C1C2=C(P(C3CCCCC3)... | O=C(OC)C1=CC=NO1 | CN(C)P(N(C)C)(N(C)C)=NP(N(C)C)(N(C)C)=NCC | IC1=CC=C(CC)C=C1 | 10.223550 | CCc1ccc(I)cc1.Cc1ccc(N)cc1.O=S(=O)(O[Pd]1c2ccc... |
CC(C)C(C=C(C(C)C)C=C1C(C)C)=C1C2=C(P(C(C)(C)C)... | CCOC(C1=CON=C1)=O | CN1CCCN2C1=NCCC2 | ClC1=CC=C(C(F)(F)F)C=C1 | 20.083383 | FC(F)(F)c1ccc(Cl)cc1.Cc1ccc(N)cc1.O=S(=O)(O[Pd... |
CC(C)C(C=C(C(C)C)C=C1C(C)C)=C1C2=C(P([C@@]3(C[... | CC1=CC(C)=NO1 | CN1CCCN2C1=NCCC2 | ClC1=CC=C(OC)C=C1 | 0.492663 | COc1ccc(Cl)cc1.Cc1ccc(N)cc1.O=S(=O)(O[Pd]1c2cc... |
This is the preprocessed dataframe of the Table 2 data, in which an “rxn” column has been added by the “generate_buchwald_hartwig_rxns” function. Reaction SMILES are a compact textual representation of chemical reactions. The function defines a forward template, “fwd_template”, representing the Buchwald-Hartwig reaction pattern in SMARTS (SMILES Arbitrary Target Specification) notation. This template describes a general reaction pattern where an aryl halide reacts with an amine in the presence of a palladium catalyst to form a new bond. [10]
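The sketch below shows the general idea of such a forward template with RDKit; the simplified SMARTS pattern and the two input molecules are illustrative and are not the exact “fwd_template” used in the project.

```python
# Sketch: apply a simplified Buchwald-Hartwig forward template to build a
# reaction SMILES (the real fwd_template in the project is more elaborate).
from rdkit import Chem
from rdkit.Chem import rdChemReactions

# "aryl halide + primary arylamine -> diarylamine" (simplified)
fwd_template = "[Cl,Br,I]-[c:1].[NH2:2]-[c:3]>>[c:1]-[NH:2]-[c:3]"
rxn = rdChemReactions.ReactionFromSmarts(fwd_template)

aryl_halide = Chem.MolFromSmiles("Clc1ccccn1")   # 2-chloropyridine
amine = Chem.MolFromSmiles("Nc1ccc(C)cc1")       # p-toluidine

product = rxn.RunReactants((aryl_halide, amine))[0][0]
Chem.SanitizeMol(product)                        # products may need sanitization
rxn_smiles = f"Clc1ccccn1.Cc1ccc(N)cc1>>{Chem.MolToSmiles(product)}"
print(rxn_smiles)
```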
Table 4: Example Smiles Tokens
[UNK] | [CLS] | [SEP] | [MASK] | c | C | ( | ) |
---|---|---|---|---|---|---|---|
O | 1 | 2 | = | N | . | n | 3 |
F | Cl | >> | ~ | - | 4 | [C@H] | S |
[C@@H] | [O-] | Br | # | / | [nH] | [N+] | s |
5 | O | P | [Na+] | [Si] | I | [Na] | [Pd] |
[K+] | [K] | [P] | B | [C@] | [C@@] | [Cl-] | 6 |
[OH-] | \ | [N-] | [Li] | [H] | [2H] | [NH4+] | [c-] |
[P-] | [Cs+] | [Li+] | [Cs] | [NaH] | [H-] | [O+] | [BH4-] |
Source: Ahneman et al. (2018)
This is an example table of tokens for the BERT model with a regression head. Some tokens are shown here, but there are a total of 519 tokens in the project repository. These tokens are converted into numbers for modelling purposes. During model training, these numeric tokens are fed into the BERT model, which learns to associate a set of tokens with the regression output; a short sketch follows.
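Continuing the tokenizer sketch from the training-method section, converting token strings to their integer ids might look like this; the printed ids are illustrative, since the actual values depend on the vocabulary file.

```python
# Sketch: mapping SMILES tokens to the integer ids the BERT model consumes.
ids = tokenizer.convert_tokens_to_ids(["[CLS]", "Cl", "c", "1", "[SEP]"])
print(ids)  # e.g. [12, 17, 16, 9, 13] -- exact values depend on vocab.txt
```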
Table-5 Training Parameters: Buchwald-Hartwig Training
Parameter | Value |
---|---|
num_train_epochs | 10 |
overwrite_output_dir | True |
gradient_accumulation_steps | 1 |
Regression | True |
num_labels | 1 |
fp16 | False |
evaluate_during_training | True |
manual_seed | 42 |
max_seq_length | 300 |
train_batch_size | 16 |
warmup_ratio | 0.00 |
hidden_dropout_prob | 0.7987 |
These are the training parameters for the BERT (Bidirectional Encoder Representations from Transformers) sequence-classification model, in which regression is set to True for the regression dataset. These values can be changed in the script given in the references [1]. This model is trained on the Buchwald-Hartwig dataset; the parameters are written out as an args dictionary in the sketch below.
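For reference, Table 5 expressed as a simpletransformers-style args dictionary might look like the sketch below. Note that in simpletransformers itself, num_labels is normally a constructor argument and the dropout probability is usually passed through a nested "config" entry; both points are assumptions about the project's exact script.

```python
# Sketch: Table 5 hyperparameters as a training-args dict.
train_args = {
    "num_train_epochs": 10,
    "overwrite_output_dir": True,
    "gradient_accumulation_steps": 1,
    "regression": True,
    "fp16": False,
    "evaluate_during_training": True,
    "manual_seed": 42,
    "max_seq_length": 300,
    "train_batch_size": 16,
    "warmup_ratio": 0.00,
    "config": {"hidden_dropout_prob": 0.7987},  # forwarded to the BERT config
}
```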
Table-6 Training Parameters: Suzuki-Miyaura Training
Parameter | Value |
---|---|
num_train_epochs | 15 |
overwrite_output_dir | True |
gradient_accumulation_steps | 1 |
Regression | True |
num_labels | 1 |
fp16 | False |
evaluate_during_training | True |
manual_seed | 42 |
max_seq_length | 300 |
train_batch_size | 16 |
warmup_ratio | 0.00 |
hidden_dropout_prob | 0.7987 |
These are the training parameters for the BERT (Bidirectional Encoder Representations from Transformers) sequence-classification model, in which regression is set to True for the regression dataset. These values can be changed in the script given in the references. This model is trained on the Suzuki-Miyaura dataset. [1]
Test 3 on USPTO Dataset
Table-7 Training Parameters: USPTO Training
Parameter | Value |
---|---|
num_train_epochs | 6 |
overwrite_output_dir | True |
gradient_accumulation_steps | 1 |
Regression | True |
num_labels | 1 |
fp16 | False |
evaluate_during_training | True |
manual_seed | 42 |
max_seq_length | 300 |
train_batch_size | 64 |
warmup_ratio | 0.00 |
hidden_dropout_prob | 0.7987 |
These are the training parameters for the BERT (Bidirectional Encoder Representations from Transformers) sequence-classification model, in which regression is set to True for the regression dataset. These values can be changed in the script given in the references [1]. This model is trained on the USPTO dataset.
Table-8 Buchwald-Hartwig Test Predictions
Reaction | Predicted Yield (%) | True Yield (%) |
---|---|---|
Brc1ccccn1.Cc1ccc(N)cc1.O=S(=O)(O[Pd]1c2ccccc2-c2ccccc2N~1)C(F)(F)F.CC(C)c1cc(C(C)C)c(-c2ccccc2P(C2CCCCC2)C2CCCCC2)c(C(C)C)c1.CCN=P(N=P(N(C)C)(N(C)C)N(C)C)(N(C)C)N(C)C.CCOC(=O)c1cc(C)no1>>Cc1ccc(Nc2ccccn2)cc1 | 48.2 | 37.7 |
Clc1ccccn1.Cc1ccc(N)cc1.O=S(=O)(O[Pd]1c2ccccc2-c2ccccc2N~1)C(F)(F)F.CC(C)c1cc(C(C)C)c(-c2ccccc2P(C(C)(C)C)C(C)(C)C)c(C(C)C)c1.CCN=P(N=P(N(C)C)(N(C)C)N(C)C)(N(C)C)N(C)C.c1ccc(-c2cnoc2)cc1>>Cc1ccc(Nc2ccccn2)cc1 | 37.7 | 31.3 |
CCc1ccc(Br)cc1.Cc1ccc(N)cc1.O=S(=O)(O[Pd]1c2ccccc2-c2ccccc2N~1)C(F)(F)F.COc1ccc(OC)c(P(C(C)(C)C)C(C)(C)C)c1-c1c(C(C)C)cc(C(C)C)cc1C(C)C.CCN=P(N=P(N(C)C)(N(C)C)N(C)C)(N(C)C)N(C)C.c1ccc(CN(Cc2ccccc2)c2ccno2)cc1>>CCc1ccc(Nc2ccc(C)cc2)cc1 | 59.0 | 60.6 |
CCc1ccc(Cl)cc1.Cc1ccc(N)cc1.O=S(=O)(O[Pd]1c2ccccc2-c2ccccc2N~1)C(F)(F)F.CC(C)c1cc(C(C)C)c(-c2ccccc2P(C2CCCCC2)C2CCCCC2)c(C(C)C)c1.CN1CCCN2CCCN=C12.c1ccc(-c2ccno2)cc1>>CCc1ccc(Nc2ccc(C)cc2)cc1 | 2.5 | 3.9 |
Brc1cccnc1.Cc1ccc(N)cc1.O=S(=O)(O[Pd]1c2ccccc2-c2ccccc2N~1)C(F)(F)F.COc1ccc(OC)c(P(C(C)(C)C)C(C)(C)C)c1-c1c(C(C)C)cc(C(C)C)cc1C(C)C.CN1CCCN2CCCN=C12.CCOC(=O)c1cc(C)on1>>Cc1ccc(Nc2cccnc2)cc1 | 89.9 | 86.4 |
This table shows how our first model predicts Buchwald-Hartwig reaction yields. The BERT model predicts the reaction yield and gives approximate results; as can be seen, there is not much difference between the true and predicted yields. The R2 value of this model is 0.9485, which is a good value for most LLM benchmarks. This shows that the SMILES nomenclature is adapted very well by BERT (Bidirectional Encoder Representations from Transformers). [1]
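As a quick illustration of how the R2 metric behaves, the snippet below scores only the five rows shown in Table 8; the reported 0.9485 was computed on the full test set, so the value printed here will differ.

```python
# Illustrative R2 on the five (predicted, true) yield pairs from Table 8 only.
from sklearn.metrics import r2_score

predicted = [48.2, 37.7, 59.0, 2.5, 89.9]
true_vals = [37.7, 31.3, 60.6, 3.9, 86.4]
print(r2_score(true_vals, predicted))  # values near 1 indicate close agreement
```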
Table-9 Suzuki-Miyaura Test Predictions
Reaction | Predicted Yield (%) | True Yield (%) |
---|---|---|
C1CCC(P(C2CCCCC2)C2CCCCC2)CC1.CC(=O)O O (cnn2C2CCCCO2)c1B1OC(C)(C)C(C)(C)O1.Ic1ccc2ncccc2c1.O>>Cc1ccc2c(cnn2C2CCCCO2)c1-c1ccc2ncccc2c1 | 0.7 | 0.7 |
CC#N.CC(=O)O (CC)c1.Cc1ccc2c(cnn2C2CCCCO2)c1[B-](F)(F)F.O.O=S(=O)(Oc1ccc2ncccc2c1)C(F)(F)F.[K+].c1ccc(P(c2ccccc2)c2ccccc2)cc1>>Cc1ccc2c(cnn2C2CCCCO2)c1-c1ccc2ncccc2c1 | 0.1 | 0.2 |
CC(=O)O CCc1cccc(CC)c1.CO.Cc1ccc2c(cnn2C2CCCCO2)c1B1OC(C)(C)C(C)(C)O1.O.O=S(=O)(Oc1ccc2ncccc2c1)C(F)(F)F.[K+].[OH-]>>Cc1ccc2c(cnn2C2CCCCO2)c1-c1ccc2ncccc2c1 | 0.3 | 0.2 |
CC(=O)O 2C2CCCCO2)c1B1OC(C)(C)C(C)(C)O1.Ic1ccc2ncccc2c1.O.O=P([O-])([O-])[O-].[Fe+2].[K+].[K+].[K+].c1ccc(P(c2ccccc2)[c-]2cccc2)cc1.c1ccc(P(c2ccccc2)[c-]2cccc2)cc1>>Cc1ccc2c(cnn2C2CCCCO2)c1-c1ccc2ncccc2c1 | 0.8 | 0.8 |
Brc1ccc2ncccc2c1.CC(=O)O CH]1)C(C)(C)C.CC(C)(C)P(C1=CC=C[CH]1)C(C)(C)C.CCc1cccc(CC)c1.CO.Cc1ccc2c(cnn2C2CCCCO2)c1B1OC(C)(C)C(C)(C)O1.O.[Fe]>>Cc1ccc2c(cnn2C2CCCCO2)c1-c1ccc2ncccc2c1 | 0.1 | 0.2 |
This table shows how our second model predicts Suzuki-Miyaura reaction yields. The BERT model predicts the reaction yield and gives approximate results; as can be seen, there is not much difference between the true and predicted yields. The R2 value of this model is 0.8310, which is lower than the previous value. This difference is due to the precision of the data points, not the chemical reactions themselves. [1]
- Limited Training Data Scope: The model may struggle to generalize beyond coupling reactions.
- Chemical Properties Neglected: Important factors like steric hindrance and electronic effects are not considered.
- Limited Resources: Scarcity of research and data impedes model development.
- NLP Model Usage: NLP models lack specialized chemical knowledge.
- Data Scarcity: Insufficient datasets hinder model performance and generalization.
- Strong predictive performance using BERT on HTR datasets, with R2 values of 0.9485 and 0.8310.
- Simple model generalization to new reactions.
- Insights into influential factors through BERT's attention mechanism.
- Evaluation of SMILES effectiveness compared to other representations.
- Demonstrated need for chemical informatics in reaction prediction.
- Potential applications in reaction optimization and drug discovery.
- Explore Advanced NLP Models: Look into models like GPT for improved prediction.
- Integrate Domain Knowledge: Combine chemical principles to enhance model understanding.
- Hybridize Techniques: Combine NLP with other ML methods for better accuracy.
- Expand Applications: Apply NLP models in drug discovery and materials science for faster innovation.
- Perera, D., Tucker, J. W., Brahmbhatt, S., Helal, C. J., Chong, A., Farrell, W., Sach, N. W. (2018). A platform for automated nanomole-scale reaction screening and micromole-scale synthesis in flow. Science, 359(6374), 429–434. doi:10.1126/science.aap9112
- Ahneman, D. T., Estrada, J. G., Lin, S., Dreher, S. D., & Doyle, A. G. (2018). Predicting reaction performance in C–N cross-coupling using machine learning. Science, 360(6385), 186–190. doi:10.1126/science.aar5169
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805v2 [cs.CL]
- Schwaller P, Vaucher AC, Laino T, Reymond J-L. Data augmentation strategies to improve reaction yield predictions and estimate uncertainty. ChemRxiv. 2020; doi:10.26434/chemrxiv.13286741.v1
- Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Model. 28, 31–36 (1988).
- Wolf, T. et al. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv:1910.03771 (2019).
- Landrum, G. et al. rdkit/rdkit: 2019_03_4 (Q1 2019) release (2019). https://doi.org/10.5281/zenodo.3366468
- AI Index Steering Committee. (2024). Artificial Intelligence Index Report 2024. Stanford University. Retrieved from Stanford HAI.
- Sandfort, F., Strieth-Kalthoff, F., Kühnemund, M., Beecks, C., & Glorius, F. (2020). A Structure-Based Platform for Predicting Chemical Reactivity. *Chem*, 6(6), 1379-1390. https://doi.org/10.1016/j.chempr.2020.02.017