Machine Learning for Regression on a Quantum Mechanical Property

Predicting Molecular Properties


This project predicts the scalar coupling constant of atom pairs in organic molecules from tabular data by ensembling gradient-boosted trees (XGB) and deep neural networks (DNN) in a separate-model-based meta-architecture. It uses data from the Kaggle competition champs-scalar-coupling.


🧐 About

Molecules are represented with distance matrices and additional generated angle data to predict a quantum mechanical property accurately. XGB and DNNs were found to have comparable accuracy (with XGB generally better), and ensembling these methods with a strongly separated configuration gave satisfactory results. This repository contains all the code needed to replicate our results and can be adapted to different methods or datasets.
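
To illustrate the kind of geometric features described above, here is a minimal sketch of a pairwise distance matrix and a three-atom angle computed with NumPy. The function names and the toy coordinates are assumptions made for illustration; they are not taken from this repository's code or from the competition data.

```python
import numpy as np


def distance_matrix(coords: np.ndarray) -> np.ndarray:
    """Return the (n_atoms, n_atoms) matrix of pairwise Euclidean distances."""
    # Broadcasting: diff[i, j] is the vector from atom j to atom i.
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def bond_angle(coords: np.ndarray, i: int, j: int, k: int) -> float:
    """Return the angle in degrees at atom j formed by atoms i-j-k."""
    v1 = coords[i] - coords[j]
    v2 = coords[k] - coords[j]
    cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))


# Hypothetical toy geometry (three atoms), not taken from the dataset.
coords = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
])

dist = distance_matrix(coords)
angle = bond_angle(coords, 1, 0, 2)  # angle at atom 0 between atoms 1 and 2
```

The distance matrix is symmetric with a zero diagonal, so in a tabular setting only the entries relevant to each atom pair need to be kept as feature columns.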

🏁 Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

All requirements are listed in the requirements.txt file. To set up the project, run the following commands:

sudo apt-get install python3.7
sudo apt-get install python3-pip
git clone https://github.com/teamtoll/predicting-molecular-properties.git
cd predicting-molecular-properties
python -m pip install -r requirements.txt

Kaggle API setup: https://github.com/Kaggle/kaggle-api.

Installing

Kaggle Download:

Downloads and extracts all necessary data source files from the Kaggle competition and organizes them into a data_sources directory, ready to use.

cd utils
python kaggle_download.py

Follow any instructions given as output in case of missing files or directories.

Generated files can be downloaded from the following link and should be placed within ./input: https://drive.google.com/file/d/1JN35qpWmMxRAXO28XfLr42ALx1w0Gcia/view?usp=sharing

File Structure

The hierarchy should look like this:

.
├── input
│     ├── features
│     │    └── ...
│     ├── generated
│     │    └── ...
│     └── zipped_source
│          └── ...
├── models
│     ├── nn
│     │    ├── nn_model_1JHC.hdf5
│     │    └── ...
│     └── xgb
│          ├── xgb_model_1JHC.hdf5
│          └── ...
├── notebooks
│     └── main.ipynb
├── submissions
│     └── submission_best.csv
├── utils
│     ├── other
│     │    ├── distance_matrix.py
│     │    └── ...
│     ├── check_repository.py
│     └── ...
│
├── .gitignore
├── LICENSE
├── README.md
└── requirements.txt
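
A quick way to confirm that a local checkout matches this hierarchy is to test for the expected directories. The sketch below is only an illustration built from the directory names in the tree above; it is not the repository's utils/check_repository.py, whose contents are not shown here, and the names `EXPECTED_DIRS` and `missing_directories` are hypothetical.

```python
from pathlib import Path

# Directory names copied from the tree above; adjust if your layout differs.
EXPECTED_DIRS = [
    "input/features",
    "input/generated",
    "input/zipped_source",
    "models/nn",
    "models/xgb",
    "notebooks",
    "submissions",
    "utils",
]


def missing_directories(root: Path) -> list:
    """Return the expected directories that do not exist under root."""
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    # Run from the repository root.
    missing = missing_directories(Path("."))
    for d in missing:
        print(f"Missing directory: {d}")
    if not missing:
        print("All expected directories are present.")
```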

🎈 Usage

Run the notebook notebooks/main.ipynb, tweak the hyperparameters, change the data, and see where it goes. This repository can also be used as a basis for a completely different problem and dataset.
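
When experimenting with ensembling, one simple baseline is a weighted average of the XGB and DNN predictions, with the weight chosen on a validation set. The sketch below shows that generic technique on synthetic data; it is an assumption for illustration and not necessarily the ensembling configuration used in this repository. All variable names and the noise levels are hypothetical.

```python
import numpy as np


def mean_absolute_error(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error between targets and predictions."""
    return float(np.mean(np.abs(y_true - y_pred)))


def best_blend_weight(y_true, pred_xgb, pred_dnn, n_steps=101):
    """Grid-search the XGB weight w in [0, 1] that minimizes the validation
    MAE of the blend w * pred_xgb + (1 - w) * pred_dnn."""
    weights = np.linspace(0.0, 1.0, n_steps)
    errors = [
        mean_absolute_error(y_true, w * pred_xgb + (1.0 - w) * pred_dnn)
        for w in weights
    ]
    best = int(np.argmin(errors))
    return float(weights[best]), float(errors[best])


# Synthetic stand-in data (not from the competition), for illustration only.
rng = np.random.default_rng(0)
y_val = rng.normal(size=1000)
pred_xgb = y_val + rng.normal(scale=0.10, size=1000)  # hypothetical XGB output
pred_dnn = y_val + rng.normal(scale=0.15, size=1000)  # hypothetical DNN output

w, blend_mae = best_blend_weight(y_val, pred_xgb, pred_dnn)
print(f"XGB weight: {w:.2f}, blended validation MAE: {blend_mae:.4f}")
```

Because the grid includes the endpoints 0 and 1, the blended validation error can never be worse than the better of the two individual models on the same validation set. Whether the gain carries over to test data still has to be checked separately.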

⛏️ Built Using

✍️ Authors

πŸŽ‰ Acknowledgements