This project predicts the scalar coupling constant of atom pairs in organic molecules from tabular data by ensembling gradient boosted trees (XGB) and deep neural networks (DNN) in a meta-architecture built from separate models. The project uses data from the Kaggle competition champs-scalar-coupling.
Molecules are represented with distance matrices and additional generated angle data, which are used to predict a quantum mechanical property accurately. XGB and DNNs were found to have comparable accuracy (with XGB generally better), and ensembling these methods in a strongly separated configuration gave satisfactory results. This repository contains all code needed to replicate our results and can be modified for different methods or datasets.
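To make the molecular representation concrete, here is a minimal sketch of how a distance matrix and a bond angle can be computed from atomic coordinates. It is an illustration only and not the code in utils/other/distance_matrix.py: the function names `distance_matrix` and `bond_angle` and the toy coordinates are assumptions made for this example.

```python
import math


def _norm(vec):
    """Euclidean length of a vector given as a sequence of floats."""
    return math.sqrt(sum(component * component for component in vec))


def distance_matrix(coords):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    n = len(coords)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            diff = [p - q for p, q in zip(coords[i], coords[j])]
            dist[i][j] = dist[j][i] = _norm(diff)
    return dist


def bond_angle(a, b, c):
    """Return the angle in degrees at atom b formed by atoms a and c."""
    ba = [p - q for p, q in zip(a, b)]
    bc = [p - q for p, q in zip(c, b)]
    cos_theta = sum(x * y for x, y in zip(ba, bc)) / (_norm(ba) * _norm(bc))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


# Toy geometry (hypothetical coordinates, not taken from the dataset):
# a central atom at the origin with two neighbours on the x and y axes.
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

distances = distance_matrix(atoms)
angle = bond_angle(atoms[1], atoms[0], atoms[2])
```

Rows of the distance matrix and angles like this one can then be flattened into tabular features for each atom pair.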
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
All requirements are listed in the requirements.txt file. To install them, run the following commands:
sudo apt-get install python3.7
sudo apt-get install python3-pip
git clone https://github.com/teamtoll/predicting-molecular-properties.git
cd predicting-molecular-properties
python -m pip install -r requirements.txt
Kaggle API setup: https://github.com/Kaggle/kaggle-api.
Kaggle Download:
The script downloads and extracts all necessary data source files from the Kaggle competition and organizes them into a data_sources directory, ready to use.
cd utils
python kaggle_download.py
Follow any instructions given as output in case of missing files or directories.
Generated files can be downloaded from the following link (place them within ./input): https://drive.google.com/file/d/1JN35qpWmMxRAXO28XfLr42ALx1w0Gcia/view?usp=sharing
The hierarchy should look like this:
.
├── input
│   ├── features
│   │   └── ...
│   ├── generated
│   │   └── ...
│   └── zipped_source
│       └── ...
├── models
│   ├── nn
│   │   ├── nn_model_1JHC.hdf5
│   │   └── ...
│   └── xgb
│       ├── xgb_model_1JHC.hdf5
│       └── ...
├── notebooks
│   └── main.ipynb
├── submissions
│   └── submission_best.csv
├── utils
│   ├── other
│   │   ├── distance_matrix.py
│   │   └── ...
│   ├── check_repository.py
│   └── ...
│
├── .gitignore
├── LICENSE
├── README.md
└── requirements.txt
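A quick way to confirm that the hierarchy above is in place is a small check script. The sketch below is not the repository's utils/check_repository.py; the `EXPECTED_DIRS` list and the `missing_directories` helper are assumptions based only on the directory names shown in the tree.

```python
from pathlib import Path

# Directories taken from the hierarchy above, relative to the repository root.
EXPECTED_DIRS = [
    "input/features",
    "input/generated",
    "input/zipped_source",
    "models/nn",
    "models/xgb",
    "notebooks",
    "submissions",
    "utils/other",
]


def missing_directories(root):
    """Return the expected directories that do not exist under root."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_directories(".")
    if missing:
        print("Missing directories:")
        for name in missing:
            print("  " + name)
    else:
        print("All expected directories are present.")
```

Run it from the repository root; any directory it lists should be created or populated before opening the notebook.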
Run the notebook notebooks/main.ipynb, tweak hyper-parameters, change up the data, see where it goes. This repository can also be used as a basis for a completely different problem and dataset.
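One thing worth experimenting with is how the XGB and DNN predictions are combined. The sketch below shows a plain weighted average with the weight chosen on validation data. This is an assumed, simplified blending scheme for illustration, not necessarily the ensembling configuration used in the notebook; the names `blend`, `mean_absolute_error` and `best_weight` and the toy numbers are hypothetical.

```python
def blend(xgb_preds, dnn_preds, xgb_weight=0.5):
    """Weighted average of two prediction lists of equal length."""
    if len(xgb_preds) != len(dnn_preds):
        raise ValueError("prediction lists must have the same length")
    if not 0.0 <= xgb_weight <= 1.0:
        raise ValueError("xgb_weight must lie in [0, 1]")
    return [
        xgb_weight * x + (1.0 - xgb_weight) * d
        for x, d in zip(xgb_preds, dnn_preds)
    ]


def mean_absolute_error(y_true, y_pred):
    """Mean absolute error between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def best_weight(y_true, xgb_preds, dnn_preds, steps=10):
    """Grid-search the XGB weight that minimises validation MAE."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(
        candidates,
        key=lambda w: mean_absolute_error(y_true, blend(xgb_preds, dnn_preds, w)),
    )


# Toy validation data (made up for the example).
y_val = [1.0, 2.0, 3.0]
xgb_val = [1.0, 2.0, 3.0]
dnn_val = [2.0, 3.0, 4.0]

weight = best_weight(y_val, xgb_val, dnn_val)
blended = blend(xgb_val, dnn_val, weight)
```

Since the models directory holds one file per coupling type (for example 1JHC), a weight like this could be tuned separately for each type.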
- Lars Sandberg @Sandbergo
- Fredrik Bakken @FredrikBakken
- Hallvar Gisnås @hallvagi
- Lars Aurdal @larsaurdal
- Dennis Christensen @dennis-christensen
- Niels Aase
- Kyle Lobo @kylelobo