/gc-ms_bart

SpecTRUM: Spectral Translator for the Reconstruction of Unknown Molecules

This project trains a Transformer model to tackle the task of de novo GC-MS spectra analysis.

Environment setting

The conda environment files are in the env_specification folder. BARTtrainH100 is the main environment, used for data preprocessing, training, and evaluation. NEIMSpy3_environment is used only for NEIMS spectra generation; a separate environment was necessary because of package incompatibilities.

Data preprocessing

Because of size constraints and licensing, we cannot provide the datasets we used for training. However, we provide the scripts used to obtain, filter, and preprocess the ZINC SMILES dataset, as well as all the preprocessing scripts for the NIST GC-MS dataset.

Every dataset in the data/datasets folder has a README file that describes the dataset in more detail and explains how it was obtained.

Pretraining & Finetuning

Pretraining and finetuning can be conducted using the train_bart.py script. The script needs a couple of arguments to run, most importantly the config_file, which is a YAML file that contains all the necessary hyperparameters for the training.

All the run scripts we used for our experiments are in the run_scripts folder and don't need any additional parameters. The scripts are named run_pretrain* and run_finetune*. Their corresponding config files are in the configs folder, again named train_config_pretrain* and train_config_finetune*.
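The naming scheme above can be used to find matching run scripts and config files programmatically. The following is a minimal sketch, not part of the repository: the helper group_by_stage and the example file names are hypothetical, and only the glob patterns come from the description above.

```python
from fnmatch import fnmatch

# Glob patterns taken from the naming scheme described above.
PATTERNS = {
    "pretrain": ("run_pretrain*", "train_config_pretrain*"),
    "finetune": ("run_finetune*", "train_config_finetune*"),
}


def group_by_stage(run_scripts, configs):
    """Group run-script and config file names by training stage."""
    grouped = {}
    for stage, (run_pattern, config_pattern) in PATTERNS.items():
        grouped[stage] = {
            "run_scripts": sorted(n for n in run_scripts if fnmatch(n, run_pattern)),
            "configs": sorted(n for n in configs if fnmatch(n, config_pattern)),
        }
    return grouped


# Hypothetical file names, for illustration only.
example = group_by_stage(
    ["run_pretrain_example.sh", "run_finetune_example.sh", "run_predict.sh"],
    ["train_config_pretrain_example.yaml", "train_config_finetune_example.yaml"],
)
print(example["pretrain"])
```

In a checkout of the repository, the two lists could be built from the contents of the run_scripts and configs folders, for example with os.listdir.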

Prediction & Evaluation

Prediction and evaluation are two separate steps. Depending on the hardware, prediction on the NIST valid/test splits takes at least 4 hours and can take far longer. Once you have the predictions, you can run multiple evaluation runs, each taking around a minute.

The prediction script, predict.py, has its runner in the run_scripts folder (run_predict.sh) and corresponding config files in the configs folder (predict_config*). The evaluation script, evaluate_predicitons.py, also has its runner in the run_scripts folder (run_eval.sh) and corresponding config files in the configs folder (eval_config*).
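The two-step workflow can be sketched as a small driver that runs prediction first and evaluation second. This is an illustrative sketch under assumptions: the helpers pipeline_commands and run_pipeline are hypothetical, and it assumes the runners are invoked with bash from the repository root without extra arguments. Only the script and folder names come from the text above.

```python
import subprocess
from pathlib import Path

RUN_SCRIPTS = Path("run_scripts")  # folder named in the text above


def pipeline_commands():
    """Return the prediction command followed by the evaluation command."""
    # Assumption: both runners are plain shell scripts started with bash.
    predict = ["bash", (RUN_SCRIPTS / "run_predict.sh").as_posix()]
    evaluate = ["bash", (RUN_SCRIPTS / "run_eval.sh").as_posix()]
    return [predict, evaluate]


def run_pipeline(dry_run=True):
    """Print the commands (dry run) or execute them in order.

    With dry_run=False, check=True stops the pipeline if prediction fails,
    so evaluation never runs on missing predictions.
    """
    for cmd in pipeline_commands():
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


# Dry run: only prints the two commands, nothing is executed.
run_pipeline(dry_run=True)
```

Because prediction is the slow step, it makes sense to run it once and then rerun only the evaluation command when trying different evaluation settings.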

------------------------- Other folders ------------------------

predictions

The predictions computed by our models are in the predictions folder. Along with the predictions, each folder contains a log_file.yaml with all the evaluation results (sometimes from multiple evaluation runs with different settings) and the figures generated by the latest evaluation.

tokenizer

The tokenizer folder contains all the different tokenizers used during the experiments and the final training. It also contains the training data for the BBPE tokenizers.

bart_spektro

This folder contains the custom implementation of the BART model used for the experiments. The implementation is based on the transformers library and is a modification of the BartForConditionalGeneration class.

notebooks

This folder contains a lot of things. Some of them are useful and nice, some of them you better not look at. I leave it in the repository as a memento of the hard work and the struggle we went through.

That's it. :)