MS2-Autoencoder

MS2 Autoencoder is built on Keras for Python. Its purpose is to create a generalized model of MS2 spectra so that low-quality spectra can be upscaled to high-quality spectra (with quality based on precursor intensity). The most direct application of this tool is denoising spectra.

Tools

Miniconda or Anaconda
NextFlow

Imports

pyteomics
h5py
keras (autoencoder tutorial)
tensorflow (tensorflow-gpu or tensorflow*)
- *tensorflow-gpu worked on version 1.14 with cudnn version 10.0

Structure

Extract MS2 data from mzXML/mzML files
Stitch all extracted data files (.npz) into HDF5 file (.hdf5)
Train autoencoder, deep autoencoder, convolutional neural network,... variational autoencoder, LSTM
Evaluate and predict test data on models
Achieve spectra upscaling/denoising

1. Extract mzxml

In MS2-Autoencoder/bin/main.py, extract_mzxml is imported as em
The else branch in main.py contains the entire top-to-bottom flow of mzXML data extraction
Run this step on the cluster with nohup and NextFlow to gather all of the data
The Makefile includes targets (instructions) for NextFlow to run main.py on all QExactive data on GNPS (Nov/2019)
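
Because quality is defined by precursor intensity, one plausible way to turn extracted scans into low/high quality pairs is to group MS2 scans by precursor m/z and treat the scan with the highest precursor intensity in each group as the target. The sketch below illustrates that idea only; it is not the code in extract_mzxml, and the dict keys, the rounding-based grouping, and the function name pair_by_precursor are assumptions made for illustration.

```python
from collections import defaultdict


def pair_by_precursor(scans, decimals=2):
    """Pair each lower-quality scan with the best scan of the same precursor.

    `scans` is a list of dicts with the hypothetical keys 'scan_id',
    'precursor_mz' and 'precursor_intensity'. Quality is taken to be the
    precursor intensity.
    """
    groups = defaultdict(list)
    for scan in scans:
        # Scans whose precursor m/z rounds to the same value share a group.
        groups[round(scan["precursor_mz"], decimals)].append(scan)

    pairs = []
    for group in groups.values():
        if len(group) < 2:
            continue  # a lone scan has nothing to be paired with
        best = max(group, key=lambda s: s["precursor_intensity"])
        for scan in group:
            if scan is not best:
                # (low-quality scan id, high-quality scan id)
                pairs.append((scan["scan_id"], best["scan_id"]))
    return pairs


scans = [
    {"scan_id": 1, "precursor_mz": 500.251, "precursor_intensity": 1.0e4},
    {"scan_id": 2, "precursor_mz": 500.252, "precursor_intensity": 8.0e5},
    {"scan_id": 3, "precursor_mz": 500.249, "precursor_intensity": 3.0e4},
    {"scan_id": 4, "precursor_mz": 733.410, "precursor_intensity": 2.0e5},
]
pairs = pair_by_precursor(scans)
print(pairs)
```

Grouping by rounded m/z is deliberately crude: two precursors that straddle a rounding boundary land in different groups, so a real pipeline would more likely use an explicit m/z tolerance, and possibly a retention-time window as well.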

2. Stitch .npz into .hdf5

Use SCP to transfer the extracted outdirs from the cluster to the local machine (it is advised to remove the .json files from each outdir)
- only ready_array2.npz (or another .npz file) is needed for stitching
In MS2-Autoencoder/bin/processing.py, concat_hdf5 is imported as ch5
Specify the path to the parent directory of all outdirs and the name of the data file ('ready_array2.npz')
processing.py concatenates all .npz files and outputs two .hdf5 files:
1. Autoencoder structured dataset
2. Convolutional neural network 1D structured dataset
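
The difference between the two output datasets can be pictured with a small NumPy sketch. It assumes, without confirmation from concat_hdf5, that the autoencoder dataset is a flat (n_spectra, n_bins) array and that the 1D convolutional dataset adds a trailing channel axis, since Keras Conv1D layers expect input shaped (batch, steps, channels). The function name stitch_arrays and the array sizes are hypothetical.

```python
import numpy as np


def stitch_arrays(arrays):
    """Concatenate per-directory arrays along the spectrum axis.

    Returns the data in two layouts: a flat (n_spectra, n_bins) array for
    dense autoencoders and a (n_spectra, n_bins, 1) array for 1D
    convolutional models.
    """
    dense = np.concatenate(arrays, axis=0)
    conv1d = dense[..., np.newaxis]  # add a single channel axis
    return dense, conv1d


# Stand-ins for arrays loaded from two outdirs (shapes are illustrative).
part_a = np.zeros((3, 2000), dtype=np.float32)
part_b = np.ones((5, 2000), dtype=np.float32)
dense, conv1d = stitch_arrays([part_a, part_b])
print(dense.shape, conv1d.shape)
```

In a real run each array would come from np.load on a ready_array2.npz file, and the two results would be written to .hdf5 with h5py; both steps are left out here to keep the sketch free of file I/O.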

3. Train models

Model architectures are outlined in ms2-autoencoder.py, ms2-conv1d.py, ms2-deepautoencoder.py
Generators, training, evaluating, predicting, and all model architectures are in ms2_model.py
train_models.py imports ms2_model
Trained models are saved as .h5 files with architecture and weights
The model training function is built on tensorflow-gpu, with GPU memory allocation and session declaration
Model training can be done on a local or cluster machine
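
Training from an .hdf5 file usually goes through a generator so that the whole dataset never has to sit in memory. The sketch below shows the general shape of such a generator, yielding (low-quality, high-quality) batches. It is not the generator from ms2_model.py; the name batch_generator and the variables low and high are assumptions for illustration.

```python
import numpy as np


def batch_generator(low, high, batch_size):
    """Yield (input, target) batches in an endless loop.

    `low` and `high` are array-likes of equal length holding the
    low-quality inputs and high-quality targets. Only slicing is used, so
    the same code works on NumPy arrays and on h5py datasets, where each
    slice reads just one batch from disk.
    """
    n = len(low)
    while True:  # loop forever; the caller decides how many steps to take
        for start in range(0, n, batch_size):
            stop = min(start + batch_size, n)
            yield low[start:stop], high[start:stop]


low = np.arange(10, dtype=np.float32).reshape(5, 2)
high = low * 2.0
gen = batch_generator(low, high, batch_size=2)
first_x, first_y = next(gen)
batch_sizes = [len(next(gen)[0]) for _ in range(3)]
print(first_x.shape, batch_sizes)
```

Because the generator never terminates, the training loop has to be told how many batches make up one epoch; with 5 spectra and a batch size of 2 that would be 3 steps.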

4. Evaluate and Predict models

Jupyter/keras load validate.ipynb is the Jupyter Notebook for loading models and visualizing predictions
The model prediction function is built on tensorflow-gpu, with GPU memory allocation and session declaration
Predictions are written to an MGF file for library search at GNPS
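
For readers unfamiliar with the MGF format, the following is a minimal sketch of how one predicted spectrum could be serialized. It is not the repository's writer: the function name format_mgf_entry, the title string, and the choice to drop zero-intensity peaks are assumptions, and only the basic BEGIN IONS / END IONS layout with TITLE and PEPMASS headers is shown.

```python
def format_mgf_entry(title, precursor_mz, mz_values, intensities, min_intensity=0.0):
    """Return one spectrum as an MGF text block.

    Peaks at or below `min_intensity` are dropped so that empty bins of a
    predicted vector do not end up in the file.
    """
    lines = ["BEGIN IONS", f"TITLE={title}", f"PEPMASS={precursor_mz:.4f}"]
    for mz, intensity in zip(mz_values, intensities):
        if intensity > min_intensity:
            lines.append(f"{mz:.4f} {intensity:.4f}")
    lines.append("END IONS")
    return "\n".join(lines) + "\n"


entry = format_mgf_entry(
    title="predicted_scan_1",
    precursor_mz=500.25,
    mz_values=[100.0, 150.5, 200.0],
    intensities=[0.0, 0.75, 1.0],
)
print(entry)
```

Several entries can be concatenated into a single file. pyteomics, already listed under Imports, also provides an MGF writer that could be used instead of formatting the text by hand.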

5. Spectra denoising

The goal is for the cosine proximity of the upscaled spectra to be closer to 1.0 than to 0.0
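
As a reminder of what the score measures, here is a small sketch of cosine similarity between two intensity vectors. It assumes the comparison is made between a predicted spectrum and its high-quality counterpart on the same m/z bins; the example vectors are made up, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty spectra
    return dot / (norm_a * norm_b)


target = [0.0, 1.0, 0.5, 0.0]    # high-quality reference (illustrative)
noisy = [0.2, 1.0, 0.5, 0.3]     # low-quality input with extra peaks
denoised = [0.0, 0.9, 0.5, 0.0]  # hypothetical model output

print(cosine_similarity(target, noisy))
print(cosine_similarity(target, denoised))
```

In this toy case the denoised vector scores higher against the target than the noisy one does, which is the behaviour the models are meant to achieve on real spectra.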

cmaceves/MS2-Autoencoder