NASQQ: Nextflow Automatization and Standardization for Qualitative and Quantitative 1H NMR Metabolomics

Table of Contents

About The Project
Workflow Overview
Getting Started
- Prerequisites
- Project Setup
Final Remarks

About The Project

NASQQ is a comprehensive pipeline designed to automate the preparation and analysis of 1H NMR metabolomics data. It streamlines the process from raw Bruker FIDs through spectral preprocessing and metabolite identification to data analysis and pathway enrichment. This approach speeds up the interpretation of metabolomic profiles in the analyzed subjects and reduces the need for specialized domain knowledge.

Features

Automated Workflow: NASQQ automates the entire metabolomic analysis process, reducing manual intervention and ensuring reproducibility.
Comprehensive Analysis: The pipeline covers spectral preprocessing, metabolite identification, data analysis, and pathway enrichment, providing a holistic view of the metabolomic data.
Machine Learning Integration: NASQQ incorporates machine learning methods to bridge the gap between raw spectral information and biological insights.

Workflow Overview

Load FIDs: Retrieve raw FIDs from a specified location, extract sample names, and filter samples by pulse program.
Group Delay Correction: Eliminate Bruker Group Delay from the FIDs.
Solvent Suppression: Estimate and eliminate residual solvent signals from the FIDs.
Apodization: Enhance the Signal-to-Noise ratio in the spectra.
Zero Filling: Improve the digital resolution of spectra by appending zeros to the FIDs.
Fourier Transformation: Convert FIDs from the time domain to frequency domain spectra using Fourier Transformation.
Zero Order Phase Correction: Adjust spectra phase to ensure pure absorptive mode in the real part.
Internal Referencing: Align spectra with an internal reference compound.
Baseline Correction: Estimate and remove spectral baseline from the spectral profiles.
Negative Values Zeroing: Set all negative values in spectra to zero.
(Optional) Warping: Apply Semi-Parametric Time Warping technique to warp and realign spectra.
Window Selection: Choose the informative segment of spectra.
(Optional) Bucketing: Reduce the density of spectral peaks.
Normalization: Normalize the spectra.
Metabolites Quantification: Identify and quantify metabolites based on normalized spectra.
Add Metadata: Merge metadata with quantified metabolites' relative abundances.
(Optional) Combine Dataset Batches: Merge batches from the dataset for streamlined analysis.
Features Processing: Load data and perform sanity checks.
Exploratory Data Analysis: Conduct Principal Component Analysis and generate exploratory analysis visualizations.
Univariate Analysis: Identify outliers, assess data normality, and conduct univariate statistical tests.
Multivariate Analysis: Utilize machine learning models to analyze metabolite data.
Pathway Analysis: Perform pathway enrichment analysis using KEGG database entries.

For detailed information on each stage of the analysis and the scripts, refer to the docs folder, where separate README.md files are provided.
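Three of the stages listed above (apodization, zero filling, and Fourier transformation) can be illustrated with a short, standard-library-only Python sketch. It is not the code used by NASQQ: the function names, the synthetic single-resonance FID, and the naive DFT are assumptions made purely for illustration.

```python
import cmath
import math


def apodize(fid, lb, dwell):
    """Multiply the FID by a decaying exponential window (line broadening lb in Hz)."""
    return [s * math.exp(-math.pi * lb * k * dwell) for k, s in enumerate(fid)]


def zero_fill(fid, size):
    """Append complex zeros so that the FID has `size` points."""
    return list(fid) + [0j] * (size - len(fid))


def dft(fid):
    """Naive O(n^2) discrete Fourier transform; adequate for a small demo."""
    n = len(fid)
    return [
        sum(s * cmath.exp(-2j * math.pi * k * m / n) for k, s in enumerate(fid))
        for m in range(n)
    ]


# Synthetic FID (hypothetical data): one resonance at 50 Hz with a 0.1 s decay,
# 64 complex points sampled every 1/256 s.
dwell = 1 / 256
fid = [
    cmath.exp(2j * math.pi * 50 * k * dwell) * math.exp(-k * dwell / 0.1)
    for k in range(64)
]

# Apodize, zero fill to 128 points, then transform to the frequency domain.
spectrum = dft(zero_fill(apodize(fid, lb=1.0, dwell=dwell), 128))

# Locate the strongest bin and convert its index to a frequency in Hz.
peak_bin = max(range(len(spectrum)), key=lambda m: abs(spectrum[m]))
peak_hz = peak_bin / (len(spectrum) * dwell)
print(peak_bin, peak_hz)
```

Zero filling from 64 to 128 points halves the spacing between frequency bins (from 4 Hz to 2 Hz in this example) without adding new information, while the exponential window trades some resolution for a better signal-to-noise ratio.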

Note: NASQQ is an extension of existing solutions, aimed at enhancing the accessibility and efficiency of metabolomic data analysis. The workflow is designed to be system agnostic; however, it has been tested only on macOS (M1 chip) and Linux (Ubuntu 22.04). To use the pipeline on Windows, please refer to WSL (Windows Subsystem for Linux).


Getting Started

Before using the pipeline, make sure that the prerequisites are met and the project is properly set up. Please review the following sections:

Prerequisites

Install Docker
Install Nextflow
(Optional) Install pre-commit

Project Setup

Clone the project's GitHub repository to your local machine:

git clone https://github.com/ardigen/nasqq

Note: Grant appropriate permissions to the workflow directory: chmod 777 -R <location>/nasqq

Next, build the Docker images: the workflow runs its R and Python environments in Docker containers. Dockerfiles for both environments are provided and are compatible with Linux and macOS (M1 chip) systems.

Linux users should execute:

cd nasqq/docker/Python
./build_docker_linux.sh
cd ../R
./build_docker_linux.sh

macOS (M1) users should execute:

cd nasqq/docker/Python
./build_docker_macos.sh
cd ../R
./build_docker_macos.sh

After setting up the project, create a comma-separated manifest.csv file with the following structure and headers:

dataset,batch,input_path,metadata_file,selected_sample_names,target_value,referencing_range,window_selection_range
test1,test1,./testthat/data/dataset/dataset1,./testthat/data/metadata/metadata1.csv,500;501;503;504,0,None,0;10
test2,test2,./testthat/data/dataset/dataset2,./testthat/data/metadata/metadata2.csv,all,0,None,0;5
test3,None,./testthat/data/dataset/dataset3,./testthat/data/metadata/metadata3.csv,502;505;507;508;509;510,2,2.5;4.55,0;10

dataset - name of dataset.
batch - batch name (Default: None).
input_path - absolute path to NMR dataset in Bruker format.
metadata_file - absolute path to metadata file to be merged with dataset.
selected_sample_names - selection of sample names, ";" separated (Default: all).
target_value - ppm value of the signal used as the internal reference for the spectra (Default: 0).
referencing_range - if target_value is different from the default, the range in which the referencing signal will be searched, separated by ";" (Default: None).
window_selection_range - range of the informative part of the spectra, separated by ";" (Default: 0;10).
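The column conventions above can be checked before launching the workflow. The following Python sketch parses a manifest and converts the ";"-separated fields and the "None" / "all" defaults into native values. It is an illustrative helper, not part of NASQQ: the function names `parse_range` and `read_manifest` are hypothetical, and the sample text is copied from the example manifest above.

```python
import csv
import io

EXPECTED_HEADERS = [
    "dataset", "batch", "input_path", "metadata_file",
    "selected_sample_names", "target_value",
    "referencing_range", "window_selection_range",
]


def parse_range(value):
    """Turn 'low;high' into a (float, float) tuple; 'None' becomes None."""
    if value == "None":
        return None
    low, high = (float(part) for part in value.split(";"))
    return (low, high)


def read_manifest(handle):
    """Read manifest rows from an open text handle into a list of dicts."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_HEADERS:
        raise ValueError(f"unexpected manifest headers: {reader.fieldnames}")
    rows = []
    for row in reader:
        samples = row["selected_sample_names"]
        rows.append({
            "dataset": row["dataset"],
            "batch": None if row["batch"] == "None" else row["batch"],
            "input_path": row["input_path"],
            "metadata_file": row["metadata_file"],
            # "all" is kept as a keyword; otherwise split the ";"-separated names
            "selected_sample_names": "all" if samples == "all" else samples.split(";"),
            "target_value": float(row["target_value"]),
            "referencing_range": parse_range(row["referencing_range"]),
            "window_selection_range": parse_range(row["window_selection_range"]),
        })
    return rows


MANIFEST_TEXT = """\
dataset,batch,input_path,metadata_file,selected_sample_names,target_value,referencing_range,window_selection_range
test1,test1,./testthat/data/dataset/dataset1,./testthat/data/metadata/metadata1.csv,500;501;503;504,0,None,0;10
test2,test2,./testthat/data/dataset/dataset2,./testthat/data/metadata/metadata2.csv,all,0,None,0;5
test3,None,./testthat/data/dataset/dataset3,./testthat/data/metadata/metadata3.csv,502;505;507;508;509;510,2,2.5;4.55,0;10
"""

rows = read_manifest(io.StringIO(MANIFEST_TEXT))
for row in rows:
    print(row["dataset"], row["batch"], row["window_selection_range"])
```

To check a real file, pass an open file handle instead of the in-memory string, for example `read_manifest(open("manifest.csv", newline=""))`.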

Another file that needs to be created is params.yml, which holds the inputs required to configure the data processing pipeline. Fill in the values according to the table below.

Input	Description	Datatype
manifest	Absolute path to the manifest.csv file containing metadata information for the analysis	string
outDir	Absolute path to the directory where the output files will be stored	string
reportsDir	Absolute path to the directory where the analysis reports will be generated	string
workDir	Absolute path to the directory where the intermediate work files will be stored	string
launchDir	Absolute path to the directory from which the pipeline is launched	string
maxRetries	Number of attempts the pipeline should make to process a task before giving up	integer
errorStrategy	The strategy to handle errors during pipeline execution (terminate/ignore/retry)	string
check_pulse_samples	The pulse program specified in the manifest file for processing	string
run_bucketing	Enable/disable bucketing for simplifying the density of peaks before metabolite quantification	boolean
run_warping	Enable/disable warping for spectra re-alignment based on a reference spectrum	boolean
run_combine_project_batches	Enable/disable merging datasets for data analysis where batch is not "None"	boolean
ncores	The number of threads allocated for the ASICS quantification task	integer
log1p	Enable/disable log1p normalization of metabolites before data analysis	boolean
metadata_column	The column containing binary state information for the data analysis module	string
reverse_axis_samples	Specifies whether to reverse the axis for all samples or selected samples based on a threshold	string
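The table above can be turned into a simple pre-flight check. The sketch below assumes the contents of params.yml have already been loaded into a Python dictionary (for example with a YAML parser) and verifies that every key from the table is present with the listed datatype. The function name `validate_params` and all values in `example_params` are hypothetical placeholders, not defaults shipped with NASQQ.

```python
# Keys and datatypes as listed in the table above.
EXPECTED_TYPES = {
    "manifest": str, "outDir": str, "reportsDir": str, "workDir": str,
    "launchDir": str, "maxRetries": int, "errorStrategy": str,
    "check_pulse_samples": str, "run_bucketing": bool, "run_warping": bool,
    "run_combine_project_batches": bool, "ncores": int, "log1p": bool,
    "metadata_column": str, "reverse_axis_samples": str,
}
ERROR_STRATEGIES = {"terminate", "ignore", "retry"}


def validate_params(params):
    """Return a list of problems; an empty list means all keys and types match."""
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in params:
            problems.append(f"missing key: {key}")
            continue
        value = params[key]
        # bool is a subclass of int in Python, so reject booleans for integer fields
        bool_as_int = expected is int and isinstance(value, bool)
        if bool_as_int or not isinstance(value, expected):
            problems.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    strategy = params.get("errorStrategy")
    if isinstance(strategy, str) and strategy not in ERROR_STRATEGIES:
        problems.append(f"errorStrategy: unknown value {strategy!r}")
    return problems


# Placeholder values for illustration only; replace them with your own settings.
example_params = {
    "manifest": "/path/to/manifest.csv",
    "outDir": "/path/to/out",
    "reportsDir": "/path/to/reports",
    "workDir": "/path/to/work",
    "launchDir": "/path/to/launch",
    "maxRetries": 1,
    "errorStrategy": "retry",
    "check_pulse_samples": "<pulse program>",
    "run_bucketing": False,
    "run_warping": False,
    "run_combine_project_batches": False,
    "ncores": 2,
    "log1p": True,
    "metadata_column": "<metadata column>",
    "reverse_axis_samples": "<setting>",
}

print(validate_params(example_params))
```

Returning a list of problems rather than raising on the first one lets all issues in a params file be reported in a single pass.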

After completing all the steps above, open run.sh and adjust the paths used to execute the workflow, or run the workflow manually with the following command:

nextflow run ../main.nf \
    -c ../nextflow.config \
    -profile standard \
    -params-file params.yml


Final Remarks

Tests

To run the pipeline on the test data, execute the test script:

./tests/run.sh

Memory allocation

Please remember that the resources required on your local machine grow with the number of datasets provided in the manifest (see this thread: nextflow-io/nextflow#1787). A lack of resources can lead to incorrect memory allocation in the scripts. It is recommended to adjust the max_cpus and max_memory params in the nextflow.config file according to the resources available on your local machine.

Example of the resulting error:

 *** caught segfault ***
  address 0x7ff0000000000003, cause 'memory not mapped'

Be aware that Nextflow is not a resource orchestration system. If you need resource orchestration, configure a dedicated executor such as AWS Batch or Kubernetes.

Note: The default computation settings cannot be lower than:

cpus = 2

memory = 2.GB RAM

License

NASQQ is distributed under the MIT License. See LICENSE.md for more information.

Contact

For contact purposes, there is a dedicated email address: nasqq@ardigen.com

Credits and acknowledgments

The scripts and workflow were originally created as part of Łukasz Pruss's PhD project, in collaboration between Ardigen S.A. and Wrocław University of Science and Technology (WUST). A special acknowledgment goes to Oskar Gniewek, whose expertise and critical feedback significantly contributed to the Nextflow implementation. He also played a crucial role in managing unit and integration tests, as well as handling dependencies across various systems for pipeline execution.

Furthermore, many people were involved in the evolution of the pipeline, turning it from a concept into an end-to-end solution. These contributors include:

Special thanks for assistance in the development process, code reviews, and tips are extended to:

Citations

An extensive list of references and packages used by the pipeline can be found in our publication:

NASQQ: Nextflow automatization and standardization for qualitative and quantitative 1H NMR metabolomics data preparation and analysis.

Łukasz Pruss, Oskar Gniewek, Tomasz Jetka, Wojciech Wojtowicz, Kaja Milanowska-Zabel, Piotr Młynarz.

DOI: --

If you want to utilize NASQQ for your analysis, please refer to LICENSE.md

To cite the nf-core publication use:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
