
Primary language: Java. License: Apache License 2.0.


Fonda

Fonda is a framework that offers scalable, automated analysis of multiple next-generation sequencing (NGS) data types.

Fonda Prebuilt binaries

All binaries built by the CI process (described in CONTRIBUTING.md) are available via the Download page and the GitHub Release page.

Required environment setup

  • Unix
  • Java 8

Build Fonda

To launch all unit and integration tests, run the command:

./gradlew test

To launch all unit and integration tests, run the source code analysis (via PMD), check adherence to the coding standard (via Checkstyle) and measure code coverage (via JaCoCo), run the command:

./gradlew check

To build Fonda, run the command:

./gradlew clean build zip
  • clean - deletes the Fonda build directory for a fresh compile
  • build - creates the Fonda .jar file and the src folder in build/libs
  • zip - packs the Fonda .jar file and the src folder into a zip file located in build/distributions

Note: before building a specific Fonda version, check that the Fonda version in the build.gradle file is correct.

Fonda installation

The Fonda package contains two components:

  1. Fonda .jar file
  2. src folder

If the src_scripts option in the global config is not set, make sure the src folder and the .jar file are placed in the same parent directory. This is necessary because some pipelines call external scripts from the src folder (the python and R subfolders).
Each pipeline has its own software prerequisites, and the user needs to make sure they are properly installed before executing a specific Fonda pipeline. The required software and databases are listed in the global_config files.
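The layout requirement above can be checked before a run. The following is a minimal sketch, assuming the folder is literally named src and sits next to the .jar file; the function name is hypothetical and not part of Fonda.

```python
from pathlib import Path


def src_next_to_jar(jar_path):
    """Return True if the .jar exists and a 'src' directory shares its parent directory."""
    jar = Path(jar_path)
    # Both conditions must hold: the .jar is a file, and src is a sibling directory.
    return jar.is_file() and (jar.parent / "src").is_dir()
```

This check only matters when src_scripts is not set in the global config; when it is set, the scripts are presumably taken from that location instead.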

Available workflows in Fonda

| Workflow | Description |
| --- | --- |
| DnaCaptureVar_Fastq | DNA Captured sequencing data for genomic variant detection using fastq data |
| DnaCaptureVar_Bam | DNA Captured sequencing data for genomic variant detection using bam data |
| DnaAmpliconVar_Fastq | DNA Amplicon sequencing data for genomic variant detection using fastq data |
| DnaAmpliconVar_Bam | DNA Amplicon sequencing data for genomic variant detection using bam data |
| DnaWgsVar_Fastq | DNA whole genome sequencing data for genomic variant detection using fastq data |
| DnaWgsVar_Bam | DNA whole genome sequencing data for genomic variant detection using bam data |
| RnaCaptureVar_Fastq | RNA Captured sequencing data for genomic variant detection using fastq data |
| HlaTyping_Fastq | DNA sequencing data for genomic HLA type prediction using fastq data |
| Bam2Fastq | Convert bam file to fastq files |
| RnaExpression_Fastq | RNA sequencing data for gene expression analysis using fastq data |
| RnaExpression_Bam | RNA sequencing data for gene expression analysis using bam data |
| scRnaExpression_Fastq | single cell RNA sequencing data for gene expression analysis using fastq data |
| scRnaExpression_CellRanger_Fastq | 10X single cell RNA/TCR/BCR sequencing data for gene expression and immune profiling analysis using fastq data |
| scRnaExpression_Bam | single cell RNA sequencing data for gene expression analysis using bam data |
| RnaFusion_Fastq | RNA sequencing data for gene fusion detection using fastq data |
| TcrRepertoire_Fastq | DNA or RNA sequencing data for TCR or BCR repertoire detection using fastq data |
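The workflow names follow a visible convention: the suffix after the last underscore names the input data type, with Bam2Fastq as the one exception. The sketch below reads that convention off the table; the helper is hypothetical and only illustrates the naming scheme, not how Fonda itself resolves workflows.

```python
def input_data_type(workflow):
    """Infer the input data type ('fastq' or 'bam') from a workflow name."""
    if workflow == "Bam2Fastq":
        return "bam"  # this workflow converts a bam file into fastq files
    # All other workflow names end in _Fastq or _Bam.
    suffix = workflow.rsplit("_", 1)[-1].lower()
    if suffix in ("fastq", "bam"):
        return suffix
    raise ValueError(f"Unrecognized workflow name: {workflow}")
```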

Before running Fonda…

Show help message

java -jar fonda-<VERSION>.jar -help

Possible options:

| Option | Required | Description |
| --- | --- | --- |
| -global_config <arg> | Yes | Configuration file for the particular workflow |
| -study_config <arg> | Yes | Configuration file for the specific study |
| -detail | No | Show the details of the Fonda framework |
| -local | No | Default: no. Run the job on the local machine |
| -test | No | Default: no. Test the commands without actually running the job |
| -sync | No | Default: no. Run Fonda in synchronous mode, waiting for all tasks to complete |
| -master | No | Default: no. Run the main master script to manage all Fonda-created scripts |
| -help | No | Show the help message |

Elaboration of required config arguments

-global_config file - sets a configuration file for a particular pipeline version (such as RnaExpression_Fastq v1.1). The config file has four sections:

  • [all_tools] - contains paths to used tools
  • [Databases] - contains input data/paths to input datasets
  • [Pipeline_Info] - contains workflow and toolset settings
  • [Queue_Parameters] - contains SGE (Sun Grid Engine) settings

If a parameter needs to change, a new version of the config should be generated and recorded. Different studies can, however, share an identical pipeline.
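As an illustration of the four-section layout, the sketch below checks a global_config for missing sections. It assumes the file is INI-style and can be read by Python's configparser, which the source does not confirm; the keys and paths in the example text are placeholders, not real Fonda parameters.

```python
import configparser

REQUIRED_SECTIONS = ("all_tools", "Databases", "Pipeline_Info", "Queue_Parameters")


def missing_sections(config_text):
    """Return the required global_config sections that are absent from the text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(config_text)
    return [name for name in REQUIRED_SECTIONS if not parser.has_section(name)]


# Hypothetical, heavily shortened global_config; every key and path is a placeholder.
EXAMPLE = """
[all_tools]
example_tool = /path/to/example_tool

[Databases]
example_db = /path/to/example_db

[Pipeline_Info]
example_setting = value
"""

print(missing_sections(EXAMPLE))  # -> ['Queue_Parameters']
```

Section names are matched case-sensitively, so [databases] would be reported as missing in this sketch.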

The available parameter options for the global_config files are listed here.
Examples of the global_config files can be found here.

Keep in mind that each global_config file includes only the tools and databases required for executing that specific pipeline version.
For example, global_config_RnaExpression_Fastq_v1.1.txt may list the databases, tools and parameters for one version of the RnaExpression_Fastq pipeline. Later on, global_config_RnaExpression_Fastq_v1.2.txt may be prepared for another version of the same pipeline, and its required databases, tools and parameters might be quite different from the first one.
Therefore, all potential databases, tools and parameter options for each available workflow should be listed so that users can take full advantage of Fonda in different projects.

The line_ending option in the [Pipeline_Info] section controls line endings. It can be set to LF (Unix-style end-of-line marker) or CRLF (Windows-style end-of-line marker). If the option is not specified, LF is used by default.
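The rule for line_ending can be sketched as a small lookup with a default. This is an illustration of the documented behavior, not Fonda's implementation; the function name is hypothetical, and accepting lowercase values is a choice of this sketch rather than something the source states.

```python
LINE_SEPARATORS = {"LF": "\n", "CRLF": "\r\n"}


def resolve_line_separator(line_ending=None):
    """Map a line_ending value to its separator, defaulting to LF when unset."""
    if line_ending is None or line_ending.strip() == "":
        return LINE_SEPARATORS["LF"]  # option not specified: fall back to LF
    try:
        return LINE_SEPARATORS[line_ending.strip().upper()]
    except KeyError:
        raise ValueError(f"line_ending must be LF or CRLF, got {line_ending!r}") from None
```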

-study_config file - sets a configuration file for the particular study whose NGS data is to be analyzed. This config file has one section, [Series_Info].
Required parameters for each workflow:

| Parameter | Description |
| --- | --- |
| job_name | Sets the job ID |
| dir_out | Sets the output directory for the analysis |
| fastq_list / bam_list | Sets the path to the input manifest file |
| LibraryType | Sets the sequencing library type - DNAWholeExomeSeq_Paired, DNAWholeExomeSeq_Single, DNATargetSeq_Paired, DNATargetSeq_Single, DNAAmpliconSeq_Paired, RNASeq_Paired, RNASeq_Single, etc. |
| DataGenerationSource | Sets the data generation source - Internal, IGR, Broad, etc. |
| Date | Sets the sequencing run date |
| Project | Sets the project ID |
| Run | Sets the run ID |

For the format of the input manifest files, see here.
Examples of the study_config files can be found here.

Elaboration of additional arguments

  • -help - to show the help message
  • -detail - to show the workflow details available in the current Fonda framework
  • -local - to run the job on the local machine without being submitted to the cluster
  • -test - to have a pilot run in the command line interface without actually submitting jobs to the cluster

Run Fonda: actual example for RnaExpression_Fastq workflow

Test mode

java -jar /path_to_data/fonda/<VERSION>/fonda-<VERSION>.jar -global_config /path_to_data/fonda/global_config/global_config_RnaExpression_Fastq_v1.1.txt -study_config /path_to_data/config_RnaExpression_Fastq_test.txt -test

In test mode, no job is submitted to the cluster for an actual run. Instead, you can check whether the contents of each generated shell script are properly organized, which is useful for debugging.

Submit jobs to cluster

java -jar /path_to_data/fonda/<VERSION>/fonda-<VERSION>.jar -global_config /path_to_data/fonda/global_config/global_config_RnaExpression_Fastq_v1.1.txt -study_config /path_to_data/config_RnaExpression_Fastq_test.txt

Local machine mode

java -jar /path_to_data/fonda/<VERSION>/fonda-<VERSION>.jar -global_config /path_to_data/fonda/global_config/global_config_RnaExpression_Fastq_v1.1.txt -study_config /path_to_data/config_RnaExpression_Fastq_test.txt -local

In local machine mode, the individual jobs run on the local machine without being submitted to the cluster.
The generated scripts are the same as in cluster mode, which makes this mode useful for debugging.
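The three invocations above differ only in the trailing flag, which the following sketch makes explicit. It is a hypothetical wrapper, not part of Fonda: the mode labels cluster, test and local are names chosen for this example, and treating the modes as mutually exclusive is a simplification.

```python
MODE_FLAGS = {"cluster": [], "test": ["-test"], "local": ["-local"]}


def build_fonda_command(jar, global_config, study_config, mode="cluster"):
    """Build the java argument list for one of the three run modes shown above."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"mode must be one of {sorted(MODE_FLAGS)}")
    return ["java", "-jar", jar,
            "-global_config", global_config,
            "-study_config", study_config,
            *MODE_FLAGS[mode]]  # cluster mode adds no extra flag
```

The returned list can be passed to subprocess.run without going through a shell, so paths containing spaces need no extra quoting.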

Contributors

  • Shu Yan 1
  • Tenghui Chen 1
  • Joon Sang Lee 1
  • Chandra Sekhar Pedamallu 1
  • Mark Magid 1
  • Quan Wan 1
  • Ei-Wen Yang 1
  • Donald Jackson 1
  • Jack Pollard 1
  • Aleksandr Sidoruk 2
  • Mariia Zueva 2
  • Mikhail Alperovich 2
  • Yulia Kamyshova 2

1 Sanofi, 270 Albany Street, Cambridge, MA, USA

2 EPAM Systems, Inc.

Publications

Links to publications that contain Fonda references