alt text

Arima SV Pipeline for Mapping, SV Detection and QC

To order Arima kits, please visit our website: https://arimagenomics.com/

Getting Started

We provide both Docker and Singularity container implementations of the pipeline for users to run Arima SV pipeline out of the box. You can mount the necessary input and output locations and run Arima SV pipeline in two ways.

For advanced users, we also provide a non-containerized version of the pipeline in this repository. You will need to install all the tools and dependencies yourself.

Using Docker image

Please refer to the manual on Arima's DockerHub: https://hub.docker.com/repository/docker/arimaxiang/arima_sv

Using Singularity image

  • How to run Singularity on HPC when it is pre-installed
module load singularity
sudo apt-get update && sudo apt-get install -y build-essential libssl-dev uuid-dev libgpgme11-dev squashfs-tools libseccomp-dev wget pkg-config git cryptsetup
  • Install Go Download Go installer for Linux
wget https://go.dev/dl/go1.17.6.linux-amd64.tar.gz
  • Remove any previous installation of Go
sudo rm -rf /usr/local/go
  • Extract Go and put it in the directory '/usr/local'
sudo tar -C /usr/local -xzf go1.17.6.linux-amd64.tar.gz
  • Add Go to the $PATH variable
export PATH=$PATH:/usr/local/go/bin
echo 'export PATH=/usr/local/go/bin:$PATH' >> ~/.bashrc && source ~/.bashrc
  • Verify Go installation
go version  # Expected: 'go version go1.17.6 linux/amd64'
  • Create a variable with the version of Singularity to be downloaded
export VERSION=3.8.5  # Adjust this as necessary
  • Download the Singularity source code
wget https://github.com/hpcng/singularity/releases/download/v${VERSION}/singularity-${VERSION}.tar.gz
  • Extract the Singularity code
tar -xzf singularity-${VERSION}.tar.gz
  • Change directory into the Singularity directory
cd singularity-${VERSION}
  • Compile the Singularity source code
./mconfig
make -C builddir
sudo make -C builddir install
  • Verify Singularity Installation
singularity version  # '3.8.5'
  • Download Arima's Singularity container for SV pipeline
wget ftp://ftp-arimagenomics.sdsc.edu/pub/singularity/Arima-SV-Pipeline-singularity-v1.3.sif
  • Run the pipeline with Singularity (Adjust this as necessary)
singularity exec -B YOUR_HOST_OUTPUT_DIR:/FFPE/mydata/ Arima-SV-Pipeline-singularity-v1.3.sif bash /FFPE/Arima-SV-Pipeline-v1.3.sh -W 1 -B 1 -J 1 -H 0 -a /root/anaconda3/bin/bowtie2 -b /usr/local/bin/ -w /FFPE/HiCUP-0.8.0/ -j /FFPE/juicer-1.6/ -r /FFPE/Arima_files/reference/hg38/hg38.fa -s /FFPE/Arima_files/Juicer/hg38.chrom.sizes -c /FFPE/Arima_files/Juicer/hg38_GATC_GANTC.txt -x /FFPE/Arima_files/reference/hg38/hg38 -d /FFPE/Arima_files/HiCUP/Digest_hg38_Arima.txt -I /FFPE/test_data/fastq/JB_3_5M_R1.fastq.gz,/FFPE/test_data/fastq/JB_3_5M_R2.fastq.gz -o /FFPE/mydata/ -p JB_3_5M -e /FFPE/Arima_files/hic_breakfinder/intra_expect_100kb.hg38.txt -E /FFPE/Arima_files/hic_breakfinder/inter_expect_1Mb.hg38.txt -t 12

Usage (Command line options)

Arima-SV-Pipeline-v1.3.sh [-W run_hicup] [-B run_hic_breakfinder] [-J run_juicer] [-H run_hiccups] [-r reference_file] [-s chrom_sizes_file] [-c cut_site_file] [-a bowtie2] [-x bowtie2_index_basename] [-d digest] [-w hicup_dir] [-b hic_breakfinder_dir] [-j juicer_dir] [-I FASTQ_string] [-S sample_size] [-o out_dir] [-p output_prefix] [-e exp_file_intra] [-E exp_file_inter] [-t threads] [-C] [-v] [-h]

  • [-W run_hicup]: "1" (default) to run HiCUP pipeline, "0" to skip. If skipping, HiCUP_summary_report_*.txt and *R1_2*.hicup.bam need to be in the HiCUP output folder.
  • [-B run_hic_breakfinder]: "1" (default) to run hic_breakfinder, "0" to skip
  • [-J run_juicer]: "1" (default) to run Juicer, "0" to skip
  • [-H run_hiccups]: "1" to run HiCCUPS, "0" (default) to skip
  • [-r reference_file]: reference FASTA file
  • [-s chrom_sizes_file]: chrom.sizes file generated from the reference file
  • [-c cut_site_file]: cut site file used by Juicer
  • [-a bowtie2]: bowtie2 tool location
  • [-x bowtie2_index_basename]: bowtie2 index file prefix
  • [-d digest]: genome digest file produced by hicup_digester
  • [-w hicup_dir]: directory of the HiCUP tool
  • [-b hic_breakfinder_dir]: directory of the hic_breakfinder tool
  • [-j juicer_dir]: directory of the Juicer tool
  • [-I FASTQ_string]: a pair of FASTQ files separated by "," (no space is allowed)
  • [-S sample_size]: (Optional) subsample the input FASTQ files (default: no subsample)
  • [-o out_dir]: output directory
  • [-p output_prefix]: output file prefix (filename only, not including the path)
  • [-e exp_file_intra]: intra-chromosomal background model file for normalization
  • [-E exp_file_inter]: inter-chromosomal background model file for normalization
  • [-t threads]: number of threads to run HiCUP and Juicer
  • [-C]: clean up intermediate files to reduce space
  • [-v]: print version number and exit
  • [-h]: print this help and exit

Example

bash /FFPE/Arima-SV-Pipeline-v1.3.sh -W 1 -B 1 -J 1 -H 0 -a /root/anaconda3/bin/bowtie2 -b /usr/local/bin/ -w /FFPE/HiCUP-0.8.0/ -j /FFPE/juicer-1.6/ -r /FFPE/Arima_files/reference/hg38/hg38.fa -s /FFPE/Arima_files/Juicer/hg38.chrom.sizes -c /FFPE/Arima_files/Juicer/hg38_GATC_GANTC.txt -x /FFPE/Arima_files/reference/hg38/hg38 -d /FFPE/Arima_files/HiCUP/Digest_hg38_Arima.txt -I /FFPE/test_data/fastq/JB_3_5M_R1.fastq.gz,/FFPE/test_data/fastq/JB_3_5M_R2.fastq.gz -o /FFPE/test_data/test_output_docker/ -p JB_3_5M -e /FFPE/Arima_files/hic_breakfinder/intra_expect_100kb.hg38.txt -E /FFPE/Arima_files/hic_breakfinder/inter_expect_1Mb.hg38.txt -t 12

Test datasets

ftp://ftp-arimagenomics.sdsc.edu/pub/Arima_FFPE/fastq/JB_3_5M_R1.fastq.gz
ftp://ftp-arimagenomics.sdsc.edu/pub/Arima_FFPE/fastq/JB_3_5M_R2.fastq.gz
ftp://ftp-arimagenomics.sdsc.edu/pub/ARIMA_TEST_DATASET/K562_HiC/K562_100M_R1.fastq.gz
ftp://ftp-arimagenomics.sdsc.edu/pub/ARIMA_TEST_DATASET/K562_HiC/K562_100M_R2.fastq.gz

Pipeline Outputs

The Arima SV pipeline Bioinformatics User Guide walks through an example of how to run the Arima SV pipeline using the provided test data and provides additional information on the output files. The Arima SV pipeline generates multiple files. Main output files are:

Arima Shallow Sequencing QC

[output_directory]/[output_prefix][version#]_Arima_QC_shallow.txt

Contents: This file includes QC metrics for assessing the shallow sequencing data for each CHiC library.

  • Break down of the number of read pairs
  • The target sequencing depth for deep sequencing
  • The percentage of long-range cis interactions that overlap the probe regions

Arima Deep Sequencing QC

[output_directory]/[output_prefix][version#]_Arima_QC_deep.txt

Contents: This file includes QC metrics for assessing the deep sequencing data for each CHiC library.

  • Break down of the number of read pairs
  • The number of loops called
  • The percentage of long-range cis interactions that overlap the probe regions

SV file in .bedpe format from hic_breakfinder

[output_directory]/hic_breakfinder/[output_prefix].breaks.bedpe

HiC heatmap for visualization using Juicebox

[output_directory]/juicer/aligned/[output_prefix]_inter_30.hic

Definition of QC Metrics

  • Raw_pairs: Raw # of read pairs
  • Mapped_SE_reads: # of single-end reads that can be mapped to the reference
  • %_Mapped_SE_reads: % of single-end reads that can be mapped to the reference, out of total # of single-end reads
  • %_Truncated: % of truncated single-end reads, out of all single-end reads
  • Duplicated_pairs: # of duplicated read pairs
  • %_Duplicated_pairs: % of duplicated read pairs, out of all read pairs
  • Unique_valid_pairs: HiC read pairs which are not derived from artifacts such as self-circles and dangling-ends, and which contain spatial proximity information
  • %_Unique_valid_pairs: % of unique valid read pairs, out of all read pairs
  • [Obsolete] Library Complexity: Theoretical # of unique molecules in a Hi-C library.
  • Intra_pairs: All unique pairs where both read-ends align to the same chromosome
  • %_Intra_pairs: % of all unique total pairs that have both read-ends aligning to the same chromosome
  • Intra_ge_15kb_pairs: All unique pairs where both read-ends align to the same chromosome and have an insert size >=15kb
  • %_Intra_ge_15kb_pairs: % of all unique total pairs that have both read-ends aligning to the same chromosome and have an insert size >=15kb
  • Inter_pairs: All unique pairs where each read-end aligns to a different chromosome
  • %_Inter_pairs: % of all unique total pairs where each read-end aligns to a different chromosome
  • Invalid_pairs: # of invalid read pairs
  • %_Invalid_pairs: % of invalid read pairs, out of all read pairs
  • Same_circularised_pairs: # of same-circularised pairs
  • %_Same_circularised_pairs: % of same-circularised pairs, out of all read pairs
  • Same_dangling_ends_pairs: # of same dangling-ends pairs
  • %_Same_dangling_ends_pairs: % of same dangling-ends pairs, out of all read pairs
  • Same_fragment_internal_pairs: # of same-fragment-internal pairs
  • %_Same_fragment_internal_pairs: % of same-fragment-internal pairs, out of all read pairs
  • Re_ligation_pairs: # of re-ligation pairs
  • %_Re_ligation_pairs: % of re-ligation pairs, out of all read pairs
  • Contiguous_sequence_pairs: # of contiguous-sequence pairs
  • %_Contiguous_sequence_pairs: % of contiguous-sequence pairs, out of all read pairs
  • Wrong_size_pairs: # of wrong-size pairs
  • %_Wrong_size_pairs: % of wrong-size pairs, out of all read pairs
  • Mean_lib_length: Mean library length
  • Lcis_trans_ratio: The ratio of Lcis to Trans data. This is the signal to noise ratio for translocation calling.
  • Target Raw Reads for Deep Seq: # of Raw PE Reads to obtain high sensitivity SV calls Based on bench marking against ground truth datasets. This value is calibrated to balance Sensitivity and Specificity of the SV calls from hic_breakfinder. This value uses conservative values for the mapping rate at 80% and the duplicate rate at 30%.
  • SVs: # of SV calls made by the Arima SV Pipeline.

Current Pipeline Version

1.3

Release Note

v1.1

  • Previously, the %Lcis, %Scis and %Trans (%inter) were calculated based on the total number of unique valid pairs. We changed the calculation in this version in order to make it consistent with Juicer pipeline's calculation (%Lcis = Lcis / uniquely mapped reads). We now use all unique pairs (no matter valid or invalid) as the denominator instead of the unique valid pairs. Please do not directly compare those percentage values generated by this version and the previous version. Note that users should not rely solely on %Lcis value to evaluate sample quality. This value should not be treated as a hard cut off. Instead, please use Lcis/Trans ratio to evaluate sample quality. Even with %Lcis less than 25% or Lcis/Trans ratio close to 1, we can still identify meaningful and real SVs out of the sample.
  • Changed the calculation of the "intra pairs" to make it closer to Juicer pipeline's calculation. The new estimated value is coming from all unique pairs, instead of unique valid pairs.
  • Other minor fixes

v1.2

  • The name of the output QC metrics files now contains the pipeline version #
  • You can now skip any modules in the pipeline and still get the final QC table
  • Other minor fixes

v1.3

  • HiC heatmap is generated from BAM file directly, instead of running the entire Juicer pipeline
  • Added detailed breakdown list of invalid read pairs categories to the final QC table
  • Added insert size calculation to the final QC table
  • Added an option to remove intermediate files to reduce space
  • Added an option for sub-sampling input FASTQ files (default: no subsampling)
  • Renamed inter_30.hic and merged_loops.bedpe files to include the sample name
  • The chromosomes in the HiC heatmap are now sorted by name instead of sorted by length
  • Fixed HiC heatmap with desktop version of Juicebox compatibility issue
  • Removed library complexity calculation from the pipeline
  • A log file (Arima_SV_v1.3_*.log) will be automatically generated in the output folder

Support

For Arima customer support, please contact techsupport@arimagenomics.com

Acknowledgments

  • Wingett S, et al. (2015) HiCUP: pipeline for mapping and processing Hi-C data F1000Research, 4:1310 (doi: 10.12688/f1000research.7334.1)
  • Dixon, J. R., Xu, J., Dileep, V., Zhan, Y., Song, F., Le, V. T., ... & Yue, F. (2018). Integrative detection and analysis of structural variation in cancer genomes. Nature genetics, 50(10), 1388-1398.
  • Neva C. Durand, Muhammad S. Shamim, Ido Machol, Suhas S. P. Rao, Miriam H. Huntley, Eric S. Lander, and Erez Lieberman Aiden. "Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments." Cell Systems 3(1), 2016.