/viroCapt

Primary LanguageRGNU General Public License v3.0GPL-3.0

viroCapt

viroCapt is a bioinformatics pipeline and analysis platform for detecting viral insertions in the human genome from viral capture data.

Installation

viroCapt is implemented as an R package, which contains the pipeline implemented as a Makefile, dedicated R functions from the package to infer, score and rank insertions candidates, and a Shiny webapp for interactive visualization.

Dependencies

viroCapt depends on standard bioinformatics tools that need to be installed on the system:

  • bowtie2
  • samtools
  • blat

You will also need GNU make and R to run the pipeline itself.
All of the R package dependencies will be installed alongside the package using devtools.

Installation

All the following commands are to be executed in an R session.

Install devtools

install.packages("devtools")

If you have a working git installation and R is configured to access internet

devtools::install_github("maximewack/viroCapt")

If not, a local installation is also possible.
Download the package archive and extract it, then run

devtools::install("/path/to/viroCapt")

This will install the R package that will be used by the pipeline, provide the pipeline and reference files, as well as the Shiny webapp.

Usage

viroCapt contains two parts: a bioinformatics pipeline that runs in batch and produces static results, and an analysis tool that uses some of those results to produce visualizations.

Pipeline

The pipeline files are copied to the package installation directory.
You can find the location of that directory by executing the following command in R after the package is installed

system.file(package = "viroCapt")

To run the pipeline, you need the makefile file, and the refs folder, containing the reference genomes for all the HPV variants that were used in the design of the probes for HPV capture.
It is recommended to copy those files from the package to where you want to use them.

To run the pipeline, the makefile and refs folder should be in the same folder, with the fastq files to analyze in a subdirectory.
All fastq files should match one of these patterns:

  • xxx.R1.fastq and xxx.R2.fastq for pair-end read sequencing
  • xxx.U.fastq for single read sequencing

The fastq files can also be gzipped.

You can then run the pipeline with make, with the following options available (and their default value):

  • DATADIR (fastqs): folder with the fastq files to analyse
  • THREADS (2): number of threads to use for bowtie2 and blat
  • QCDIR (01_QC): destination of the QC plots
  • GLOBALDIR (02_Global): destination of the global alignement and crude genotyping results
  • LOCALDIR (03_Local): destination of the local alignement files and genotyping plots
  • FASTADIR (04_Fasta): destination of the fasta files of the candidate human sequences
  • BLATDIR (05_Blat): destination of the raw blat results, cleaned results and ranked results
  • VISUDIR (06_Visu): destination of the files used by the analysis tool
  • BT_OPTIONS_GLOBAL (–no-unal –very-sensitive): bowtie2 options for global alignment
  • BT_OPTIONS_LOCAL (–local –no-unal –very-sensitive-local): bowtie2 options for local alignement
  • BLAT_OPTIONS (-minScore=25 -minIdentitiy=90 -noHead): blat options for the human alignment
  • LOCAL_PLOT_LIMIT (5): top n genotypes to show in the local plots
  • FINAL_PLOT_LIMIT (1): top n genotypes to show in the final plots

For example, if your fastq files reside in the fastq_samples folder, and you want to run the pipeline on 8 CPU threads:

make DATADIR=fastq_samples THREADS=8

The default make target is All, which executes the QC, Local_plot, Final_plot, and Visu targets.
You specify any of those targets to re-run the pipeline up to that target.\ For example, if you want to change the number of genotypes presented in some local plots, you can delete the plots that you want re-generated and run

make DATADIR=fastq_samples LOCAL_PLOT_LIMIT=8 Local_plot

The pipeline will print the output of every command as it is used, and will print execution milestones in red.
All the analysis results will appear in the 01_* to 06_* folders within the folder containing the source fastq files.

The bowtie index files for all references will be built as needed, as well as the download of the reference human genome from the UCSC, usually with the first run.

Visualization

Once the pipeline has successfully run, you can use the rds files (in the 06_Visu folder) in the visualization tool.

In R, run

library(viroCapt)
visu()

or simply

viroCapt::visu()

You can use any option of shiny::runApp() in viroCapt::visu() to customize how the Shiny app runs.
For example, if you want to make the tool available to other users in the same network on port 1234

viroCapt::visu(host = "0.0.0.0",
               port = 1234)

Test data

The fastq files produced by running HPV capture on the HeLa cell line are distributed with the package, in the hela/ folder in the files installed by the package.
They are accompanied by the expected results files and can be used to check the package is correctly installed and running.