viroCapt is a bioinformatics pipeline and analysis platform for detecting viral insertions in the human genome from viral capture data.
viroCapt is implemented as an R package, which contains the pipeline implemented as a Makefile, dedicated R functions from the package to infer, score and rank insertions candidates, and a Shiny webapp for interactive visualization.
viroCapt depends on standard bioinformatics tools that need to be installed on the system:
- bowtie2
- samtools
- blat
You will also need GNU make
and R
to run the pipeline itself.
All of the R package dependencies will be installed alongside the package using devtools
.
All the following commands are to be executed in an R session.
Install devtools
install.packages("devtools")
If you have a working git installation and R is configured to access internet
devtools::install_github("maximewack/viroCapt")
If not, a local installation is also possible.
Download the package archive and extract it, then run
devtools::install("/path/to/viroCapt")
This will install the R package that will be used by the pipeline, provide the pipeline and reference files, as well as the Shiny webapp.
viroCapt contains two parts: a bioinformatics pipeline that runs in batch and produces static results, and an analysis tool that uses some of those results to produce visualizations.
The pipeline files are copied to the package installation directory.
You can find the location of that directory by executing the following command in R after the package is installed
system.file(package = "viroCapt")
To run the pipeline, you need the makefile
file, and the refs
folder, containing the reference genomes for all the HPV variants that were used in the design of the probes for HPV capture.
It is recommended to copy those files from the package to where you want to use them.
To run the pipeline, the makefile
and refs
folder should be in the same folder, with the fastq
files to analyze in a subdirectory.
All fastq
files should match one of these patterns:
xxx.R1.fastq
andxxx.R2.fastq
for pair-end read sequencingxxx.U.fastq
for single read sequencing
The fastq
files can also be gzipped.
You can then run the pipeline with make
, with the following options available (and their default value):
- DATADIR (fastqs): folder with the
fastq
files to analyse - THREADS (2): number of threads to use for
bowtie2
andblat
- QCDIR (01_QC): destination of the QC plots
- GLOBALDIR (02_Global): destination of the global alignement and crude genotyping results
- LOCALDIR (03_Local): destination of the local alignement files and genotyping plots
- FASTADIR (04_Fasta): destination of the
fasta
files of the candidate human sequences - BLATDIR (05_Blat): destination of the raw
blat
results, cleaned results and ranked results - VISUDIR (06_Visu): destination of the files used by the analysis tool
- BT_OPTIONS_GLOBAL (–no-unal –very-sensitive):
bowtie2
options for global alignment - BT_OPTIONS_LOCAL (–local –no-unal –very-sensitive-local):
bowtie2
options for local alignement - BLAT_OPTIONS (-minScore=25 -minIdentitiy=90 -noHead):
blat
options for the human alignment - LOCAL_PLOT_LIMIT (5): top n genotypes to show in the local plots
- FINAL_PLOT_LIMIT (1): top n genotypes to show in the final plots
For example, if your fastq
files reside in the fastq_samples
folder, and you want to run the pipeline on 8 CPU threads:
make DATADIR=fastq_samples THREADS=8
The default make
target is All, which executes the QC, Local_plot, Final_plot, and Visu targets.
You specify any of those targets to re-run the pipeline up to that target.\
For example, if you want to change the number of genotypes presented in some local plots, you can delete the plots that you want re-generated and run
make DATADIR=fastq_samples LOCAL_PLOT_LIMIT=8 Local_plot
The pipeline will print the output of every command as it is used, and will print execution milestones in red.
All the analysis results will appear in the 01_*
to 06_*
folders within the folder containing the source fastq
files.
The bowtie index files for all references will be built as needed, as well as the download of the reference human genome from the UCSC, usually with the first run.
Once the pipeline has successfully run, you can use the rds
files (in the 06_Visu
folder) in the visualization tool.
In R, run
library(viroCapt)
visu()
or simply
viroCapt::visu()
You can use any option of shiny::runApp()
in viroCapt::visu()
to customize how the Shiny app runs.
For example, if you want to make the tool available to other users in the same network on port 1234
viroCapt::visu(host = "0.0.0.0",
port = 1234)
The fastq
files produced by running HPV capture on the HeLa cell line are distributed with the package, in the hela/
folder in the files installed by the package.
They are accompanied by the expected results files and can be used to check the package is correctly installed and running.