VARUS was originally written by Willy Bruhn as a Bachelor's thesis supervised by Mario Stanke. This repository is a copy of https://github.com/WillyBruhn/VARUS made in November 2018 and contains many bugfixes, an incremental intron database feature, and an extension for using HISAT as an alternative alignment program.
VARUS automates the selection and download of a limited number of RNA-seq reads from NCBI's Sequence Read Archive (SRA), targeting a sufficiently high coverage for many genes for the purpose of gene-finder training and genome annotation. Each iteration of the online algorithm
- selects a run to download that is expected to complement previously downloaded reads
- downloads a sample of reads ("batch") from the run with fastq-dump
- aligns the reads with STAR or HISAT
- evaluates the alignment
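The download step of one iteration can be pictured with the following minimal sketch. It assumes that a batch is a contiguous range of spots selected with fastq-dump's -N (first spot) and -X (last spot) options; the run accession, batch index and output directory are hypothetical placeholders, and the sketch only prints the command instead of running it. It is not meant to reproduce how VARUS itself chooses runs or batches.

```shell
#!/bin/sh
# Hypothetical values, for illustration only.
run="SRR0000000"      # placeholder accession, not a real run
batch_size=50000      # reads per batch
batch_index=3         # 0-based index of the batch within the run

# fastq-dump spot IDs are 1-based and the -N/-X range is inclusive.
first=$(( batch_index * batch_size + 1 ))
last=$(( (batch_index + 1) * batch_size ))

# Dry run: print the command rather than starting a download.
echo "fastq-dump -N $first -X $last --split-files -O batch_$batch_index $run"
```

With these values the printed range covers spots 150001 to 200000, that is, exactly one batch of 50000 reads.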
Clone the repository from the command line:
git clone https://github.com/MarioStanke/VARUS.git
VARUS depends on
- samtools,
- bamtools, install on Ubuntu with
sudo apt-get install bamtools libbamtools-dev
- fastq-dump and
- STAR or HISAT2 (tested with HISAT2 version 2.0.0-beta)
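Before compiling, it can help to check which of these tools are on the PATH. The following is a minimal sketch that assumes the usual executable names (samtools, bamtools, fastq-dump, STAR, hisat2); it does not check versions, and it does not check for the bamtools development library that the Ubuntu package above installs.

```shell
#!/bin/sh
missing=""

# Tools that are needed in every setup.
for tool in samtools bamtools fastq-dump; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done

# Only one of the two aligners has to be present.
if ! command -v STAR >/dev/null 2>&1 && ! command -v hisat2 >/dev/null 2>&1; then
    missing="$missing STAR-or-hisat2"
fi

if [ -z "$missing" ]; then
    echo "all dependencies found"
else
    echo "missing:$missing"
fi
```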
Compile VARUS manually with
cd Implementation
make
By default, the NCBI tool fastq-dump creates temporary files under ~/ncbi that are as large as the run file being read, even if only a small part of the run is downloaded. For most users this caching probably requires too much disk space; disable it with
mkdir -p ~/.ncbi
echo '/repository/user/cache-disabled = "true"' >> ~/.ncbi/user-settings.mkfg
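Because the echo command above appends to the file, running it repeatedly adds the same line several times. A minimal sketch of a variant that only appends the setting when it is not yet present is shown below. The variable VARUS_NCBI_CFG is a hypothetical override introduced here so that the snippet can be tried on a scratch file; it is not something fastq-dump or VARUS reads.

```shell
#!/bin/sh
# VARUS_NCBI_CFG is a hypothetical override for experimenting; by default
# the settings file named above is used.
cfg="${VARUS_NCBI_CFG:-$HOME/.ncbi/user-settings.mkfg}"
setting='/repository/user/cache-disabled = "true"'

mkdir -p "$(dirname "$cfg")"

# -F: fixed string, -x: match the whole line, -q: no output.
if grep -Fxq -- "$setting" "$cfg" 2>/dev/null; then
    echo "caching already disabled in $cfg"
else
    echo "$setting" >> "$cfg"
    echo "added cache-disabled setting to $cfg"
fi
```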
Change to the directory example and follow the instructions in example/README.
Copy the file VARUSparameters.txt from the example folder to your working directory and adjust it if necessary. The most important parameters are:
--batchSize specifies how many reads should be downloaded in each iteration (e.g. 50000 or 200000)
--maxBatches specifies how many batches should be downloaded at most
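Taken together, the two parameters bound the total download: at most batchSize times maxBatches reads are fetched. A small worked example follows; the value chosen for maxBatches is hypothetical, and the bound assumes that every batch is completely filled.

```shell
#!/bin/sh
batch_size=50000     # one of the example values given for --batchSize
max_batches=1000     # hypothetical value for --maxBatches

# Upper bound on the number of reads downloaded over the whole run.
max_reads=$(( batch_size * max_batches ))
echo "at most $max_reads reads will be downloaded"
```

Here the bound is 50000000 reads.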
The final output is a sorted spliced alignment file (all batches together) called VARUS.bam.
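Since samtools is already a dependency, it can be used for a quick sanity check of the result. The sketch below assumes that VARUS.bam is in the current directory; it uses samtools quickcheck to detect a truncated or invalid file and samtools flagstat to print alignment statistics, and it only reports a message if samtools or the file is missing.

```shell
#!/bin/sh
bam="VARUS.bam"   # assumed to be in the current directory

if ! command -v samtools >/dev/null 2>&1; then
    status="samtools not found"
elif [ ! -f "$bam" ]; then
    status="$bam not found"
elif samtools quickcheck "$bam"; then
    # Print summary counts such as total and mapped reads.
    samtools flagstat "$bam"
    status="$bam looks intact"
else
    status="$bam is truncated or not a valid BAM file"
fi

echo "$status"
```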
Please cite: VARUS: sampling complementary RNA reads from the sequence read archive. 2019; BMC Bioinformatics, 20:558
Willy Bruhn's Bachelor's thesis on VARUS can be found in /docs/Thesis.