/read-split-fly

Source code for the Read-Split-Fly pipeline for discovering novel non-canonical splice junctions in RNA-Seq data

Primary language: C++. License: Apache License 2.0.

README

This is the software package for the Read-Split-Fly pipeline. Included are the scripts and software necessary to run the entire process from beginning to end.

INSTALLATION

  1. Download and unzip or clone the repository to a location of your choice.
  2. Satisfy dependencies listed below.
  3. Change to the installation directory.
  4. Type make.

Satisfy Dependencies(4)

1) Perl 5.16 (or later):

The converter for the gene reference file (see below) requires Perl to run. We tested it with version 5.16, though it is likely that earlier versions will work. Download Perl.

To verify that your installation of Perl is compatible, check the output of the command:

  • perl --version

2) Python 2.7 (or later):

The encoding guesser requires Python to be installed on the system, with its executable in an accessible location (for example, on the PATH). Earlier versions may work, but that isn't guaranteed. Download Python.

To verify that your installation of Python is compatible, check the output of the command:

  • python --version

3) Bowtie 1.0.1 (or later):

RSF comes packaged with bowtie version 1.1.2, which is used by default. See the CONFIGURATION section to use a different version.

If you choose to install bowtie in another location, make sure that location is in the PATH (see also CONFIGURATION and DEPENDENCIES below). Having bowtie in the PATH is recommended in any case. Download Bowtie.

To verify that your installation of bowtie is compatible, check the output of the command:

  • bowtie --version

4) gcc(g++) version 4.8 (or later):

Compiling the associated programs requires gcc version 4.8 or higher (to accommodate the C++11 features used in splitpairs).

To verify that your installation of gcc is compatible, check the output of the command:

  • gcc --version
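
The four checks above can be combined into one pass. The following is a minimal sketch, assuming a POSIX shell; have_tool and check_rsf_tools are hypothetical helper names that are not part of the RSF package. It only reports whether each executable is on the PATH, so the version numbers still need to be read from the --version output.

```shell
#!/bin/sh
# Hypothetical helpers (not part of RSF): report whether each tool that RSF
# depends on can be found on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

check_rsf_tools() {
    missing=0
    for tool in "$@"; do
        if have_tool "$tool"; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    # Succeed only if every tool was found.
    [ "$missing" -eq 0 ]
}

# The tools named in the dependency list above.
check_rsf_tools perl python bowtie gcc g++ ||
    echo "install the missing tools before running make"
```

If the packaged bowtie is used instead of a system-wide installation, bowtie may be reported as missing here even though RSF can still find it.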

Bowtie Index files:

You will need bowtie index files and knownGene files for the genome(s) of your choice. Put the bowtie index files in the directory specified by the BOWTIE_INDEXES variable in config.sh. This sets the environment variable bowtie uses locally during RSF execution. By default, RSF is set to look in BASE_DIR/bt/indexes. More on this in the CONFIGURATION section below.

Indexes can be found on the bowtie website. We use the human hg19 and mouse mm9 genomes.

refFlat files:

For the genomes to which you plan to align, download and uncompress their refFlat reference file from UCSC. Place the refFlat.txt file in the same directory as your bowtie indexes, and change the name of the file to have the following pattern: OrganismAssemblyName.refFlat.txt

Note: These file names are case sensitive.

The splitPairs portion of Read-Split-Fly requires a special parsed refFlat reference with intron/exon boundaries identified. We provide a script to create this file called refflat_parse_RSW.pl in the BASE_DIR. Run this script and supply the refFlat reference file as the input argument.

  • perl refflat_parse_RSW.pl /usr/local/bowtie/indexes/hg19.refFlat.txt
    • This generates the annotated file hg19.refFlat.txt.intronBoundary.exonsgaps
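
Before starting a run, it can help to confirm that both reference files for an assembly are in place. The sketch below assumes the two file names described above and that both files live in the same directory; missing_ref_files is a hypothetical helper, not part of the RSF package.

```shell
#!/bin/sh
# Hypothetical helper (not part of RSF): print the name of each reference
# file for one assembly that is not present in the reference directory.
missing_ref_files() {
    ref_dir=$1
    assembly=$2
    for name in \
        "$assembly.refFlat.txt" \
        "$assembly.refFlat.txt.intronBoundary.exonsgaps"
    do
        [ -f "$ref_dir/$name" ] || echo "$name"
    done
}
```

For example, missing_ref_files /usr/local/bowtie/indexes hg19 prints nothing when both files exist, and otherwise prints one missing file name per line.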

CONFIGURATION (optional):

There is no mandatory configuration that needs to be done if you have followed the installation instructions to this point. The following are presented as options for the advanced user.

The configuration file config.sh contains all the configurable values used by the pipeline. It can be edited with any plain-text editor (nano, vim, etc). You can change the following variables in the USER CONFIGURATION section to suit your needs:

  • RM_TEMP_FILES: Set to 1 to delete intermediate files at the end of RSF execution, 0 to keep them

    • Default: 1
  • NUM_THREADS: Number of concurrent threads to use for bowtie alignment steps

    • Default: 4
  • BASE_TEMP_DIR: With default settings, location where different intermediate files are stored

    • Default: BASE_DIR/tmp
  • BOWTIE_TEMP_DIR: location to store intermediate bowtie files.

    • Default: BASE_DIR/tmp/bowtie
  • SPLIT_TEMP_DIR: location to store intermediate split reads files.

    • Default: BASE_DIR/tmp/split
  • RSR_TEMP_DIR: location to store intermediate RSR options and output files.

    • Default: BASE_DIR/tmp/splitpairs
  • LOG_DIR: location to store diagnostic and operational logs.

    • Default: BASE_DIR/logs
  • REFDIR: Directory containing refFlat files (and gene Intron/exon boundary files). All files for your available genomes must be in this directory.

    • Default: BOWTIE_INDEXES
  • BOWTIE_PROGRAM: the absolute path to bowtie executable (e.g. /usr/bin/bowtie).

    • Default: BASE_DIR/bt/bowtie

All other variables are internal-use and should not be changed. Read the comments in the configuration file for details as to what function they provide.
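
As an illustration, the documented defaults could be written as shell assignments like the ones below. This is a sketch only: the variable names and default values come from the list above, but the exact layout of the real config.sh is not shown in this README, and the BASE_DIR path used here is a placeholder assumption.

```shell
# Sketch of the USER CONFIGURATION values with the defaults listed above.
# BASE_DIR is assumed to point at the RSF installation directory; the path
# below is a placeholder, and the real config.sh may set it differently.
BASE_DIR="${BASE_DIR:-$HOME/read-split-fly}"

RM_TEMP_FILES=1                          # 1 = delete intermediate files, 0 = keep
NUM_THREADS=4                            # threads for bowtie alignment steps
BOWTIE_INDEXES="$BASE_DIR/bt/indexes"    # directory holding the bowtie indexes
BASE_TEMP_DIR="$BASE_DIR/tmp"            # parent of the intermediate directories
BOWTIE_TEMP_DIR="$BASE_DIR/tmp/bowtie"   # intermediate bowtie files
SPLIT_TEMP_DIR="$BASE_DIR/tmp/split"     # intermediate split-read files
RSR_TEMP_DIR="$BASE_DIR/tmp/splitpairs"  # intermediate RSR options and output
LOG_DIR="$BASE_DIR/logs"                 # diagnostic and operational logs
REFDIR="$BOWTIE_INDEXES"                 # refFlat and intron/exon boundary files
BOWTIE_PROGRAM="$BASE_DIR/bt/bowtie"     # absolute path to the bowtie executable
```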

DEPENDENCIES:

All shell scripts (files ending in .sh) rely on rsf_config.sh, which holds the configuration data. Other scripts call upon other executables as needed, as depicted below.

rsf_batch_job.sh
|---pipeline.sh
|---bowtie.sh
|   |---bowtie v1.0.1 or newer
|   |---guess-encoding.py
|       |---Python 2.7 or newer
|---split.sh
|   |---srr
|---sfc 
|---splitPairs.sh
|   |---sp4

compare_sh
|---compare

refFlat_parse_RSW.pl
|---Perl 5.16 or newer

sbc
|---No dependencies

RUNNING:

To run the pipeline, first, set the BOWTIE_INDEXES variable in config.sh to the location of your bowtie indexes directory.

After that, you can execute rsf_batch_job.sh with the following inputs, in order:

  • mode:

    • analytic or comparison
    • Analytic jobs produce RSF output files. This is the standard mode.
    • Comparative jobs produce RSF output files for both data sets and a file which shows the differences between the two.
  • genome:

    • The assembly name of the bowtie index for the genome to which to align reads. Also specifies which refFlat file to use (see INSTALLATION).
  • readsFile:

    • The file(s) with RNA-Seq data in plain-text FASTQ format. The nature of your run will determine how you should specify your files.
      • Single-ended, no replicates:
        • "file_name_with_full_path"
      • Single-ended with replicates:
        • "replicate1.fastq,replicate2.fastq,..."
      • Paired-ended, no replicates:
        • "left-data.fastq|right-data.fastq"
      • Paired-ended with replicates:
        • "Replicate1_1.fastq,Replicate2_1.fastq|Replicate1_2.fastq,Replicate2_2.fastq"
        • Make sure your pairs are ordered correctly:
          • "REPLICATE 1 LEFT, REPLICATE 2 LEFT | REPLICATE 1 RIGHT, REPLICATE 2 RIGHT"
  • [readsFile2]:

    • For use in comparison mode, a second set of reads files goes here; the format is the same as above.
  • maxGoodAlignments:

    • Maximum number of matches allowed in bowtie (see bowtie -k and -m parameters).
  • minSplitSize:

    • Smallest length to split your reads into. If you specify more than half the reads' length, the pipeline will exchange it with (readlength - minSplitSize).
      • The smaller your split, the more memory, disk space, and time will be needed.
  • minSplitdistance:

    • Minimum distance allowed between split-reads to be considered a splice-junction candidate.
  • maxSplitdistance:

    • Maximum distance allowed between split-reads to be considered a splice-junction candidate.
  • regionBuffer:

    • Maximum distance between the start-position of candidate junctions considered for support.
  • requiredSupports:

    • Minimum number of supporting reads a splice-junction candidate must have to be reported.
  • pathToSaveResults:

    • Path to directory where RSF results will be stored.
  • BLAST e-value:

    • e-value passed to BLAST to query RSF results against miRNA and u12db databases.
    • If set to 0, this post-processing step will be ignored.
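
Two of the inputs above follow rules that are easy to get wrong: the readsFile string, where a comma separates replicates and a vertical bar separates left reads from right reads, and the minSplitSize exchange. The sketch below illustrates both rules as described in this README. list_reads and effective_min_split are hypothetical helpers and are not taken from the RSF scripts, which may handle these inputs differently.

```shell
#!/bin/sh
# Hypothetical helpers (not part of RSF) that mirror two documented rules.

# Print one line per replicate. Single-ended input prints the file name;
# paired-ended input prints "left<TAB>right" for each replicate.
list_reads() {
    spec=$1
    paired=0
    left=$spec
    right=
    case $spec in
        *'|'*) paired=1; left=${spec%%|*}; right=${spec#*|} ;;
    esac
    while [ -n "$left" ]; do
        l=${left%%,*}
        # Drop the first item from the list, or empty it after the last item.
        case $left in *,*) left=${left#*,} ;; *) left= ;; esac
        if [ "$paired" -eq 1 ]; then
            r=${right%%,*}
            case $right in *,*) right=${right#*,} ;; *) right= ;; esac
            printf '%s\t%s\n' "$l" "$r"
        else
            printf '%s\n' "$l"
        fi
    done
}

# Apply the exchange rule for minSplitSize: a value above half the read
# length is replaced by (read length - minSplitSize).
effective_min_split() {
    read_len=$1
    min_split=$2
    if [ $((2 * min_split)) -gt "$read_len" ]; then
        echo $((read_len - min_split))
    else
        echo "$min_split"
    fi
}
```

With 50-base reads, effective_min_split 50 33 prints 17, while effective_min_split 50 11 prints 11 unchanged. Listing the pairs with list_reads before a paired-ended run makes a misordered replicate visible, because each output line should show the left and right file of the same replicate.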

Examples:

Here are example command lines for various kinds of runs. The assembly names are real, but the file names are made up.

Analytic runs:

Normal Run:

rsf_batch_job.sh analytic mm9 "mus1.fastq" 11 11 3 30000 4 2 ~/mm9_results 0.1

Normal Run with Replicates:

rsf_batch_job.sh analytic hg19 "hg19-1.fastq,hg19-2.fastq,hg19-3.fastq" 2 30 2 50000 5 2 ~/hg19_results 0.01

Paired-Ended Run:

rsf_batch_job.sh analytic hg19sp101 "hg19_1.fastq|hg19_2.fastq" 11 33 3 100000 5 2 ~/hg19_paired_results .001

Paired-Ended Run with Replicates:

rsf_batch_job.sh analytic hg19sp101 "hg19-1_1.fastq,hg19-2_1.fastq|hg19-1_2.fastq,hg19-2_2.fastq" 11 33 3 50000 5 2 ~/hg19_pair_repl_results .1

Comparative runs:

Normal Run:

rsf_batch_job.sh comparison mm9 "set1.fastq" "set2.fastq" 2 15 3 30000 4 2 ~/mm9_compare 0.1

Normal Run with Replicates:

rsf_batch_job.sh comparison hg19 "set1replicate1.fastq,set1replicate2.fastq" "set2replicate1.fastq,set2replicate2.fastq" 2 33 3 50000 5 2 ~/hg19_results 0.1

Paired-Ended Run:

rsf_batch_job.sh comparison hg19sp101 "set1_1.fastq|set1_2.fastq" "set2_1.fastq|set2_2.fastq" 11 33 3 50000 5 2 ~/hg19_paired_results 0.1
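
Because the inputs are positional, a small dry-run wrapper can help check the argument order before starting a long job. The sketch below is an assumption-laden illustration: rsf_command is a hypothetical helper that only prints the command line and never runs the pipeline, and it assumes that analytic mode takes one reads argument and comparison mode takes two, as in the examples above.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not part of RSF): print an rsf_batch_job.sh
# command line so the argument order can be checked before running it.
# Usage: rsf_command MODE GENOME READS READS2 followed by the eight
#        remaining inputs. Pass "" as READS2 in analytic mode.
rsf_command() {
    if [ "$#" -lt 4 ]; then
        echo "usage: rsf_command mode genome reads reads2 ..." >&2
        return 1
    fi
    mode=$1; genome=$2; reads=$3; reads2=$4
    shift 4
    # Remaining inputs, in order: maxGoodAlignments minSplitSize
    # minSplitdistance maxSplitdistance regionBuffer requiredSupports
    # pathToSaveResults and the BLAST e-value.
    if [ "$#" -ne 8 ]; then
        echo "expected 8 inputs after the reads files, got $#" >&2
        return 1
    fi
    case $mode in
        analytic)
            printf 'rsf_batch_job.sh %s %s "%s"' "$mode" "$genome" "$reads" ;;
        comparison)
            printf 'rsf_batch_job.sh %s %s "%s" "%s"' \
                "$mode" "$genome" "$reads" "$reads2" ;;
        *)
            echo "mode must be analytic or comparison" >&2
            return 1 ;;
    esac
    printf ' %s' "$@"
    printf '\n'
}
```

For example, rsf_command comparison mm9 set1.fastq set2.fastq 2 15 3 30000 4 2 ~/mm9_compare 0.1 prints the same command as the first comparative example, with the home directory expanded by the shell.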

KNOWN ISSUES

  • The quality-encoding detection portion of bowtie.sh is known to cause a broken pipe with awk. This is acceptable and does not interfere with the performance of the pipeline.
  • There are currently some extra dependencies on GNU- or Linux-specific software that we are actively working on removing. These may show up if you are trying to run RSF on a Mac or in a stripped-down Unix or Linux distribution.

COPYRIGHT

For questions, please contact Jeff Kinne jkinne@cs.indstate.edu

Read-Split-Run is copyright(c) 2014-2015 Yongsheng Bai, Brandon Donham, Randal J. Kaufman, Jeff Kinne.
Read-Split-Fly is copyright(c) 2015-2016 Yongsheng Bai, Jeff Kinne, Aaron Cox, Feng Jiang, Siva Dharman Naidu.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 (a copy is also provided in the accompanying LICENSE file).

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Tools and Databases Developed Elsewhere

RSF has incorporated several tools and databases into its pipeline. We would like to thank their creators for their contributions to the field.

BOWTIE

This package of Read-Split-Fly installs Bowtie version 1.1.2 and uses it by default. Bowtie is licensed under the Artistic License, a copy of which can be found here. We have modified the Makefile and bowtie_inspect.cpp, which we are including as bowtie_inspect_RSR.cpp.

These modified files are being released as part of Read-Split-Fly under the Apache License, Version 2.0.

Downstream Processing

To further extend the usefulness of the Read-Split-Fly software, we have built optional downstream processing into the pipeline. We use the BLAST+ suite to compare various nucleotide sequences found in the miRBase and U12DB databases against the candidate splice junctions identified by Read-Split-Fly.

BLAST+

BLAST Homepage

BLAST+ Article

miRBase

miRBase Homepage

miRBase Article

U12DB

U12DB Homepage

U12DB Article