RNA exome preprocessing pipeline

Pipeline used for Hulstaert et al., Charting Extracellular Transcriptomes in The Human Biofluid RNA Atlas, Cell Reports (2020), https://doi.org/10.1016/j.celrep.2020.108552

Scripts

Preparation

BasespaceRunDownloader.py

Set-up

You first have to get an accession token from https://developer.basespace.illumina.com/dashboard > Create a new Application > Fill in the fields (doesn't matter what you fill in) > Go to your new project > Credentials > Your Access Token.

Downloading sequencing runs.

BaseSpaceRunDownloader works with python 2.7, on the HPC e.g. ml Python/2.7.15-intel-2018b.

$ BaseSpaceRunDownloader.py -h

  Usage: BaseSpaceRunDownloader.py [-h] -p PROJECT -a TOKEN [-d DIRECTORY]

  Options:
    -h, --help            show this help message and exit
    -p PROJECT, --project=PROJECT
                          Project Name: required
    -a ACCESSTOKEN, --accesstoken=ACCESSTOKEN
                          Access Token: required if BASESPACE_API_TOKEN env
                          variable is not set
    -d DIRECTORY, --directory=DIRECTORY
                          Where the project directory is created, [default: ./]

You can find the Project Name in the BaseSpace: e.g. NSQ_Run532.

Fill in the project name, accesstoken and directory and the download should automatically start.

Reference files

You need to have (or create) a GTF file of your regions of interest (including spikes if needed), a STAR index (see manual of STAR), a chromosome and an exon file (example of the last 2 files are in this GitHub folder).

Main script

Usage

Make sure to run the pipeline with Python3 This pipeline makes use of several other tools that need to be installed or loaded first (change module loading in script to the version you have in case of submission to server)

FastQC - for quality control
FASTX-toolkit - for quality filtering
Cutadapt - for adapter trimming
Seqtk - for downsampling FASTQ files
STAR - for mapping
HTSeq - for quantification
Picard - for PCR duplicate removal
SAMtools - for sorting BAM file and index statistics
RSeQC - for strandedness info

$ python RNAexome_preprocessing.py -h

usage: RNAexome_preprocessing.py [-h] -t {se,pe} -b base_dir -o output_dir -m
                                 string_match -s {yes,no} --dedup {no,picard}
                                 [--polyA {yes,no}] [--adapter {truseq,no}]
                                 [--subs {no,start}] [-n N]
                                 [--repeat {yes,no}] -u user_email

Make RNA exome preprocessing scripts (and run on command line or submit to
slurm server)

optional arguments:
  -h, --help            show this help message and exit
  -t {se,pe}            Single end (se) or paired end (pe) data
  -b base_dir           Base directory where the sample subdirectories are
                        located
  -o output_dir         Directory where output should be created
  -m string_match       String match to select sample folders in base
                        directory e.g. RNA0
  -s {yes,no}           (reverse) stranded sequencing or not (unstranded)?
  --dedup {no,picard}   Method of duplicate removal [no, picard]
  --polyA {yes,no}      PolyA trimming needed? [default: no]
  --adapter {truseq,no}
                        Adapter trimming needed? If so, which one [default:
                        no]
  --subs {no,start}     Subsampling needed? [default: no] If so, when? At
                        start or after clumpify duplicate removal
  -n N                  Nr of reads to subsample to. If you need the analysis
                        to stop in order to determine the subsampling level
                        first, do not enter a number here (or -n 0)
  --repeat {yes,no}     Repeated analysis? [default: no] If yes: general fastq
                        copying and filtering will not be repeated
  -u user_email         Submitter email address (used for error reporting)

This will create, for each sample in your base directory, individual jobscripts to run subparts of the pipeline which can be run on command line.

e.g. bash 00_combinecopy_RNA0001.sh

If you have a server that accepts slurm, you can comment out lines in the script to directly submit jobs with dependencies.

Output

In the output directory, there will be a number of files and folders created. The directory will look like this:

.
└── <samplename>                                                  # Main sample directory
    ├── dedup_picard-subs_none                                    # Subdirectory according to chosen deduplication and subsampling
    │   ├── 05_srout                                              # Directory that contains all BAM files
    │   │   ├── <samplename>.Aligned.sortedByCoord.out.bam        # Initial BAM file (mapping with STAR)
    │   │   ├── <samplename>.Aligned.sortedByCoord.picard.bam     # BAM after duplicate removal
    │   │   ├── <samplename>.Aligned.sortedByCoord.picard.bam.bai
    │   │   ├── <samplename>.Aligned.sortedByName.picard.bam
    │   │   ├── <samplename>.Log.final.out
    │   │   ├── <samplename>.Log.out
    │   │   ├── <samplename>.Log.progress.out
    │   │   ├── <samplename>.SJ.out.tab
    │   │   ├── <samplename>._STARgenome
    │   │   │   ├── ...
    │   │   ├── <samplename>._STARpass1
    │   │   │   ├── Log.final.out
    │   │   │   └── SJ.out.tab
    │   │   ├── <samplename>.Unmapped.out.mate1
    │   │   ├── <samplename>.Unmapped.out.mate1.gz
    │   │   ├── <samplename>.Unmapped.out.mate2
    │   │   └── <samplename>.Unmapped.out.mate2.gz
    │   ├── 07_picardout            
    │   │   └── <samplename>_picard_dup.metrics                     # Picard duplicate removal stats
    │   ├── 09_htout
    │   │   └── <samplename>_htseq_counts.txt                       # HTSeq quantification output
    │   ├── 10_idxout
    │   │   └── <samplename>_idxstat.txt                            # SAMtools index stats
    │   ├── <samplename>_RSeQC_output_all.txt                       # RSeQC strandedness information
    │   ├── <samplename>_RSeQC_output.txt
    │   └── scripts                                                 # Folder that contains all scripts to run
    │       ├── 00_combinecopy_<samplename>.sh
    │       ├── 01_fastqc_<samplename>_orig.sh
    │       ├── 02_trimadapter_<samplename>.sh
    │       ├── 03_removebadqc_RNA010101.sh
    │       ├── 04_fastqc_<samplename>_filt.sh
    │       ├── 05_alignSTAR_<samplename>.sh
    │       ├── 06_strandedness_<samplename>.sh
    │       ├── 07_removedupPicard_<samplename>_picard.sh
    │       ├── 08_sortbambyname_<samplename>_picard.sh
    │       ├── 09_countshtseq_<samplename>.sh
    │       ├── 10_idxstat_<samplename>.sh
    │       └── 11_zipfastq_<samplename>.sh
    ├── FASTQC_original                                               # FASTQC of original FASTQ files
    │   ├── <samplename>_1_fastqc.html
    │   ├── <samplename>_1_fastqc.zip
    │   ├── <samplename>_2_fastqc.html
    │   └── <samplename>_2_fastqc.zip
    ├── FASTQC_qc                                                     # FASTQC of adapter trimmed and filtered FASTQ files
    │   ├── <samplename>_1_qc_fastqc.html
    │   ├── <samplename>_1_qc_fastqc.zip
    │   ├── <samplename>_2_qc_fastqc.html
    │   └── <samplename>_2_qc_fastqc.zip
    ├── qc_lines.txt                                                  # number of lines remaining after adapter trimming and filtering (divide by 4 to get number of reads)
    ├── <samplename>_1.fastq.gz                                       # Original FASTQ file R1
    ├── <samplename>_1_qc.fastq.gz                                    # Adapter trimmed and filtered FASTQ file R1
    ├── <samplename>_2.fastq.gz                                       # Original FASTQ file R2
    └── <samplename>_2_qc.fastq.gz                                    # Adapter trimmed and filtered FASTQ file R2

OncoRNALab/RNAexome