Docker container recipe of miRDP2 pipeline for miRNA identification

miRDP2

The miRPD2 is a flexible pipeline for identification of miRNA from miRNA-seq data, with updated criteria for plant. The source code was hosted on sourceforge. This images was built based on miRDP2 v1.1.4, check the detail from it's manual.

Cite

Zheng Kuang, Ying Wang, Lei Li, Xiaozeng Yang, miRDeep-P2: accurate and fast analysis of the microRNA transcriptome in plants, Bioinformatics, Volume 35, Issue 14, July 2019, Pages 2521–2522, https://doi.org/10.1093/bioinformatics/bty972

Image

docker pull obenno/mirdp2_pipeline:0.01

Data preparation

ncRNA contaminant

The input reads will have to be filtered to discard contaminants from other short RNAs (tRNA, snRNA, snoRNA, ribozyme etc.) before being mapped to the genome.

The ncRNAs other than miRNA were extracted from Rfam's datasets (v14.5). The family information and fasta files were downloaded from Rfam's FTP server.

## Download ncRNA fasta
wget -c http://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/database_files/family.txt.gz
zcat family.txt.gz |cut -f 1,19 | awk -F"\t" '$2~/tRNA|snRNA|snoRNA|ribozyme/'| awk '{print "wget -c --directory-prefix=./fasta/ http://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/fasta_files/"$1".fa.gz"}' | bash

## Build bowtie index
cat fasta/RF*.fa.gz | gunzip -c > miRDP2_ncRNA_rfam_v14.5.fa
bowtie-build -f ./miRDP2_ncRNA_rfam_v14.5.fa ./index/rfam_index

The rfam_index was prepared and packaged into the image.

Known miRNA

If refFile provided, miRNA will be compared with reference set, and be named accordingly The refFile should contain 6 columns: 5p_seq, 3p_seq, 5p_name, 3p_name, precursor_name, precursor_seq all known miRNA should be sorted and uniqufied by 5p_seq, 3p_seq (bedtools groupby)

example for prepare refFile from PmiREN2.0 soybean known miRs

cut -f 1,13,15,18,19,22 Glycine_max_basicInfo.txt |
    awk '
    BEGIN{OFS="\t"}
    NR==1{print}
    NR>1{k=length($2); sub("*","", $5); if(index($2,$4)<=k/2){$3=$3"-5p";$5=$5"-3p"}else{$3=$3"-3p";$5=$5"-5p"}; print}
    ' |
    awk 'NR>1{if($3~/5p$/){print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6}else{print $1"\t"$2"\t"$5"\t"$6"\t"$3"\t"$4}}' |
    awk '{print $3"\t"$4"\t"$5"\t"$6"\t"$1"\t"$2}' |
    sort -k 2,2 -k 4,4 -k 1,1 -k 3,3 |
    bedtools groupby -g 2,4 -c 1,3,5,6 -o collapse > mature_precursor_seq.tsv

Genome bowtie index

Genome fasta file has to be indexed by bowtie before running. The index file name (prefix) could be passed to option -x.

Example for soybean genome:

cd /home/ubuntu/data/database/Gmax_508_Wm82.a4.v1/assembly
bowtie-build --threads 30 Gmax_508_v4.0.fa Gmax_508_v4.0.fa

Running Command

Assume the input files are in the current directory, the reference genome and its bowtie idnex file are located in /home/ubuntu/data/database/Gmax_508_Wm82.a4.v1/assembly. The current directory will be bound to /data, and folder containing reference genome and index will be bound to /database. Reference genome file is Gmax_508_v4.0.fa, and its index name is also Gmax_508_v4.0.fa.

docker run -d --rm -v /home/ubuntu/data/database/Gmax_508_Wm82.a4.v1/assembly:/database \
           -v $(pwd):/data -u $(echo $UID):1000 obenno/mirdp2_pipeline:0.01 \
           -g /database/Gmax_508_v4.0.fa \
           -x /database/Gmax_508_v4.0.fa \
           -b /data/test_inputList \
           -t -d /data/mature_precursor_seq.tsv -p 40 -o /data/OutputDir

Please remember to add -q when using single fq, fastq, fq.gz or fastq.gz file as input.

Output Files

The two major outputs are merged_out_filter_P_prediction_reformatted_withFamilyInfo.tsv and combined.expr.tsv. The former one contains detail information of miRNA loci, while the latter one contains the CPM/TPM expressions of mature miRNAs.

Header of merged_out_filter_P_prediction_reformatted_withFamilyInfo.tsv:

Known/Novel	Locus ID	Locus Position	5-prime Mature Family	5-prime Mature Sequence	5-prime Representative Reads ID	3-prime Mature Family	3-prime Mature Sequence	3-prime Representative Reads ID	Precusor Family	Precusor Sequence

Header of combined.expr.tsv:

Mature Seq ID	miRNA Family	Mature Sequence	Sample IDs

Full Options

Main script is modified from miRDP2 v1.1.4.

miRDP2-v1.1.4_pipeline.bash <OPTIONS>

PLEASE REFER TO MIRDP2 MANUAL ALSO.

If input is single file with fasta format, then it will be treated
as pre-formatted fasta, and no trimming or collapsing will be conducted.

  OPTIONS:
    NECESSARY:
    -g/--genome <file>        reference genome file in fasta format
    -x/--index <path>/prefix  bowtie index file corresponding to reference genome
    -f/--fasta                formatted sRNA-seq file in fasta format
    -q/--fastq                unformatted sRNA-seq file in fastq
    -i/--input                input file, use batch option for multiple input files
    -b/--batch                list file of input files
    -o/--output <path>        path to output folder

    OPTIONAL:
    -d/--db                   reference miRNA information in tsv format (for family identification)
    -t/--trim                 switch for adapter trimmer step, default false
    -a/--adapter              adapter sequence to be trimmed [TGGAATTCTCGGGTGCCAAGG]
    -L/--locate <int>         THRESHOLD of different locations one read can map to, default is 15
    -N/--length <int>         THRESHOLD of length of precursor candidates, default is 300
    -M/--mismatch <int>       THRESHOLD of allowed mismatches for bowtie mapping, default is 0
    -R/--rpm <int>            THRESHOLD of reads rpm when filtering, default is 10
    -p/--thread <int>         number of thread used for RNAfold, default is 1
    --large-index             use this option when using large bowtie index, e.g. when using *.ebtwl files
    -h/--help                 print this help

Dockerfile

Dockerfile was written referring to https://pythonspeed.com/articles/activate-conda-dockerfile/ and Dockstore's docs.

Issues

The -j of RNAfold doesn't work in the container.

Treywea/miRDP2_docker