microRNA expression analysis of NextFlex small-RNA library prep
The analysis consists of the following steps:
0. Do qc on input data using fastqc
1. trim illumina adapter using trimGalore
The sequencing artifacts should be removed before quantification or else the reads will not be mapped.
2. Remove 4 random nucleotides at the begin/end of the reads
This is another artifact introduced by the nextflex protocol. You can also store this to try and remove PCR duplicates. in this implentation 1 read is kept is observed once and 2 reads are kept if observed >=2.
3. Collapse data
Data reduction step. Convert the reads in the fasta like mirdeep2 format
>>SEQ_ID_xCOUNT
NNNN
SEQ -> Sample identifier matching
[a-zA-Z][a-zA-Z0-9]{2}
ID -> read identifier matching
\d+
COUNT -> count of times the sequence is observed matching
\d+
NNNN -> Actual sequence matching
[ATCGNatcgn]+
4. Quantify using miRDeep2
Generate a microRNA expression profile using miRBase 21
code used for project
convert flowexport
For easy use in bash as sourceable object some conversion is needed.
#libreoffice -calc 161208_NB501043_0090_AH55LGBGX2.csv #for adding read files and whateversuits your fancy
cp ../../raw/161208_NB501043_0090_AH55LGBGX2/161208_NB501043_0090_AH55LGBGX2.csv ./
perl -i -wpe 's/,,/,undef,/g;s/,,/,undef,/g;' 161208_NB501043_0090_AH55LGBGX2.csv
perl -i -wpe 's/[\@\(\) \-]/_/g;s/,/\t/g' 161208_NB501043_0090_AH55LGBGX2.csv
perl -wane 'use Data::Dumper; if($. == 1){our $obj ; %{$obj} = ();@{ $obj -> {'array'} } = @F;warn Dumper $obj}else{for my $idx (0..(scalar(@{$obj->{'array'}})-1)){$obj->{'array'}->[$idx] =~ s/\//_/g;$F[$idx] =~ s/[^\w\/\.\-]/_/g;my $topr=$obj->{'array'}->[$idx]."=".$F[$idx].";"; print $topr; }print "\n"}' 161208_NB501043_0090_AH55LGBGX2.csv > 161208_NB501043_0090_AH55LGBGX2.source.table.sh; tail -n 10 *.source.table.sh
general setup
#
mkdir logs
run fastqc
fastqc for quality control
(for i in $(cat 161208_NB501043_0090_AH55LGBGX2.source.table.sh);do . <(echo $i);sbatch s00_fqqc_fastqc.sh $read1fqgz $Sample $lane; done)
run trim galore
adapter trimming tool for illumina adapter removal.
(for i in $(cat 161208_NB501043_0090_AH55LGBGX2.source.table.sh);do echo $i ; . <(echo $i);sbatch s01_fqtrim_trimgalore.sh $read1fqgz $Sample $lane; done)
Trim n4 adapters
Remove the random 4 nt sequence at the end and the start of the read and integrate it into the readname. also fumble up a shortname based on a name index.
(for i in $(cat 161208_NB501043_0090_AH55LGBGX2.source.table.sh);do . <(echo $i);if [ $lane -eq 1 ] ; then echo sbatch s02_trimN4.sh $read1fqgz $Sample "S"$(echo $Sample| perl -wpe 's/(^\d+).*/$1/;$_ = sprintf("%02d",$_);' ) | tee >(bash); fi; done)
Collapse NextFlex Collapse
#nextflex
(for i in $(cat 161208_NB501043_0090_AH55LGBGX2.source.table.sh);do . <(echo $i);if [ $lane -eq 1 ] ; then echo sbatch s03.1_CollapseNextFlex.sh $read1fqgz $Sample "S"$(echo $Sample| perl -wpe 's/(^\d+).*/$1/;$_ = sprintf("%02d",$_);' ) | tee >(bash); fi; done)
#normal
(for i in $(cat 161208_NB501043_0090_AH55LGBGX2.source.table.sh);do . <(echo $i);if [ $lane -eq 1 ] ; then echo sbatch s03.Collapse.sh $read1fqgz $Sample "S"$(echo $Sample| perl -wpe 's/(^\d+).*/$1/;$_ = sprintf("%02d",$_);' ) | tee >(bash); fi; done)
Run quantification
cat nfbccollapse/*/*.fa > nf.sample.all.reads.fa
cat collapse/*/*.fa > default.sample.all.reads.fa
cp /groups/umcg-griac/tmp04/projects/umcg-mterpstra/VandenBerge_MicroRNAseq/mirbase21 ./ -r
paste <(ls collapse/*/*.fa) <(ls collapse/*/*.fa | perl -wpe 's!.*\/(S\d\d).*.fa!$1!g;')> default.sample.config
paste <(ls nfbccollapse/*/*.fa) <(ls nfbccollapse/*/*.fa | perl -wpe 's!.*\/(S\d\d).*.fa!$1!g;')> nf.sample.config
sbatch s04_Quantifier.sh r2 10JAN2017 1 2 3 4
sbatch s04_Quantifier_customfa.sh r1 12JAN2017 1 2 3 4
Get Readcounts of fastq file from fastqc data
for i in $(ls fastqc/10_1610_Ong_SmallRNAseq_Batch2_1/161208_NB501043_0090_AH55LGBGX2_L1_TAGCTT_fastqc.html ); do echo -ne $i"\t"; perl -wpe 's!</tr>!</tr>\n!g;s!\</{0,1}t[dr]\>\</{0,1}t[dr]\>!\t!g' $i | grep 'Total'; done