Useful, time-saving scripts, workflows and notes for everyday use
ls *r_1.fq.gz | parallel fastqc '{}'
Prerequisites:
Effective genome size: computed with effective_genome_size.sh
script or with unique-kmers.py
from khmer
package. The second method should be applied when only uniquely mapped reads are used for the analysis.
unique-kmers.py -k [MEAN READS LENGTH] [genome.fa]
Examples (for GRCh38 downloaded from GENCODE:
Mean reads length 28bp: 2414114105
Mean reads length 50bp: 2701262066
Genome in .2bit format Can be obtained from .fa using fatotwobit programm from UCSC:
faToTwoBit in.fa out.2bit
All BAM files need to be sorted and indexed before
Basic script:
computeGCBias -b file.bam --effectiveGenomeSize 270126066 -g genome.2bit -o output.txt -l 50 --biasPlot plot.png
Script to run quality check for multiple BAM files
Prerequisites:
File describing input data Example script for generation a file with input data from a sample sheet containing experimental condition names and IDs for each BAM file:
bamQC_table.R
qualimap multi-bamqc -c -d bamQC_fileslist.txt -gff regions.gtf -outdir multi_bamQC -outfile multi_bamQC.pdf -r
Indexing Run once for a particular set of reference transcripts (eg. obtained from GENCODE), -k parameter should be adjusted to the mean reads length, as it represents minimum acceptabe length for a valid match
salmon index -t gencode.v28.transcripts.fa -i gencode28_salmonindex25 --type quasi -k 25
Quantification Code for parallel (GNU parallel) processing of all .fastq.gz samples located in ./fastq directory with mean read length = 49 bp.
ls fastq/*_trim.fastq.gz | cut -d '/' -f 2 | cut -d '.' -f 1-5 | parallel salmon quant -i /home/JAK75/Documents/reference-genome/gencode28_salmonindex25 -l SF -r fastq/'{}'.fastq.gz -o transcript_quants/'{}' -p 4 --fldMean=49 --seqBias --gcBias --validateMappings --rangeFactorizationBins=4 --numBootstraps=1000
echo $(zcat file.fq.gz | wc -l) / 4 | bc
More here
cat *.fq > one.fq
ls fastq/*.fq.gz | cut -d '/' -f 2 | cut -d '.' -f 1-5 | parallel 'gunzip -c fastq/'{}'.fq.gz | grep "@" | cut -d ':' -f 10 | sort | uniq -c | sort -n | tail > index_stat/'{}'.txt'
A standard format of fasta headers in a transcriptome fasta file obtained from GENCODE is:
$ head gencode.v29.transcripts.fa
>ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|
GTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTC
TCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGA
TGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTG
CAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGG
GCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCACAACCAGGCAT
AGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCAG
TGTGTGGTGATGCCAGGCATGCCCTTCCCCAGCATCAGGTCTCCAGAGCTGCAGAAGACG
ACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAGGCAGGGCCATCA
GGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTGGACCAGTGATACACCCGGCACCCTG
However, for some analysis is much convenient to use a headers connsisting from only a single transcript ID. All other features of the transcript can be easily downloaded using transcript ID and for example BioMart R package. In order to modify fasta header:
./modify_fasta_header.py gencode.v29.transcripts.fa gencode.v29.transcripts_headers.fa
And finally - check the outcome:
head gencode.v29.transcripts_headers.fa
>ENST00000456328
GTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTC
TCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGA
TGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTG
CAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGG
GCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCACAACCAGGCAT
AGGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCAG
TGTGTGGTGATGCCAGGCATGCCCTTCCCCAGCATCAGGTCTCCAGAGCTGCAGAAGACG
ACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAGGCAGGGCCATCA
GGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTGGACCAGTGATACACCCGGCACCCTG
Index all .bam files in the folder. Obtained from Pierre Lindenbaum (Biostars forum)
ls *.bam | parallel samtools index '{}'