This repository contains scripts to assess the quality of RNA-seq data sets.
This script analyzes all reads in a FASTQ file. If multiple k-mers of a read match an rRNA sequence, the read is classified as rRNA derived. This allows different FASTQ files to be compared with respect to the rRNA contamination of the underlying samples.
Usage
python3 rRNA_check.py --fastq <FILE> --rRNA <FILE> --out <DIR>
Mandatory:
--fastq STR FASTQ input file
--rRNA STR rRNA FASTA input file
--out STR Output folder
Optional:
--kmer STR Kmer size [21]
--cutoff STR Cutoff for read classification [3]
--fastq
specifies a FASTQ input file that is analyzed. Each read is checked for the presence of rRNA sequence k-mers.
--rRNA
specifies a FASTA input file that contains rRNA sequences of the species of interest. These sequences are processed to extract k-mers.
--out
specifies an output folder. This folder will be created if it does not exist already.
--kmer
specifies the size of k-mers used to screen reads for rRNA similarity. Default value is 21.
--cutoff
specifies the number of k-mers that need to be detected in a read in order to classify it as an rRNA read. Default is 3.
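The classification described above can be sketched as follows. This is a minimal illustration and not the code of rRNA_check.py: the function names are hypothetical, and both the inclusion of reverse-complement k-mers and the "at least cutoff hits" rule are assumptions about how the script behaves.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement.get(base, "N") for base in reversed(seq.upper()))


def build_kmer_set(rrna_seqs, k=21):
    """Collect all k-mers from the rRNA sequences (both strands; an assumption)."""
    kmers = set()
    for seq in rrna_seqs:
        for strand in (seq.upper(), reverse_complement(seq)):
            for i in range(len(strand) - k + 1):
                kmers.add(strand[i:i + k])
    return kmers


def is_rrna_read(read, kmers, k=21, cutoff=3):
    """Return True if at least `cutoff` k-mers of the read occur in `kmers`."""
    read = read.upper()
    hits = 0
    for i in range(len(read) - k + 1):
        if read[i:i + k] in kmers:
            hits += 1
            if hits >= cutoff:
                return True  # enough matching k-mers, stop early
    return False
```

A smaller k-mer size makes the screen more sensitive but also more prone to chance matches, while a higher cutoff demands more evidence per read.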
Example:
This script analyzes the coverage of RNA-seq reads across transcripts. The results allow conclusions about the quality of the processed sample.
Usage1
python3 read_distr_checker.py --bam <FILE> --gff <FILE> --out <DIR>
Usage2
python3 read_distr_checker.py --cov <FILE> --gff <FILE> --out <DIR>
Mandatory:
--bam STR BAM input file | --cov STR Coverage input file
--gff STR GFF input file
--out STR Output folder
Optional:
--sample STR Sample name
--samtools STR Path to samtools [samtools]
--bedtools STR Path to bedtools [bedtools]
--chunks INT Number of chunks [100]
--minexpcut INT Minimal coverage [100]
--bam
specifies a BAM input file. This will be converted into a coverage file (COV) to analyze the distribution of reads across transcripts. This argument can also be used to provide a comma-separated list of files for automatic processing of large batches of files.
--cov
specifies a COV input file. This file is the basis to analyze the distribution of reads across transcripts. This argument can also be used to provide a comma-separated list of files for automatic processing of large batches of files.
--gff
specifies a GFF input file. This file is processed to identify the positions of the exons that form a transcript.
--out
specifies an output folder. If this folder does not exist, it will be created.
--sample
specifies sample names. This argument can also be used to provide a comma-separated list of sample names for automatic processing of large batches of files. The length of this list should match the number of BAM/COV files provided.
--samtools
specifies the full path to samtools. Default: samtools.
--bedtools
specifies the full path to genomeCoverageBed. Default: genomeCoverageBed.
--chunks
specifies the number of chunks to create for each transcript when assessing the coverage. Default: 100. Only transcripts whose length is at least the number of chunks are considered for the analysis.
--minexpcut
specifies the minimal total number of sequenced bases. Only transcripts with at least this number of sequenced bases are considered for the analysis. Default: 100.
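To illustrate how --chunks and --minexpcut interact, the following sketch splits the per-base coverage of one transcript into chunks and applies both filters. It is not taken from read_distr_checker.py: the function name is hypothetical, and the use of the mean coverage per chunk is an assumption.

```python
def chunk_coverage(per_base_cov, chunks=100, minexpcut=100):
    """Return the mean coverage per chunk, or None if the transcript is filtered out.

    per_base_cov: list of coverage values, one per transcript position.
    """
    length = len(per_base_cov)
    # Transcripts shorter than the number of chunks cannot be split.
    if length < chunks:
        return None
    # Transcripts with too few sequenced bases in total are skipped.
    if sum(per_base_cov) < minexpcut:
        return None
    means = []
    for i in range(chunks):
        start = i * length // chunks
        end = (i + 1) * length // chunks
        window = per_base_cov[start:end]
        means.append(sum(window) / len(window))
    return means
```

Because every transcript is reduced to the same number of chunks, coverage profiles of transcripts with different lengths become comparable, for example to look for a bias towards one transcript end.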
Python3 with standard modules is required to run this analysis. Plotting is based on matplotlib and seaborn.
sudo apt update
sudo apt install python3
sudo pip3 install matplotlib
sudo pip3 install seaborn
Samtools is needed to handle BAM files.
sudo apt update
sudo apt install samtools
Bedtools is needed to calculate coverage values.
sudo apt update
sudo apt install bedtools
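After installation, a quick check can confirm that the tools and modules named above are available. This is a minimal sketch, assuming the executables are reachable on the PATH under the names samtools and bedtools; the function name is hypothetical.

```python
import importlib.util
import shutil


def check_dependencies(tools=("samtools", "bedtools"),
                       modules=("matplotlib", "seaborn")):
    """Return a dict mapping each tool or module name to True if it was found."""
    status = {}
    for tool in tools:
        # shutil.which returns None if the executable is not on the PATH.
        status[tool] = shutil.which(tool) is not None
    for module in modules:
        # find_spec returns None if the module cannot be imported.
        status[module] = importlib.util.find_spec(module) is not None
    return status


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(name, "OK" if found else "MISSING")
```

If a tool is installed outside the PATH, its location can be passed to the script via --samtools or --bedtools instead.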