CNproScan is an R package for CNV detection in bacterial genomes. It applies the Generalized Extreme Studentized Deviate (GESD) test for outliers to read-depth data to detect CNVs, and uses discordant-read detection to annotate the CNV type.
The older, no longer updated Matlab version is available at https://github.com/robinjugas/CNproScanMatlab
This is the latest version, v1.0. For previous versions, see Tags/Releases.
The package was tested on R 4.x and depends on: parallel, foreach, doParallel, seqinr, Rsamtools, GenomicRanges, IRanges, data.table.
devtools::install_github("robinjugas/CNproScan")
Several input files are necessary:
- reference sequence FASTA file used in the read alignment
- sorted and indexed BAM file from the read-aligner (BWA assumed)
- coverage file from samtools depth (including zero-depth positions, -a switch)
- genome mappability file, obtained with GenMap (https://github.com/cpockrandt/genmap); needed only for mappability normalization
- origin of replication position(s), obtained from DoriC (https://origin.tubic.org/doric/browse/bacteria); needed only for oriC normalization
bwa index -a is reference.fasta
samtools faidx reference.fasta
bwa mem reference.fasta read1.fq read2.fq > file.sam
samtools view -b -F 4 file.sam > file1.bam # mapped reads only, unsorted
samtools sort -o file.bam file1.bam # sorted output used in the following steps
samtools index file.bam
samtools depth -a file.bam > file.coverage
genmap index -F reference.fasta -I mapp_index
genmap map -K 30 -E 2 -I mapp_index -O mapp_genmap -t -w -bg
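The coverage file is expected to contain one row per reference position, zeros included. As a quick sanity check, the per-contig row counts in file.coverage can be compared with the sequence lengths in reference.fasta.fai (written by samtools faidx). This is a minimal sketch; check_coverage_complete is a hypothetical helper, not part of CNproScan or samtools.

```shell
# Hypothetical helper (not part of CNproScan or samtools).
# usage: check_coverage_complete reference.fasta.fai file.coverage
# Returns non-zero and reports each contig whose number of coverage rows
# differs from its length in the .fai index.
check_coverage_complete() {
    awk -F '\t' '
        NR == FNR { len[$1] = $2; next }   # first file: contig name and length
        { n[$1]++ }                        # second file: count rows per contig
        END {
            bad = 0
            for (c in len) {
                if ((n[c] + 0) != (len[c] + 0)) {
                    printf "%s: %d rows in coverage file, expected %d\n", c, n[c], len[c]
                    bad = 1
                }
            }
            exit bad
        }' "$1" "$2"
}

# Example call with the file names used above:
# check_coverage_complete reference.fasta.fai file.coverage
```

A contig with no aligned reads at all may be missing from the output of samtools depth -a (the -aa form prints absolutely all positions), in which case it is reported here as a mismatch.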
R script:
library("CNproScan")
# Working directory with files
setwd("workdir")
# File paths
fasta_file <- "reference.fasta"
bam_file <- "file.bam"
coverage_file <- "file.coverage"
bedgraph_file <- "mapp_genmap.bedgraph"
# GC normalization only
DF <- CNproScanCNV(coverage_file, bam_file, fasta_file,
GCnorm=TRUE, MAPnorm=FALSE, ORICnorm=FALSE, cores=4)
# Without any normalization
DF <- CNproScanCNV(coverage_file, bam_file, fasta_file,
GCnorm=FALSE, MAPnorm=FALSE, ORICnorm=FALSE, cores=4)
# GC normalization and mappability normalization
DF <- CNproScanCNV(coverage_file, bam_file, fasta_file,
GCnorm=TRUE, MAPnorm=TRUE, ORICnorm=FALSE, bedgraph_file=bedgraph_file, cores=4)
# GC normalization, mappability normalization and oriC normalization
DF <- CNproScanCNV(coverage_file, bam_file, fasta_file,
GCnorm=TRUE, MAPnorm=TRUE, ORICnorm=TRUE, bedgraph_file=bedgraph_file, oriCposition=1, cores=4)
# or with multiple oriC positions
DF <- CNproScanCNV(coverage_file, bam_file, fasta_file,
GCnorm=TRUE, MAPnorm=TRUE, ORICnorm=TRUE, bedgraph_file=bedgraph_file, oriCposition=c(10,5000), cores=4)
Caution: oriC normalization works only in single-chromosome mode!
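Before enabling ORICnorm, it may help to confirm that the reference really contains a single sequence. The sketch below counts FASTA header lines; both function names are hypothetical helpers and not part of CNproScan.

```shell
# Hypothetical helpers (not part of CNproScan).

# Print the number of sequences (header lines) in a FASTA file.
count_fasta_sequences() {
    grep -c '^>' "$1"
}

# Succeed only when the FASTA file holds exactly one sequence.
is_single_chromosome() {
    [ "$(count_fasta_sequences "$1")" -eq 1 ]
}

# Example call with the reference used above:
# if is_single_chromosome reference.fasta; then
#     echo "single sequence: oriC normalization can be used"
# else
#     echo "multiple sequences: run with ORICnorm=FALSE"
# fi
```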
# Write VCF file (additional function from the package)
writeVCF(DF, "fileName.vcf")
# Write a tab-separated file (optional)
write.table(DF, file = "TSVfile.tsv", row.names=FALSE, col.names = TRUE, sep="\t")
- coverage_file = path to the .coverage file
- bam_file = path to the .bam file
- fasta_file = path to the .fasta file
- GCnorm = TRUE/FALSE whether to do GC bias normalization
- MAPnorm = TRUE/FALSE whether to do mappability normalization
- bedgraph_file = path to the bedgraph file produced by GenMap
- ORICnorm = TRUE/FALSE whether to do origin of replication bias normalization
- oriCposition = single integer or vector c() with multiple oriC locations
- cores = number of threads used by foreach %dopar%; 2 or more is recommended
- data frame containing the detected CNVs
- VCF file named sample_cnproscan.vcf in the working directory
- Tweaked CNV detection
- Tweaked identification of CNV type
- GC and mappability normalization modified and tweaked
- new oriC normalization
- VCF output
- multi chromosome/contig support
BWA ignores the rest of a FASTA header after the first whitespace. CNproScan expects the sequence names to be identical everywhere: the FASTA headers, the BAM RNAME values and the samtools coverage file must all contain the same contig/chromosome names. The package uses seqinr::read.fasta with whole.header = FALSE, which crops each header at the first whitespace. If this behaviour causes a problem, please report it as a GitHub issue.
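The name consistency described above can be checked before running the package. The following is a minimal sketch that compares FASTA names (cropped at the first whitespace) with the names in the first column of the coverage file; the three functions are hypothetical helpers, not part of CNproScan.

```shell
# Hypothetical helpers (not part of CNproScan).

# Print FASTA sequence names, cropped at the first whitespace.
fasta_names() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | sort -u
}

# Print the contig names found in column 1 of a samtools depth file.
coverage_names() {
    cut -f1 "$1" | sort -u
}

# usage: check_contig_names reference.fasta file.coverage
# Succeeds only when both files list exactly the same set of names.
check_contig_names() {
    if [ "$(fasta_names "$1")" = "$(coverage_names "$2")" ]; then
        echo "contig names match"
    else
        echo "contig names differ between $1 and $2" >&2
        return 1
    fi
}

# BAM reference names could be compared in the same way, for example with
#   samtools idxstats file.bam | cut -f1
# (the final "*" row stands for unplaced reads and should be ignored).
```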
Robin Jugas, Karel Sedlar, Martin Vitek, Marketa Nykrynova, Vojtech Barton, Matej Bezdicek, Martina Lengerova, Helena Skutkova, CNproScan: Hybrid CNV detection for bacterial genomes, Genomics, Volume 113, Issue 5, 2021, Pages 3103-3111, ISSN 0888-7543, https://doi.org/10.1016/j.ygeno.2021.06.040. (https://www.sciencedirect.com/science/article/pii/S0888754321002779)