Compilation requires sbt (tested with version 1.1.2). Running requires Apache Spark 2.4 built against Scala 2.11 (tested with Spark 2.4.3).
cd bigwig
sbt clean assembly
spark-submit --executor-cores 12 --executor-memory 60G --class genomeqc.CoverageMain genomeqc-assembly-0.1.jar "/data/*.bam" 10 /data/output
- Path to the input BAM file or files. The quotes are necessary so that the asterisk is expanded by Spark, not by the shell.
- Coverage threshold. All regions with coverage below this value will be considered low coverage regions.
- Path to the output directory where the result files will be written.
This program takes one or more sorted BAM files. For every input file it computes coverage, finds the regions with coverage below the specified value, and saves those regions to a BED file (one BED file per input BAM).
Afterwards it computes the intersection of the low coverage regions from all input files (regions that have low coverage in every BAM file) and saves it to a BED file.
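The two steps described above can be sketched in a few lines of Python. This is only an illustration, not the project's Spark implementation: the function names, the per-base depth lists, and the half-open `[start, end)` coordinates (as in BED) are assumptions made for the example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), as in BED


def low_coverage_regions(depth: List[int], threshold: int) -> List[Interval]:
    """Return maximal runs of positions whose depth is below the threshold."""
    regions: List[Interval] = []
    start = None
    for pos, d in enumerate(depth):
        if d < threshold:
            if start is None:
                start = pos  # a new low coverage run begins here
        elif start is not None:
            regions.append((start, pos))  # the run ended just before pos
            start = None
    if start is not None:
        regions.append((start, len(depth)))  # run extends to the end
    return regions


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two sorted lists of non-overlapping intervals."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical per-base depths for two samples on the same contig.
sample_a = [12, 3, 2, 15, 20, 1, 0, 4]
sample_b = [30, 9, 11, 25, 2, 3, 8, 40]

low_a = low_coverage_regions(sample_a, 10)
low_b = low_coverage_regions(sample_b, 10)
common = intersect(low_a, low_b)  # low coverage in both samples
```

With more than two input files, the same `intersect` call could be folded over the per-file lists one after another.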
spark-submit --executor-cores 12 --executor-memory 60G --class genomeqc.Mappability genomeqc-assembly-0.1.jar "/data/*.bam" 10 /data/output
- Path to the input BAM file or files. The quotes are necessary so that the asterisk is expanded by Spark, not by the shell.
- Coverage threshold. All regions with coverage below this value will be considered low coverage regions.
- Path to the BigWig file with the mappability track.
- Mappability threshold. All regions with a score less than this value will be considered low mappability regions.
- Path to the output directory where the result files will be written.
This program finds the intersection of the low coverage regions from all BAM files with the low mappability regions and saves it to a BED file.
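As a rough sketch of how low mappability regions might be derived from a scored track, the Python below keeps the intervals whose score is below the threshold and merges the ones that touch. The `(start, end, score)` tuples, the function name, and the threshold of 0.6 are assumptions for illustration; reading the BigWig file itself is not shown.

```python
from typing import List, Tuple

ScoredInterval = Tuple[int, int, float]  # (start, end, score), half-open
Interval = Tuple[int, int]


def low_mappability_regions(
    track: List[ScoredInterval], threshold: float
) -> List[Interval]:
    """Keep intervals scoring below the threshold, merging adjacent ones.

    Assumes the track is sorted by start and its intervals do not overlap.
    """
    regions: List[Interval] = []
    for start, end, score in track:
        if score >= threshold:
            continue  # mappable enough, not part of the result
        if regions and regions[-1][1] == start:
            # Touches the previous low mappability interval: extend it.
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((start, end))
    return regions


# Hypothetical track values for one contig.
track = [
    (0, 100, 1.0),
    (100, 150, 0.5),
    (150, 220, 0.25),
    (220, 300, 1.0),
    (300, 320, 0.33),
]
low_map = low_mappability_regions(track, 0.6)
```

The resulting list has the same shape as a list of low coverage regions, so the two can be intersected interval by interval.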
spark-submit --executor-cores 12 --executor-memory 60G --class genomeqc.LowCoveredGenesMain genomeqc-assembly-0.1.jar "/data/*.bam" 10 /data/homo_sapiens.gtf 0.6 /data/output
- Path to the input BAM file or files. The quotes are necessary so that the asterisk is expanded by Spark, not by the shell.
- Coverage threshold. All regions with coverage below this value will be considered low coverage regions.
- Path to the GTF file.
- Gene intersection ratio threshold. If low coverage regions make up at least (100 * threshold) % of a gene, the gene is included in the result.
- Path to the output directory where the result files will be written.
This program finds the intersection of the low coverage regions from all BAM files and combines it with gene data.
For every gene it calculates the percentage of low-coverage nucleotides and saves the genes with a percentage above the specified threshold to a BED file.
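The ratio calculation can be illustrated with a small Python sketch. It is not the project's code: the gene names, coordinates, and function names are hypothetical, intervals are half-open (GTF coordinates would need converting), and the low coverage regions are assumed not to overlap each other. The comparison uses "at least", following the parameter description.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def covered_length(gene: Interval, regions: List[Interval]) -> int:
    """Number of bases of the gene that fall inside any of the regions."""
    g_start, g_end = gene
    total = 0
    for r_start, r_end in regions:
        overlap = min(g_end, r_end) - max(g_start, r_start)
        if overlap > 0:
            total += overlap
    return total


def low_covered_genes(
    genes: Dict[str, Interval],
    regions: List[Interval],
    ratio_threshold: float,
) -> Dict[str, float]:
    """Map gene name to low coverage ratio for genes reaching the threshold."""
    result: Dict[str, float] = {}
    for name, (start, end) in genes.items():
        ratio = covered_length((start, end), regions) / (end - start)
        if ratio >= ratio_threshold:
            result[name] = ratio
    return result


# Hypothetical low coverage regions and genes on one contig.
low_regions = [(100, 175), (300, 400)]
genes = {"geneA": (100, 200), "geneB": (250, 450), "geneC": (390, 400)}

selected = low_covered_genes(genes, low_regions, 0.6)
```

Here `geneB` is dropped because only half of it lies in low coverage regions, which is below the 0.6 threshold.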
spark-submit --executor-cores 12 --executor-memory 60G --class genomeqc.GeneSummaryMain genomeqc-assembly-0.1.jar "/data/*.bam" 10 /data/homo_sapiens.gtf /data/output
- Path to the input BAM file or files. The quotes are necessary so that the asterisk is expanded by Spark, not by the shell.
- Coverage threshold. All regions with coverage below this value will be considered low coverage regions.
- Path to the GTF file.
- Path to the output directory where the result files will be written.
This program finds the intersection of the low coverage regions from all BAM files and combines it with gene data.
For every gene it calculates the percentage of low-coverage nucleotides and saves it to a BED file.
spark-submit --executor-cores 12 --executor-memory 60G --class genomeqc.HighCoveredBaitsMain genomeqc-assembly-0.1.jar "/data/*.bam" 60 /data/covered.bed /data/output
- Path to the input BAM file or files. The quotes are necessary so that the asterisk is expanded by Spark, not by the shell.
- Coverage threshold. All regions with coverage above this value will be considered high coverage regions.
- Path to the BED file with baits.
- Path to the output directory where the result files will be written.
This program finds the intersection of the high coverage regions from all BAM files and intersects it again with the bait regions.
This intersection is saved to one BED file; a second BED file receives only those baits that are entirely covered by high coverage regions.
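The "entirely high covered" check for the second output file amounts to a containment test. The Python below is a minimal sketch under stated assumptions: coordinates are half-open, the high coverage regions are already merged into maximal intervals (so a fully covered bait lies inside exactly one of them), and the function name and sample values are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def fully_covered_baits(
    baits: List[Interval], high_regions: List[Interval]
) -> List[Interval]:
    """Return the baits that lie entirely inside one high coverage region.

    Assumes high_regions are merged, i.e. no two of them overlap or touch.
    """
    return [
        (b_start, b_end)
        for b_start, b_end in baits
        if any(
            r_start <= b_start and b_end <= r_end
            for r_start, r_end in high_regions
        )
    ]


# Hypothetical high coverage regions and baits on one contig.
high_regions = [(0, 500), (800, 1200)]
baits = [(100, 200), (450, 550), (900, 1200)]

covered = fully_covered_baits(baits, high_regions)
```

The bait `(450, 550)` is only partly covered, so it would contribute to the intersection file but not to the file of entirely covered baits.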
spark-submit --executor-cores 12 --executor-memory 60G --class genomeqc.ExonSummaryMain genomeqc-assembly-0.1.jar "/data/*.bam" 60 /data/homo_sapiens.gtf /data/output
- Path to the input BAM file or files. The quotes are necessary so that the asterisk is expanded by Spark, not by the shell.
- Coverage threshold. All regions with coverage above this value will be considered high coverage regions.
- Path to the GTF file.
- Path to the output directory where the result files will be written.
This program finds the intersection of the low coverage regions from all BAM files and combines it with gene data.
For every exon it calculates the percentage of low-coverage nucleotides and saves the genes with a percentage above the specified threshold to a BED file.
- Genome BAM file https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG003_NA24149_father/10XGenomics/NA24149_phased_possorted_bam.bam
- Exome BAM file ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG003_NA24149_father/OsloUniversityHospital_Exome/151002_7001448_0359_AC7F6GANXX_Sample_HG003-EEogPU_v02-KIT-Av5_TCTTCACA_L008.posiSrt.markDup.bam
- GTF file ftp://ftp.ensembl.org/pub/grch37/current/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.chr.gtf.gz
- Mappability regions file http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeMapability/wgEncodeCrgMapabilityAlign100mer.bigWig
- BED file with baits ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/wex_Agilent_SureSelect_v05_b37.baits.slop50.merged.list