/mugio

Inverted Repeat Junction identifier for use with ONT (MinION) sequencing.

Primary LanguagePythonMIT LicenseMIT

mugio

Inverted Repeat Junction identifier for use with ONT (MinION) sequencing.

Mugio is intended to aid in the identification of potential inverted repeats in seqeuncing data from Oxford-Nanopore Technologies Nanopore seqeuncing platform (MinION, PromethION, GridION, etc.).

Important note for NYU HPC users: You can load all necessary modules via:

source demo/module_load.sh

Citation

Spealman P, Burrell J, Gresham D. Inverted duplicate DNA sequences increase translocation rates through sequencing nanopores resulting in reduced base calling accuracy. Nucleic Acids Res. 2020 May 21;48(9):4940-4945. doi: 10.1093/nar/gkaa206. PMID: 32255181; PMCID: PMC7229812.

Requirements

Materials

Mugio reguires the user supplied data:

  • Fastq generated by Albacore or Guppy
  • Aligned bam file generated by Minimap2

Mugio reguires the following programs:

  • gzip 1.5+
  • python 2.7+ or 3.6+
  • samtools 1.6+
  • bedtools 2.26.0+
  • (optional) The --blast command requires blast+ 2.9.0+

Mugio reguires the following python packages:

  • numpy
  • scipy.stats
  • matplotlib
  • pandas

Installation

Mugio is a stand alone python script as such it can be run locally by merely downloading the script. Installation through git clone is the preferred method. To download

git clone https://github.com/pspealman/mugio.git

To test installation

cd mugio
python mugio.py -demo

Quick Start pipeline for inverted repeat junction identification and evaluation

Command --bprd (breakpoint retrieval and definition)

  • Purpose: Identifies loci likely to be inverted repeat junctions associated with inverted duplications.
  • Format:
python mugio.py -bprd -f <fastq_file> -s <sam_file> -bam <bam_file> -o <output_path_and/or_file_prefix>
  • Demo:
python mugio.py -bprd -f demo/demo.fastq -s demo/demo.sam -bam demo/demo.bam -o demo_output/demo_bprd
  • Results: Identified likely candidates are recorded in the out_put path as a bed file with the suffix '_bprd'. Therefore if '-o demo_output/demo_bprd' the results will be stored in 'demo_output/demo_bprd_bprd.bed'

Command --evaluate

  • Purpose: Calculates the correlation (Spearman's rho) between pre-breakpoint seqeunce length and post-breakpoint low scoring region length.
  • Format:
python mugio.py --evaluate [-bpf bprd.bed | -snf sniffles.vcf] -s <sam_file> -o <output_path_and/or_file_prefix>
  • Demo:
python mugio.py --evaluate -bpf demo_output/demo_bprd_bprd.bed -f demo/demo.fastq -s demo/demo.sam -o demo/demo_bprd_lengths
  • Results: Identified candidates that have closed low-phred regions with trace figures generated with low scoring regions identified. These will be saved in the out_path path as sub folders named after the inverted repeat junctions coordinates.

Optional modes

View phred score trace of a read (--plot_phred or -pp)

  • Purpose: Generate a figure showing the Phred score of each nucleotide (blue) as well as the median phread score calculated over a 1K window (orange).
  • Format (for a single read):
python mugio.py -pp -f <fastq_file> -uid <read uid> -o <output_path_and/or_file_name>
  • Demo:
python mugio.py -pp -f demo/demo.fastq -uid c1d4f6dc-cc1f-4431-a6a0-1b9f5109342c -o c1d4f6dc.png
  • Format (for all reads in a fastq):
python mugio.py -pp -f <fastq_file> -o <output_path_and/or_file_name>
  • Demo:
python mugio.py -pp -f demo/demo.fastq -o demo_
  • Results: Figure showing and overlay of the immediate phred score and a trace of the median phred score across a rolling window. Note: the because the window is 1000 nt (by defualt) the median trace will stop 100 nt before the end of the phred score trace.

Calculate chromosome and genome coverage (--coverage_calculator or -cc)

  • Purpose: Generate a table with read depth metircs such as median depth, standard deviation, genome wide relative median depth, etc.
  • Format:
python mugio.py -cc [-filter <chromo_name>] -f <fastq_file> -s <sam_file> -o <output_path_and/or_file_prefix>
  • Demo:
python mugio.py -cc -filter NC_001224.1 -f ont_DGY1657/BC01_1657.fastq -s ont_DGY1657/BC01_1657.sam -o ont_DGY1657/cc_BC01_1657
  • Results: Tab-delimited file containing rows for each scaffold, chromosome, as well as total (assumbed to be the genome). Columns are various metrics of the aligned read-depth.

Generate a discordant read bam file (--get_discordant or -get)

  • Purpse: Ge
  • Format:
python mugio.py --get -f <sample_fastq> -s <sample_aligned.sam> -pct <float> -o <output_prefix>
  • Demo:
python mugio.py --get_discordant -f fastq/DGY1726.fastq -s bam/DNA_DGY1726.sam -pct 0 -o mugio/DGY1726_1
  • Results: Any read that is over some percentage (-pct) of non-mapping nuceltotides will get extracted and resolved into a bam alignment file.