Benchmarking of pangenome sequence graph construction tools

This repository contains code used in the Master's Thesis titled:

'Functional and computational benchmarking of pangenome graph construction tools and development of a graph-based bacterial gene caller.'

Requirements

  • python3
  • biopython
  • numpy
  • R version 3+

Usage

analyse_gaf_genic.py

Analyse the proportion of genic read bases aligned by Graphaligner.

python analyse_gaf_genic.py gaffile blastfile reads.fa outfile.txt

Input/Output:

  • gaffile: graphical alignment file (GAF) produced by Graphaligner
  • blastfile: BLAST output file in tabular format generated from exact alignment of gene sequences from Lees et al. (aligned files are '*_dna.fa') to respective simulated genomes (blastn -outfmt 6 -perc_identity 100 -qcov_hsp_perc 100)
  • reads.fa: read sequences used in alignment in FASTA format, generated by Nanosim-H
  • outfile.txt: output summary file

analyse_gaf_total.py

Analyse the proportion of total read bases aligned by GraphAligner.

python analyse_gaf_total.py gaffile outfile.txt

Input/Output

  • gaffile: graphical alignment file produced by Graphaligner
  • outfile.txt: output summary file

analyse_nodes.py

Analyse node length and degree from a GFA file.

python analyse_nodes.py gfafile outpref

Input/Output

  • gfafile: GFA file produced from graph construction
  • outpref: output prefix for distribution and summary files

analyse_unitig.py

Analyse unitig frequencies from a Bifrost graph.

python analyse_unitig.py gfa_dist.txt colours.tsv output

Input/Output

  • gfa_dist.txt: Distribution file generated from analyse_nodes.py
  • colours.tsv: TSV file produced from Bifrost query by querying constituent unitigs against GFA itself
  • output: output file of results

bifrost_unitig_freq.R

R script for analysis of output files from analyse_untig.py, generating unitig frequency plot.

To use, specify input directory at #input directory, with files in .txt format.

check_ORF.py

This script contains two functions for analysing ORF calls made by a gene caller.

check_ORF_in_ref()

Checks presence of a called ORF in forward and reverse complements of a set of reference source sequences.

(within python REPL) check_ORF_in_ref(ref_fasta_for, ref_fasta_rev, query_fasta, outfasta)

Input/Output

  • ref_fasta_for: Multi-FASTA of reference source sequences (forward strand)
  • ref_fasta_rev: Multi-FASTA of reference source sequences (reverse strand)
  • query_fasta: ORF calls in FASTA format
  • outfasta: output FASTA containing ORFs not present in forward or reverse sequences

check_ref_in_query()

Checks presence of a known gene in longer called ORFs.

(within python REPL) check_ref_in_query(ref_fasta, query_fasta, outfile)

Input/Output

  • ref_fasta: Multi-FASTA of known genes
  • query_fasta: ORF calls in FASTA format
  • outfile: output FASTA containing known genes not found in any called ORFs

compare_gene_calls.py

Compares known genes against called ORFs by Prodigal or ggCaller in S. pneumoniae capsular biosynthetic loci (CBL). Prints recall and precision, and returns list of unmatched sequences.

python compare_gene_calls.py reference_genes gene_calls caller_type group

Input/Output

  • reference_genes: known genes in FASTA format
  • gene_calls: ORF calls by Prodigal or ggCaller in FASTA format
  • caller_type: specify which gene caller used (ggCaller = ggc, Prodigal = prod)
  • group: CBL group used in comparison.

gfa_to_fasta.py

Creates FASTA of unitigs from GFA files

(within python REPL) gfa_to_fasta(gfafile)

Input/Output

  • gfafile: GFA file generated from graph construction
  • output is FASTA with same file prefix as gfafile

panaroo_gene_freq.R

R script for analysis of RTAB file generated by Panaroo.

For usage, specify input directory at #input directory, with files in .RTAB format.