match_chewBBACA_to_Enterobase

Scripts to match chewBBACA output to Enterobase cgMLST and hierarchical CCs

Primary language: Python. License: MIT.

Using chewBBACA typing output to find corresponding Enterobase cgMLST and hierCC

Introduction

ChewBBACA is a "comprehensive pipeline including a set of functions for the creation and validation of whole genome and core genome MultiLocus Sequence Typing (wg/cgMLST) schemas". The Enterobase cgMLST + hierCC scheme can be downloaded through the PubMLST API and used in chewBBACA. However, matching the chewBBACA output back to the Enterobase typing scheme requires some extra scripts, which are provided here on an experimental basis.

Methods

The main script loads chewBBACA typing output (which may contain allelic profiles of multiple isolates) and matches these against the Enterobase cgMLST allelic profiles. Because there are more than 160,000 cgMLSTs with allelic profiles, the profiles file is read in chunks (currently fixed at 10,000 lines). The script takes approximately 4 per isolate.
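
The chunked matching described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the repository's actual code: the function names `iter_chunks` and `best_match`, the tab-separated layout, and the `ST` column name are assumptions made for the example.

```python
import csv
import io
from itertools import islice


def iter_chunks(reader, chunk_size=10_000):
    """Yield lists of at most `chunk_size` rows from a csv reader."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def best_match(isolate_profile, profiles_handle, chunk_size=10_000):
    """Return (ST, number of matching alleles) for the best-matching profile.

    isolate_profile: dict mapping locus name -> allele, both as strings.
    profiles_handle: open tab-separated file with an assumed 'ST' column
    followed by one column per locus.
    """
    reader = csv.DictReader(profiles_handle, delimiter="\t")
    best_st, best_count = None, -1
    for chunk in iter_chunks(reader, chunk_size):
        for row in chunk:
            # Count loci where the isolate's allele equals the profile's allele
            count = sum(
                1
                for locus, allele in isolate_profile.items()
                if row.get(locus) == allele
            )
            # Strict '>' keeps the first profile seen when counts are tied
            if count > best_count:
                best_st, best_count = row["ST"], count
    return best_st, best_count


# Tiny in-memory stand-in for the real profiles file (made-up values)
profiles = io.StringIO(
    "ST\tlocus1\tlocus2\tlocus3\n"
    "1\t1\t1\t1\n"
    "2\t1\t2\t3\n"
    "3\t4\t2\t3\n"
)
isolate = {"locus1": "1", "locus2": "2", "locus3": "3"}
result = best_match(isolate, profiles, chunk_size=2)
print(result)
```

Reading a bounded number of rows at a time keeps memory use roughly constant regardless of how many profiles the file holds, at the cost of one pass over the file per isolate in this simplified form.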

Results

The script writes its output to a CSV file containing the following columns:

  • Isolate name
  • Number of matching alleles, number of loci in the cgMLST typing scheme, and the difference between the two, which represents the maximum number of mismatches
  • cgMLST
  • Various HierCC levels (HC0, HC2, HC5, HC10, HC20, HC50, HC100, HC200, HC400, HC1100, HC1500, HC2000, HC2350)
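
As a rough illustration of this layout, the sketch below writes one result row with `csv.DictWriter`. The column names (`isolate`, `matching_alleles`, `total_loci`, `max_mismatches`, `cgMLST`) and the helper functions are hypothetical and may not match the header the script actually produces; only the HierCC level names are taken from the list above, and all values are made up.

```python
import csv
import io

HIERCC_LEVELS = [
    "HC0", "HC2", "HC5", "HC10", "HC20", "HC50", "HC100",
    "HC200", "HC400", "HC1100", "HC1500", "HC2000", "HC2350",
]

# Hypothetical column names; the script's real header may differ
FIELDNAMES = [
    "isolate", "matching_alleles", "total_loci", "max_mismatches", "cgMLST",
] + HIERCC_LEVELS


def make_row(isolate, n_matching, n_loci, cgmlst, hiercc):
    """Build one output row; max mismatches = total loci - matching alleles."""
    row = {
        "isolate": isolate,
        "matching_alleles": n_matching,
        "total_loci": n_loci,
        "max_mismatches": n_loci - n_matching,
        "cgMLST": cgmlst,
    }
    row.update({level: hiercc[level] for level in HIERCC_LEVELS})
    return row


def write_results(rows, handle):
    """Write a header line followed by one line per isolate."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Made-up example values, for illustration only
hiercc = {level: "7" for level in HIERCC_LEVELS}
example_row = make_row("isolate_A", 97, 100, "12345", hiercc)
buffer = io.StringIO()
write_results([example_row], buffer)
output_lines = buffer.getvalue().splitlines()
print(output_lines[0])
```

The difference column is an upper bound on mismatches because a locus that is missing or uncalled in the isolate counts as non-matching without being a confirmed allelic difference.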

What I learned or plan to learn

  • Writing numpy style docstrings
  • Writing a simple decorator to find out which functions were slowest
  • Using GitHub Actions
  • Some additional experience with writing efficient functions
  • How to use the PubMLST API efficiently
  • Unit testing
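
A timing decorator of the kind mentioned in the list above might look like the following sketch. It is not the decorator used in the repository; the names `timed`, `TIMINGS`, and `count_matches` are invented for the example.

```python
import functools
import time

# Accumulated wall-clock time per function name, in seconds
TIMINGS = {}


def timed(func):
    """Record the cumulative run time of `func` in TIMINGS."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even if func raises, so failed calls are timed too
            elapsed = time.perf_counter() - start
            TIMINGS[func.__name__] = TIMINGS.get(func.__name__, 0.0) + elapsed

    return wrapper


@timed
def count_matches(profile_a, profile_b):
    """Count positions where two allelic profiles agree."""
    return sum(a == b for a, b in zip(profile_a, profile_b))


n = count_matches([1, 2, 3], [1, 2, 4])

# Sort so the slowest function comes first
slowest_first = sorted(TIMINGS.items(), key=lambda item: item[1], reverse=True)
print(n, slowest_first[0][0])
```

Decorating each candidate function and inspecting `slowest_first` after a run gives a quick, coarse view of where time is spent; a profiler such as `cProfile` gives finer detail.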

To do

  • Add scripts to download Enterobase cgMLST alleles through the PubMLST API
  • Check whether there is an appropriate way to update the profiles and the ST-to-hierCC table from Enterobase (keeping fair usage in mind)
  • Implement pytest or other testing framework
  • Improve the correspondence between Enterobase cgMLST typing scheme and chewBBACA. Currently, a lot of alleles are removed in the PrepExternalSchema step of chewBBACA.
  • Improve the speed of comparisons. This is currently done by comparing chunks of the profiles file against the isolate's allelic profile, but this is by far the slowest step in the script.