pha4ge/hAMRonization

[BUG] - RGI bwt gene_mapping

Describe the bug

I believe there is a format incompatibility with RGI bwt output (--include_wildcards). When a row reports multiple reference gene lengths, the field no longer parses as a single integer and parsing of the file fails.

Input
hamronize rgi 11J_S19.gene_mapping_data.txt --analysis_software_version 6.0.1 --reference_database_version 3.2.8 --input_file_name rgi --format tsv

Input file

ARO Term ARO Accession Reference Model Type Reference DB Alleles with Mapped Reads Reference Allele(s) Identity to CARD Reference Protein (%) Resistomes & Variants: Observed in Genome(s) Resistomes & Variants: Observed in Plasmid(s) Resistomes & Variants: Observed Pathogen(s) Completely Mapped Reads Mapped Reads with Flanking Sequence All Mapped Reads Average Percent Coverage Average Length Coverage (bp) Average MAPQ (Completely Mapped Reads) Number of Mapped Baits Number of Mapped Baits with Reads Average Number of reads per Bait Number of reads per Bait Coefficient of Variation (%) Number of reads mapping to baits and mapping to complete gene Number of reads mapping to baits and mapping to complete gene (%) Mate Pair Linkage (# reads) Reference Length
Bifidobacterium adolescentis rpoB mutants conferring resistance to rifampicin 3004480 protein homolog model CARD; Resistomes & Variants 15 91.39 - 100.0 YES no data Bifidobacterium adolescentis; Bifidobacterium animalis; Bifidobacterium bifidum; Bifidobacterium longum; Bifidobacterium thermophilum; Gardnerella vaginalis 668 0 668 34.84 1249.47 116.48 0 0 0 0 N/A N/A 3561; 3561; 3561; 3564; 3684; 3633; 3633; 3564; 3564; 3564; 3564; 3570; 3570; 3570; 3567 rifamycin-resistant beta-subunit of RNA polymerase (rpoB) rifamycin antibiotic antibiotic target alteration; antibiotic target replacement

Error log
ValueError: Expected reference_gene_length to be <class 'int'>, got '3561; 3564; 3570'

hAMRonization Version
v1.14

Desktop (please complete the following information):
Ubuntu 20.04

Additional context
latest pull from conda.

The "Reference Allele(s) Identity to CARD Reference Protein (%)" column also causes an issue, as it is reported as a range in some instances: ValueError: Expected sequence_identity to be <class 'float'>, got '92.82 - 100.0'

I changed the expected values in hAMRonizedResult.py to str instead of int and float respectively. Not sure if this breaks anything downstream yet but it allows for execution on the output so far.
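An alternative to switching the fields to str might be to normalize the raw values before the type check runs. Below is a minimal sketch, assuming list fields are separated by ";" and ranges are written as "low - high" with spaces around the dash, as in the two error messages above. The helper names parse_int_list and parse_float_range are hypothetical and are not part of hAMRonization.

```python
from typing import List, Tuple


def parse_int_list(raw: str) -> List[int]:
    """Split a ';'-separated field such as '3561; 3564; 3570' into sorted unique ints."""
    # int() tolerates the surrounding whitespace left over after splitting on ';'.
    return sorted({int(part) for part in raw.split(";") if part.strip()})


def parse_float_range(raw: str) -> Tuple[float, float]:
    """Read '92.82 - 100.0' (or a single value like '91.39') as a (low, high) pair."""
    # Split on ' - ' (with spaces) so a plain number is returned as a one-element list.
    parts = [float(part) for part in raw.split(" - ")]
    return min(parts), max(parts)


lengths = parse_int_list("3561; 3564; 3570")
identity = parse_float_range("92.82 - 100.0")
```

How to collapse these into the single int and float the hAMRonized record expects (for example the minimum, the maximum, or one output row per allele) is a design decision this sketch does not make.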

Hi @MicroSeq, thanks for flagging this, and apologies for the delay; I've been out of the country. That parser was actually developed by the lead RGI-bwt developer, so hopefully it's just gotten a little out of date with respect to the type checking.

@raphenya could you have a look and let me know which numerical fields in RGI-bwt outputs are ever going to be multiple numbers in a list or ranges?
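In the meantime, one rough way to check this empirically is to scan existing gene_mapping_data.txt files and flag any column whose values look like a list or a range of numbers. This is only a sketch: it assumes tab-separated input with a header row, and the function name find_multivalued_columns is hypothetical. It can only report what appears in the files it is given, so it does not replace confirmation from the RGI side.

```python
import csv
import re
from typing import Dict, Iterable

NUMBER = r"\d+(?:\.\d+)?"
# Two or more numbers separated by ';', e.g. '3561; 3564; 3570'.
LIST_RE = re.compile(rf"^{NUMBER}(?:\s*;\s*{NUMBER})+$")
# Two numbers separated by a spaced dash, e.g. '92.82 - 100.0'.
RANGE_RE = re.compile(rf"^{NUMBER}\s+-\s+{NUMBER}$")


def find_multivalued_columns(lines: Iterable[str]) -> Dict[str, str]:
    """Map column name -> 'list' or 'range' for columns holding several numbers."""
    flagged: Dict[str, str] = {}
    for row in csv.DictReader(lines, delimiter="\t"):
        for column, value in row.items():
            # DictReader uses a None key or None value for ragged rows; skip those.
            if column is None or value is None:
                continue
            value = value.strip()
            if LIST_RE.match(value):
                flagged[column] = "list"
            elif RANGE_RE.match(value):
                flagged[column] = "range"
    return flagged


sample = (
    "Reference Length\tIdentity\tAll Mapped Reads\n"
    "3561; 3564; 3570\t92.82 - 100.0\t668\n"
)
report = find_multivalued_columns(sample.splitlines())
```

For a real file, passing an open file handle (opened with newline="") in place of sample.splitlines() should work the same way.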

@raphenya just a bump to see if you've had a chance to look at this and can let me know which RGI-bwt fields can be ranges vs. single numerical values.