phac-nml/biohansel

Fix inconsistent subtype function

Closed this issue · 3 comments

Sample SRR1958005 returns multiple subtype calls 2.1; 2.2 with all subtype calls being 2; 2.1; 2.2. Subtypes 2.1 and 2.2 are not consistent with each other.

Testing bio_hansel -s enteritids on the Enterobase genome assembly of SRR1958005 (46121.fasta) produced the following results:

sample scheme subtype all_subtypes tiles_matching_subtype are_subtypes_consistent inconsistent_subtypes n_tiles_matching_all n_tiles_matching_all_total n_tiles_matching_positive n_tiles_matching_positive_total n_tiles_matching_subtype n_tiles_matching_subtype_total file_path
46121 enteritidis 2.1; 2.2 2; 2.1; 2.2 134979-2.1; 1402592-2.1; 2483597-2.2 TRUE   652 652;652 14 13;36 3 2;25 /home/peter/Downloads/46121.fasta

Other samples which should have been are_subtypes_consistent == False include:

  • SRR3049064; all_subtypes == "2; 2.1; 2.2.2.1.1"
  • SRR1968431; all_subtypes == "2; 2.1; 2.2.2.1.1"
  • ERR1069299; all_subtypes == "2; 2.1; 2.2.2"
  • SRR1968272; all_subtypes == "2; 2.1; 2.2.2"

See function here:
https://github.com/phac-nml/bio_hansel/blob/master/bio_hansel/utils.py#L80

Also, would be a good idea to add a test for that function.

When you mean inconsistent, is that because in 2.1 and 2.2 the ._ value is different?

Yes, so the inconsistent subtypes would be 2.1 and 2.2. I originally had the function try to find the highest resolution subtype (i.e. longest string of numbers) and use that as the reference to compare against, but in cases where the highest resolution subtypes are equal length, the function fails to report inconsistencies.

I think I found an issue.
https://github.com/phac-nml/bio_hansel/blob/ddee5c948b40722e2fe11282809fa4b05faf3292/bio_hansel/utils.py#L98
I believe it's supposed to freq >= 1.

If both strings are same length, then >= 1 would add both to the incon_subtypes list.