Fix inconsistent subtype function
Closed this issue · 3 comments
Sample SRR1958005 returns multiple subtype calls 2.1; 2.2
with all subtype calls being 2; 2.1; 2.2
. Subtypes 2.1
and 2.2
are not consistent with each other.
Testing bio_hansel -s enteritids
on the Enterobase genome assembly of SRR1958005 (46121.fasta
) produced the following results:
sample | scheme | subtype | all_subtypes | tiles_matching_subtype | are_subtypes_consistent | inconsistent_subtypes | n_tiles_matching_all | n_tiles_matching_all_total | n_tiles_matching_positive | n_tiles_matching_positive_total | n_tiles_matching_subtype | n_tiles_matching_subtype_total | file_path |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
46121 | enteritidis | 2.1; 2.2 | 2; 2.1; 2.2 | 134979-2.1; 1402592-2.1; 2483597-2.2 | TRUE | 652 | 652;652 | 14 | 13;36 | 3 | 2;25 | /home/peter/Downloads/46121.fasta |
Other samples which should have been are_subtypes_consistent == False
include:
- SRR3049064;
all_subtypes == "2; 2.1; 2.2.2.1.1"
- SRR1968431;
all_subtypes == "2; 2.1; 2.2.2.1.1"
- ERR1069299;
all_subtypes == "2; 2.1; 2.2.2"
- SRR1968272;
all_subtypes == "2; 2.1; 2.2.2"
See function here:
https://github.com/phac-nml/bio_hansel/blob/master/bio_hansel/utils.py#L80
Also, would be a good idea to add a test for that function.
When you mean inconsistent, is that because in 2.1
and 2.2
the ._ value is different?
Yes, so the inconsistent subtypes would be 2.1 and 2.2. I originally had the function try to find the highest resolution subtype (i.e. longest string of numbers) and use that as the reference to compare against, but in cases where the highest resolution subtypes are equal length, the function fails to report inconsistencies.
I think I found an issue.
https://github.com/phac-nml/bio_hansel/blob/ddee5c948b40722e2fe11282809fa4b05faf3292/bio_hansel/utils.py#L98
I believe it's supposed to freq >= 1
.
If both strings are same length, then >= 1 would add both to the incon_subtypes
list.