Mykrobe-tools/mykrobe

make-probes genbank file key error

arnoldbain opened this issue ยท 7 comments

Mykrobe make-probes doesnot seem to like protein locus tag (e.g. Rv1979c) but ok with the name tags like rpoB; below is the error;
mykrobe variants make-probes -t gene.txt -g H37Rv.gbk rH37Rv.fasta > probes.gene.fa

[mykrobe 2021-06-11T13:44:34 WARNING] Could not connect to database. Continuing without using genetic backgrounds
Traceback (most recent call last):
File "/home/arnold/miniconda3/bin/mykrobe", line 11, in
sys.exit(main())
File "/home/arnold/miniconda3/lib/python3.8/site-packages/mykrobe/cli.py", line 8, in main
args.func(parser, args)
File "/home/arnold/miniconda3/lib/python3.8/site-packages/mykrobe/parser.py", line 55, in run_subtool
run(parser, args)
File "/home/arnold/miniconda3/lib/python3.8/site-packages/mykrobe/cmds/makeprobes.py", line 69, in run
for var_name in aa2dna.get_variant_names(
File "/home/arnold/miniconda3/lib/python3.8/site-packages/mykrobe/annotation/genes/models.py", line 181, in get_variant_names
gene = self.get_gene(gene)
File "/home/arnold/miniconda3/lib/python3.8/site-packages/mykrobe/annotation/genes/models.py", line 237, in get_gene
return self.genbank[gene]
KeyError: 'Rv1979c'

The problem is that Rv1979c is not in the genbank file we use. It was added to the annotation recently so we need to update the genbank file.

In our email correspondence I asked @arnoldbain to download the latest genbank file (which has this locus) and still get's the same issue. This locus does not have a "gene" name and the way mykrobe currently indexes genbank files relies on there being a gene name. For instance, this is the entry for the locus @arnoldbain is looking at

     gene            complement(2221719..2223164)
                     /locus_tag="Rv1979c"
                     /db_xref="GeneID:885819"

So it seems we need update our genbank indexing method @martinghunt

As a work-around in the meantime @arnoldbain you could try adding a gene name to the annotation for Rv1979c in the latest genbank file. You should find it on line 41310 and 41313 in the latest genbank file.

So it looks something like

     gene            complement(2221719..2223164)
                     /locus_tag="Rv1979c"
                     /gene="Rv1979c"
                     /db_xref="GeneID:885819"
     CDS             complement(2221719..2223164)
                     /locus_tag="Rv1979c"
                     /gene="Rv1979c"
                     /inference="protein motif:PROSITE:PS00599"

I expect this is non-trivial to fix, and needs careful thought.

If we allow more than one way to refer to the same variant (eg rpoB == Rv0667), that looks like it would break other pieces of the code (background variants, detecting two variants within a kmer of each other ... etc). On the other hand, we currently use gene identifiers, and changing from gene name to exclusively using locus tag would break lots of other things - and also I expect result in the output reporting eg Rv0667 instead of the much more meaningful rpoB.

For now, I recommend using @mbhall88's workaround.

Working fine with @mbhall88 suggestion, thanks guys

I expect this is non-trivial to fix, and needs careful thought.

If we allow more than one way to refer to the same variant (eg rpoB == Rv0667), that looks like it would break other pieces of the code (background variants, detecting two variants within a kmer of each other ... etc). On the other hand, we currently use gene identifiers, and changing from gene name to exclusively using locus tag would break lots of other things - and also I expect result in the output reporting eg Rv0667 instead of the much more meaningful rpoB.

We could just use a sort of hierarchy when we first try and use the gene tag and of that isn't present try and use a locus tag and if that isn't present, move on? That way we would always use rpoB instead of Rv0667

I expect this is non-trivial to fix, and needs careful thought.
If we allow more than one way to refer to the same variant (eg rpoB == Rv0667), that looks like it would break other pieces of the code (background variants, detecting two variants within a kmer of each other ... etc). On the other hand, we currently use gene identifiers, and changing from gene name to exclusively using locus tag would break lots of other things - and also I expect result in the output reporting eg Rv0667 instead of the much more meaningful rpoB.

We could just use a sort of hierarchy when we first try and use the gene tag and of that isn't present try and use a locus tag and if that isn't present, move on? That way we would always use rpoB instead of Rv0667

Need to handle when someone provides two variants in the same gene, but one of them specified by the gene name and the other specified by the locus tag. There are likely more edge cases.

Yes, but that seems to be a corner case we could very easily say we don't handle yet (or may never handle as it seems insanse)