mdmparis/defense-finder

[BUG] miscalculation of alignment coverage

Closed this issue · 11 comments

Describe the bug
The value of alignment coverage is miscalculated.

To Reproduce

lcl|NZ_CALUAQ010000001.1_prot_WP_278902228.1_145 [locus_tag=QDJ88_RS00725] [protein=hypothetical protein] [protein_id=WP_278902228.1] [location=166201..167580] [gbkey=CDS]
MNNHPKFEEYKNFIVNNEAYKGLPYVRKKDCSVVWVATKQSEIGKKRIEWALNKAEKFDISKNDSSFYRKVMFIVHPTKE
KVCQICGESMSLDYIYPNKSFQKFLSKNYDYEPNVFDDIYDVSNHFKEEGKSNLEIIKILNKSIKLKGNFNDLTIYIKSV
IEACRNGEKKSLGPGAMSNFPDRYDGFHSYNRCCRQEFDKGRNPDNMKTYSKDRRAYENWSDGNIHAANKFMSSKYFTGL
SADHIGPISLGFIHDPRFMQPMSSGDNSSKRDRLEKSDIKKLIELESRYNLLPVSWFSEKIWKTLKADFLIDNNINLEYY
RNLMKININNYMSLLYIIITETGDLGKKFLEENFLKTKQEYFKYDYEFDEFGNIISKTFRNISDATKKEYSRSIRISFES
VIDFSEKENRNIKSSFNTNIAKNIDSLVYEINNCAPNKSIVNHFYKIVSAMQDDLLKEA
lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 [gene=dcm] [locus_tag=QDJ88_RS00730] [protein=DNA (cytosine-5-)-methyltransferase] [protein_id=WP_278902230.1] [location=complement(167588..170410)] [gbkey=CDS]
MELYNDLTKNTESYLKTLAMDIKRQQGIYYTDLELTDRIMDKVLEKISAEEINDINFLEPCVGSGNFVISYIKNAALKFK
LNSNEIRKLIKNIYVCDSDKSAIDLFIRNLKYFIKKEYKIDLTNDDLPNIGGSLIYNLQLREDNYIGIDKYFGDIKFDVI
VTNPPYKGLRAERRHYKTDDEYFNGRNYYNKIKEKSKHLHPLSGALNANIYRYFVEEIYTKYSKPDSIISILVPSSILTD
KSCYDLRNKIIDFNITDIIQIPENNKYIDASQAMTNIITDSNHINSSVSIEKLNNGRIENYLVPKKNIVDEAHNNSILIM
SSEDYKLLNKLKSISKVKDHKFIVNMRGELDITNNKDSITSKKTEYPLIRGRDIDRYCEVKYEKIKDYATKEFVNNSSKQ
RYVIKNRLACQQIVNMNKKTRIGFTLIKENTVLANSCNFIFIEDNDYGIDEYYALAILNSKYCDWYFKIFSSNNHVNNYE
IDLLPFPIGNSVQIQKISSLAKEQVLEFSNLRDNQINKIVTEIIDNFFGVENKINLNSTQIDELANKGLKEKNKVLSKKG
ILNDKQYTLSELDLEIIKSVPQGGNWKNIPDKTIEKSKRLMKIRETGGRTTLYGRIDYTNPSYTITTYFNRPGNGCYIHP
TQDRVLTTREGARIQCFPDDYYFYGNQRDILNQIGNAVPPLMGYLIAKKIKENLNVKKSLDLFSGAGGLLYGFKMAGVEH
VLANDIDRSACVTLKINNPEVNVLCDDVTNDYTKEIIIDTAIQNNVDIICGGPPCQGFSLAGFRKSDDPRNKLVLDFADI
IKNVEPKVFVFENVVGLLSYNKGETFNEIKKMFLALGYKLHAETLDFSDYGVPQRRRRVIIIGVRNNINIEPSKLFPDKI
TKNKKISVMETIGDLDTNINSCNMNSKFISLMKNEISYDYYLDSIKENCENEIGEQLSIF

Steps to reproduce the behavior:

  1. defense-finder run -o . -w 1 input_file

  2. The output is
    lcl|NZ_CALUAQ010000001.1_prot_WP_278902228.1_145 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 143 459 RM_Type_II__Type_II_REases 2.400e-11 40.200 0.5380.401 243 426
    lcl|NZ_CALUAQ010000001.1_prot_WP_278902228.1_145 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 143 459 RM_Type_II__Type_II_REases 1.100e-16 57.700 0.4690.344 78 235
    lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 144 940 DISARM_2__drmMII 8.700e-58 192.700 0.563 0.230699 914
    lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 144 940 Druantia_II__DruM 2.800e-57 191.200 0.525 0.210699 895
    lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 144 940 RM_Type_II__Type_II_MTases 1.700e-05 20.700 0.4100.169 541 699
    lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 144 940 RM_Type_II__Type_II_MTases 1.400e-07 27.000 0.5490.251 13 248
    lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 144 940 RM_Type_II__Type_II_MTases 6.200e-08 28.100 1.0460.422 11 407
    lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 144 940 RM_Type_II__Type_II_MTases 2.900e-15 52.800 0.5050.186 697 871
    lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 144 940 RM_Type_II__Type_II_MTases 5.600e-26 87.900 0.5490.196 700 883
    lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 144 940 RM_Type_II__Type_II_MTases 7.700e-28 94.100 0.5000.187 699 874
    lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 144 940 RM_Type_II__Type_II_MTases 4.500e-46 154.100 0.5380.227 698 910
    lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 144 940 RM_Type_II__Type_II_MTases 2.900e-48 161.300 0.5980.229 699 913
    lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 144 940 RM_Type_II__Type_II_MTases 2.900e-52 174.700 0.9210.647 90 697
    lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146 GCF_943914075.1_IASS5iDOpI_bin.1.MAG_translated_cds 144 940 RM_Type_II__Type_II_MTases 2.000e-61 204.900 0.6090.231 698 914

  3. The file I got is attached / contains ...

  4. I got the following error : ....

Expected behavior

  1. When we look at hit_profile_cov and hit_seq_cov, we have a very high and low value, so I don't know if it's express between 0 and 1 or 0 and 100.
  2. When I calculate (hit_end_match - hit_begin_match)/hit_sequence_length I did not find either hit_profile_cov or hit_seq_cov.

Maybe I musunderstand something.

Screenshots

If applicable, add screenshots to help explain your problem.

Please complete the following information :

OS:

  • [CentOS version 7 ] Linux

*** Version of your the defenseFinder software and Models***

mdmparis-defense-finder==1.2.2
macsyfinder==2.1.1
models 1.2.4

Hi !
Maybe I figured it out the problem. In MSF there is one file for one HMM profile, in DF there are multiple profile in the same file. When MSF/DF look at the profile coverage it use the first length of the first profile. So if the first profile length is short all the assignment of annotation are false.
This issue is a big problem because it affect most of profile and systems like RM.

I suspect that the problem can also affect the GA cutoff, but I'm not sure about this one.

Hello,

Thanks for the report. I tried with your example attached.

test.txt

I got for the sequence coverage the correct values (see last columns):

 awk 'NR==1{print $2, $3, $14, $17, $18, $19, $20, "seq_cov" }NR>1{print $2, $3, $14, $17, $18, $19, $20, ($20-$19+1)/$14}' FS="\t" OFS="\t" test_defense_finder_genes.tsv
hit_id	gene_name	hit_seq_len	hit_profile_cov	hit_seq_cov	hit_begin_match	hit_end_match	seq_cov
lcl|NZ_CALUAQ010000001.1_prot_WP_278902228.1_145	RM_Type_II__Type_II_REases	459	0.469	0.344	78	235	0.344227
lcl|NZ_CALUAQ010000001.1_prot_WP_278902230.1_146	RM_Type_II__Type_II_MTases	940	0.609	0.231	698	914	0.230851

For the profile coverage (as for the sequence coverage), they are computed by macsyfinder, but I can confirm that there is an issue with the coverage reported when inspecting the tmp files.

For instance, when looking at defense-finder-tmp/RM/hmmer_results/RM_Type_II__Type_II_REases.search_hmm.out

we have :

Query:       RM_Type_II__Type_II_REases___Type_II_REase13  [M=370]
Scores for complete sequences (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Sequence                                         Description
    ------- ------ -----    ------- ------ -----   ---- --  --------                                         -----------
    7.2e-30   91.5  18.4    1.4e-19   57.7   0.6    2.4  2  lcl|NZ_CALUAQ010000001.1_prot_WP_278902228.1_145  [locus_tag=QDJ88_RS00725] [prote


Domain annotation for each sequence (and alignments):
>> lcl|NZ_CALUAQ010000001.1_prot_WP_278902228.1_145  [locus_tag=QDJ88_RS00725] [protein=hypothetical protein] [protein_i
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
 ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----
   1 !   57.7   0.6     7e-20   1.4e-19       7     129 ..      78     235 ..      72     242 .. 0.74
   2 !   40.2   9.3   1.5e-14     3e-14     190     330 ..     243     426 ..     236     447 .. 0.82

Macsyfinder takes the best domain to compute coverage, so it would be : (129-7+1)/370 = 0.332

But the extract by MacSyfinder shows :

➜  issue68 more defense-finder-tmp/RM/hmmer_results/RM_Type_II__Type_II_REases.res_hmm_extract
# gene: RM_Type_II__Type_II_REases extract from /home/jean/Nextcloud/Work/DefenseFinder/issues/issue68/defense-finder-tmp/RM/hmmer_results/RM_Type_II__Type_II_REases.search_hmm.out hmm output
# profile length= 262
# i_evalue threshold= 0.001
# coverage threshold= 0.400
# hit_id replicon_name position_hit hit_sequence_length gene_name gene_system i_eval score profile_coverage sequence_coverage begin end
lcl|NZ_CALUAQ010000001.1_prot_WP_278902228.1_145	test	1	459	RM_Type_II__Type_II_REases	3.000e-14	40.200	0.538	0.401	243	426
lcl|NZ_CALUAQ010000001.1_prot_WP_278902228.1_145	test	1	459	RM_Type_II__Type_II_REases	1.400e-19	57.700	0.469	0.344	78	235

so, here we would expect 0.469 as profile coverage. We can find it using the size of the first profile of the file : (129-7+1)/262 = 0.469

So I guess we can reopen the macsyfinder issue !

as for the cut_ga, it's handled by hmmer, so it should be fine. Let us know if you see something weird there.

Hi!
Maybe my example was not good enough. But I agree with what reply @bneron on the MSF issue, I found out myself that you can't use multiple profile in one file. The owncooked HMM manager of MSF does not allow this.
I could calculate that this error can impact the prediction with a flase rate of 1 to 10% and that it concern most of the RM prediction.
My suggestion is to update the database, associate one file to one profile. The other option is to really use pyhmmer instead of MSF HMM manager and use MSF detection as an API.

Yes, it affects only RM + Retron. We will release in the next version, one HMM per file.

For pyhmmer, we could but we definitely don't have the time for that, but if you want to propose a PR with that, you're welcome !

Thanks anyway for finding the bug !

(and we're ok that the sequence coverage is good?)

Hi,
I did not find any bug on the sequence coverage but it could be a good thing to add some unitary test to check it out there is no problem.

The problem with several profile or one per file does not come from using HMMER or pyhmmer. But at the beginning of the project, it has been decided to have one profile per gene and this file must be named according to the gene. So MSF rely on this at various steps of the program.
If we change this behavior we have to fix that in lot of parts of MSF. Furthermore If we switch to allow several profiles per file MSF should rely on the NAME field in each profile. But lot of provided models have one profile per file but the field NAME in the profile does not match the gene name.
So if we decide to support several profiles per file this will have huge impact on code and distributed models with backwards compatibilities problems.
It can be discussed for the future of MSF but it's not just a bug fix.

@bneron
Maybe we could open a discussion issue on MSF to talk about it?
Anyway, I will quickly give my opinion. A profile should be considered as itself with its property. So, the field should be the primary choice and not the name of the file. This way, if you prefer to use a profile name or accession ID as a file name, it doesn't matter. Moreover, if multiple profiles are allowed in one file in the future, it will not have any impact because the profile will be considered as itself.
I would be pleased to develop my point if you wish, but it does not belong in this issue.
Thanks and regards.

This issue has been inactive for 60 days and is now marked as stale. It will be closed in 7 days without further activity.

This is fixed in #73 and with models 2.0