Multiple annotations per kmer, prevents summarization
jacordova opened this issue · 6 comments
Hello,
Thanks for this great piece of software!
I've run into an issue where the annotate_hits_pyseer script adds multiple annotations to some kmers. From a quick glance, it seems that the multiple annotations are from the same reference. See examples below:
TTTTCTTTTATCA [...] CFT073_Chromosome:2459877-2459889;ABR-0078744;;ABR-0078745,CFT073_Chromosome:1488731-1488743;ABR-0077763;;ABR-0077765
TTACTCAATAAT [...] CFT073_Chromosome:2886042-2886053;ABR-0079164;ABR-0079164;ABR-0079164,CFT073_Chromosome:4541738-4541749;ABR-0080970;ABR-0080970;ABR-0080970,CFT073_Chromosome:4234290-4234301;ABR-0080626;ABR-0080626;ABR-0080626
As a result, this causes the summarise_annotations.py script to error out:
  File "/jcordova2/Pyseer/pyseer-master/scripts/summarise_annotations.py", line 70, in <module>
    (position, down, inside, up) = annotation.split(";")
ValueError: too many values to unpack (expected 4)
After I remove the additional annotations manually, the summarise_annotations.py script is able to proceed. Do you have any recommendations on limiting to a single annotation? Or potentially forcing the multiple annotations to move to a new line?
Thanks!
This sounds more like a bug in the summarise annotations script that we should fix. Will try and get round to that in the coming weeks.
Thanks, I appreciate it!
@jacordova I am just looking into this and I'm not seeing why either of these two inputs would trigger this error. Could you please attach the offending file here so I can try to reproduce the error?
@johnlees Sure, the files are attached below.
This file is the input to annotate_hits_pyseer
significant_anaerobic_kmers_1E-05.txt
This file is the input to summarise_annotations.py
sig_kmer_anaerobic_annotation_1E-05.txt
The line this is failing on is:
TTTTCCCTCGATCTCTCCCTGGTCAGCCAGCAGTTTCTCTCCCTTGAATTTGCGCAGGTGCTGGACAAAATGGCCTACCTGACGATATGGCAGGGCGATG\t4.90E-01\t5.05E-02\t4.37E-06\t2.72E-03\t5.59E-04\t4.45E-01\tDEC10Auw,DEC10Buw,DEC10Cuw,DEC10Duw,DEC10Euw,DEC10Fuw,DEC11Auw,DEC11Buw,DEC11Cuw,DEC11Duw,DEC11Euw,DEC12Auw,DEC12Buw,DEC12Cuw,DEC12Duw,DEC12Euw,DEC13Auw,DEC13Buw,DEC13Cuw,DEC13Duw,DEC13Euw,DEC14Auw,DEC14Cuw,DEC14Duw,DEC14Euw,DEC15Auw,DEC15Buw,DEC15Cuw,DEC15Duw,DEC15Euw,DEC6Cuw,DEC6Duw,DEC6Euw,DEC7Auw,DEC7Buw,DEC7Cuw,DEC7Duw,DEC7Euw,DEC8Auw,DEC8Buw,DEC8Cuw,DEC8Duw,DEC8Euw,DEC9Auw,DEC9Buw,DEC9Cuw,DEC9Duw,DEC9Euw\tCFT073uw,DEC14Buw,DEC1Auw,DEC1Buw,DEC1Cuw,DEC1Duw,DEC1Euw,DEC2Auw,DEC2Buw,DEC2Cuw,DEC2Duw,DEC2Euw,DEC3Auw,DEC3Buw,DEC3Cuw,DEC3Duw,DEC3ELuw,DEC3ESuw,DEC3Fuw,DEC4Auw,DEC4Buw,DEC4Cuw,DEC4Duw,DEC4Euw,DEC4Fuw,DEC5Auw,DEC5Buw,DEC5Cuw,DEC5Duw,DEC6Auw,DEC6Buw,EDL933uw,MG1655uw,O157H7_82uw,O157H7_83uw,O157H7_84uw,O157H7_85uw,O157H7_86uw,O157H7_87uw,O157H7_88uw,O157H7_89uw,O157H7_90uw,O157H7_91uw,O157H7_92uw,O157H7_93uw,O157H7_94uw,O157H7_95uw,SAKAIuw,UTI89uw,W3110uw\tEco_Sakai_Chromosome;:675190-675289;;;
The problem is "Eco_Sakai_Chromosome;". Is there a semi-colon where these chromosome names are defined, e.g. in the fasta header?
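To make the failure concrete, here is a minimal sketch of why the extra semi-colon breaks the unpacking. It only mirrors the `annotation.split(";")` line from the traceback above; `parse_annotation` and `headers_with_semicolons` are hypothetical helper names, not part of pyseer, and the FASTA header in the usage lines is an invented example of what the reference file might contain.

```python
# Illustration only: hypothetical helpers, not the actual pyseer code.

def parse_annotation(annotation):
    """Split one 'position;down;inside;up' annotation into its four fields."""
    # Same unpacking as in the traceback: exactly four ';'-separated fields.
    position, down, inside, up = annotation.split(";")
    return position, down, inside, up


def headers_with_semicolons(fasta_lines):
    """Return sequence names (first word of each '>' header) containing ';'."""
    flagged = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            if ";" in name:
                flagged.append(name)
    return flagged


# An annotation from the first post: four fields, parses fine.
good = "CFT073_Chromosome:2886042-2886053;ABR-0079164;ABR-0079164;ABR-0079164"
# The annotation from the failing line: the ';' in the name adds a fifth field.
bad = "Eco_Sakai_Chromosome;:675190-675289;;;"

print(parse_annotation(good))
try:
    parse_annotation(bad)
except ValueError as err:
    print("failed:", err)

# Assumed header content, for illustration only.
example_fasta = [">Eco_Sakai_Chromosome; example description", "ACGT",
                 ">CFT073_Chromosome", "ACGT"]
print(headers_with_semicolons(example_fasta))
```

Running a check like `headers_with_semicolons` over the reference FASTA before annotating would flag any sequence name that needs renaming.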
Agh, looks like that was it! It runs great now. Thank you.