bebop/poly

Failed to parse Saccharomyces cerevisiae S288C chromosome IX

Koeng101 opened this issue · 8 comments

The <155222 is confusing parseLocation.

2023/05/29 10:24:00 Failed to parse ix.gb with err: strconv.Atoi: parsing "<155222": invalid syntax

What does it even mean tho

https://www.ncbi.nlm.nih.gov/nuccore/NC_001141.2

     gene            <155222..>155765
                     /gene="COX5B"
                     /locus_tag="YIL111W"
                     /db_xref="GeneID:854695"
     mRNA            join(<155222,155311..>155765)
                     /gene="COX5B"
                     /locus_tag="YIL111W"
                     /product="cytochrome c oxidase subunit Vb"
                     /transcript_id="NM_001179459.1"
                     /db_xref="GeneID:854695"
     CDS             join(155222,155311..155765)
                     /gene="COX5B"
                     /locus_tag="YIL111W"
                     /experiment="EXISTENCE:direct assay:GO:0005739
                     mitochondrion [PMID:16823961|PMID:24769239]"
                     /experiment="EXISTENCE:direct assay:GO:0005751
                     mitochondrial respiratory chain complex IV [PMID:2986105]"
                     /experiment="EXISTENCE:direct assay:GO:0006123
                     mitochondrial electron transport, cytochrome c to oxygen
                     [PMID:1331058]"
                     /experiment="EXISTENCE:mutant phenotype:GO:0004129
                     cytochrome-c oxidase activity [PMID:2986105]"
                     /experiment="EXISTENCE:mutant phenotype:GO:0050421 nitrite
                     reductase (NO-forming) activity [PMID:18388202]"
                     /note="Subunit Vb of cytochrome c oxidase; cytochrome c
                     oxidase is the terminal member of the mitochondrial inner
                     membrane electron transport chain; Cox5Bp is predominantly
                     expressed during anaerobic growth while its isoform Va
                     (Cox5Ap) is expressed during aerobic growth; COX5B has a
                     paralog, COX5A, that arose from the whole genome
                     duplication"
                     /codon_start=1
                     /product="cytochrome c oxidase subunit Vb"
                     /protein_id="NP_012155.1"
                     /db_xref="GeneID:854695"
                     /db_xref="SGD:S000001373"
                     /translation="MLRTSLTKGARLTGTRFVQTKALSKATLTDLPERWENMPNLEQK
                     EIADNLTERQKLPWKTLNNEEIKAAWYISYGEWGPRRPVHGKGDVAFITKGVFLGLGI
                     SFGLFGLVRLLANPETPKTMNREWQLKSDEYLKSKNANPWGGYSQVQSK"

> What does it even mean tho

It looks like the parser tried to convert the string <155222 to an integer, which fails because it contains the non-numeric character "<". Looks like an off-by-one error when extracting the integer substring.
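For reference, here is a minimal sketch of how the failure reproduces and one way a single position could be read. `Position` and `parsePosition` are hypothetical names made up for this comment, not poly's actual types or functions, and the sketch assumes at most one leading `<` or `>` per position.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Position is a hypothetical type for this sketch (not part of poly).
// Partial records whether the text carried a leading "<" or ">".
type Position struct {
	Base    int
	Partial bool
}

// parsePosition strips one leading "<" or ">" before calling strconv.Atoi.
func parsePosition(s string) (Position, error) {
	s = strings.TrimSpace(s)
	partial := strings.HasPrefix(s, "<") || strings.HasPrefix(s, ">")
	if partial {
		s = s[1:]
	}
	base, err := strconv.Atoi(s)
	if err != nil {
		return Position{}, fmt.Errorf("parsing position %q: %w", s, err)
	}
	return Position{Base: base, Partial: partial}, nil
}

func main() {
	// Same error as in the log: Atoi rejects the "<" prefix.
	_, err := strconv.Atoi("<155222")
	fmt.Println(err)

	// With the marker stripped first, the number parses and the flag is kept.
	pos, err := parsePosition("<155222")
	fmt.Println(pos.Base, pos.Partial, err)
}
```

Keeping the marker as a flag rather than discarding it means the information is still there if the record is written back out later.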

I know what the code means, but it is pretty unclear what it biologically means. All 3 of those are referring to the same gene/mRNA/CDS... but each one uses a different location string - and it looks like the gene at least is lossy.

<155222..>155765 doesn't make sense because it doesn't say where the gene actually starts (unlike join(<155222,155311..>155765), which basically says there is an intron between 155222 and 155311, and then the gene continues from 155311 to 155765). The better way to write that would be join(155222,155311..155765), but semantically I think they mean the same thing.
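To make the comparison between the three location strings concrete, here is a rough sketch that splits each one into segments and adds up their lengths. `Segment`, `parseEndpoint`, `parseSimpleLocation`, and `splicedLength` are hypothetical names for this comment only, not poly's API. The sketch assumes 1-based inclusive coordinates and handles only `N`, `N..M`, and a flat `join(...)`; `complement()`, `order()`, and remote references are left out.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Segment is a hypothetical type for this sketch, not poly's own structure.
// Coordinates are 1-based and inclusive, as in GenBank feature tables.
type Segment struct {
	Start, End               int
	StartPartial, EndPartial bool // set when the endpoint carried "<" or ">"
}

// parseEndpoint strips one leading "<" or ">" and converts the rest.
func parseEndpoint(s string) (int, bool, error) {
	s = strings.TrimSpace(s)
	partial := strings.HasPrefix(s, "<") || strings.HasPrefix(s, ">")
	if partial {
		s = s[1:]
	}
	n, err := strconv.Atoi(s)
	return n, partial, err
}

// parseSimpleLocation handles only "N", "N..M", and a flat join(...) of those.
func parseSimpleLocation(loc string) ([]Segment, error) {
	loc = strings.TrimSpace(loc)
	if strings.HasPrefix(loc, "join(") && strings.HasSuffix(loc, ")") {
		loc = loc[len("join(") : len(loc)-1]
	}
	var segments []Segment
	for _, part := range strings.Split(loc, ",") {
		startText, endText, isRange := strings.Cut(part, "..")
		start, startPartial, err := parseEndpoint(startText)
		if err != nil {
			return nil, fmt.Errorf("bad start in %q: %w", part, err)
		}
		// A lone position is treated as a one-base segment.
		seg := Segment{Start: start, End: start, StartPartial: startPartial}
		if isRange {
			end, endPartial, err := parseEndpoint(endText)
			if err != nil {
				return nil, fmt.Errorf("bad end in %q: %w", part, err)
			}
			seg.End, seg.EndPartial = end, endPartial
		}
		segments = append(segments, seg)
	}
	return segments, nil
}

// splicedLength sums the inclusive lengths of all segments.
func splicedLength(segments []Segment) int {
	total := 0
	for _, s := range segments {
		total += s.End - s.Start + 1
	}
	return total
}

func main() {
	for _, loc := range []string{
		"<155222..>155765",
		"join(<155222,155311..>155765)",
		"join(155222,155311..155765)",
	} {
		segments, err := parseSimpleLocation(loc)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-32s %d segment(s), %d bases\n", loc, len(segments), splicedLength(segments))
	}
}
```

Both join forms come out to 456 bases, which matches the 151-residue translation plus a stop codon. The plain gene range spans 544 bases; the 88-base difference is the intron at 155223..155310, which the gene location alone cannot tell you about.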

Status update on this? Does it still need fixing?

I don't think it has been fixed. It does need fixing.

I think the difficult part here is parsing out the join properly - without keeping a map of locus_tags, I'm not sure you can even parse <155222..>155765 properly, at all. It doesn't contain all the information necessary to get the sequence out. We could also just accept that it is fucked up, and not try to fix it all. I kinda like that solution. Here is what SnapGene displays:

[Image: SnapGene screenshot, "Screen Shot 2023-09-15 at 2 07 58 PM"]

I personally think this is a fine solution so long as we note it somewhere. We should probably have a note somewhere in the file of all the location exception cases we find.

Should be fixed in #394 @Koeng101?

Probably not. I think the time to fix this would be after the merge of ioToBio.

This issue has had no activity in the past 2 months. Marking as stale.

This will be fixed once #437 is merged as a part of #434.

To clarify, the < and > markers indicate that a feature is partial, i.e. its true endpoint lies beyond the stated position: <155222..>155765 means the feature starts before base 155222 and ends after base 155765.