Failed to parse Saccharomyces cerevisiae S288C chromosome IX
Koeng101 opened this issue · 8 comments
The `<155222` is confusing parseLocation.
2023/05/29 10:24:00 Failed to parse ix.gb with err: strconv.Atoi: parsing "<155222": invalid syntax
What does it even mean tho
https://www.ncbi.nlm.nih.gov/nuccore/NC_001141.2
gene <155222..>155765
/gene="COX5B"
/locus_tag="YIL111W"
/db_xref="GeneID:854695"
mRNA join(<155222,155311..>155765)
/gene="COX5B"
/locus_tag="YIL111W"
/product="cytochrome c oxidase subunit Vb"
/transcript_id="NM_001179459.1"
/db_xref="GeneID:854695"
CDS join(155222,155311..155765)
/gene="COX5B"
/locus_tag="YIL111W"
/experiment="EXISTENCE:direct assay:GO:0005739
mitochondrion [PMID:16823961|PMID:24769239]"
/experiment="EXISTENCE:direct assay:GO:0005751
mitochondrial respiratory chain complex IV [PMID:2986105]"
/experiment="EXISTENCE:direct assay:GO:0006123
mitochondrial electron transport, cytochrome c to oxygen
[PMID:1331058]"
/experiment="EXISTENCE:mutant phenotype:GO:0004129
cytochrome-c oxidase activity [PMID:2986105]"
/experiment="EXISTENCE:mutant phenotype:GO:0050421 nitrite
reductase (NO-forming) activity [PMID:18388202]"
/note="Subunit Vb of cytochrome c oxidase; cytochrome c
oxidase is the terminal member of the mitochondrial inner
membrane electron transport chain; Cox5Bp is predominantly
expressed during anaerobic growth while its isoform Va
(Cox5Ap) is expressed during aerobic growth; COX5B has a
paralog, COX5A, that arose from the whole genome
duplication"
/codon_start=1
/product="cytochrome c oxidase subunit Vb"
/protein_id="NP_012155.1"
/db_xref="GeneID:854695"
/db_xref="SGD:S000001373"
/translation="MLRTSLTKGARLTGTRFVQTKALSKATLTDLPERWENMPNLEQK
EIADNLTERQKLPWKTLNNEEIKAAWYISYGEWGPRRPVHGKGDVAFITKGVFLGLGI
SFGLFGLVRLLANPETPKTMNREWQLKSDEYLKSKNANPWGGYSQVQSK"
What does it even mean tho
It looks like the string `<155222` was passed to the integer parser, which fails because it contains the non-numeric character "<". Looks like an off-by-one error when extracting the integer string.
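A minimal sketch of one way around the `strconv.Atoi` failure, assuming the parser only needs the coordinate plus a flag recording that a marker was present. The names `position` and `parsePosition` are hypothetical and are not taken from the project's code:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// position is a hypothetical type for one endpoint of a GenBank location.
type position struct {
	Base    int  // 1-based coordinate as written in the file
	Partial bool // true when the coordinate carried a '<' or '>' marker
}

// parsePosition strips a leading '<' or '>' before calling strconv.Atoi,
// so an input like "<155222" no longer fails with "invalid syntax".
func parsePosition(s string) (position, error) {
	s = strings.TrimSpace(s)
	partial := strings.HasPrefix(s, "<") || strings.HasPrefix(s, ">")
	if partial {
		s = s[1:]
	}
	base, err := strconv.Atoi(s)
	if err != nil {
		return position{}, fmt.Errorf("parsing position %q: %w", s, err)
	}
	return position{Base: base, Partial: partial}, nil
}
```

Keeping the `Partial` flag instead of silently dropping the marker means the information is still there if the location ever has to be written back out.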
I know what the code means, but it is pretty unclear what it biologically means. All 3 of those are referring to the same gene/mRNA/CDS... but each one uses a different location string - and it looks like the gene at least is lossy.
`<155222..>155765` doesn't make sense because it doesn't say where the gene actually starts, the way `join(<155222,155311..>155765)` does: that one basically says there is an intron between 155222 and 155311, and then the gene runs from 155311 to 155765. The better way to write that would be `join(155222,155311..155765)`, but semantically I think they mean the same thing.
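For reference, in the GenBank feature table format a `<` or `>` in front of a coordinate marks the feature as partial, meaning it extends beyond the stated base. A rough sketch of how the three location strings above could be read into spans, assuming only plain ranges, single bases, and one flat `join(...)` (no `complement`, no nesting). The `span` type and `parseSimpleLocation` are hypothetical names for illustration:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// span is a hypothetical type: one contiguous, 1-based inclusive range.
type span struct {
	Start, End               int
	PartialStart, PartialEnd bool
}

// parseSpan reads "A..B" or a single base "A", with optional '<' / '>' markers.
func parseSpan(s string) (span, error) {
	startStr, endStr, isRange := strings.Cut(strings.TrimSpace(s), "..")
	if !isRange {
		endStr = startStr // a single base such as "155222"
	}
	sp := span{
		PartialStart: strings.HasPrefix(startStr, "<"),
		PartialEnd:   strings.HasPrefix(endStr, ">"),
	}
	var err error
	if sp.Start, err = strconv.Atoi(strings.TrimLeft(startStr, "<>")); err != nil {
		return span{}, fmt.Errorf("bad start in %q: %w", s, err)
	}
	if sp.End, err = strconv.Atoi(strings.TrimLeft(endStr, "<>")); err != nil {
		return span{}, fmt.Errorf("bad end in %q: %w", s, err)
	}
	return sp, nil
}

// parseSimpleLocation splits an optional flat join(...) into its spans.
func parseSimpleLocation(loc string) ([]span, error) {
	loc = strings.TrimSpace(loc)
	if strings.HasPrefix(loc, "join(") && strings.HasSuffix(loc, ")") {
		loc = loc[len("join(") : len(loc)-1]
	}
	var spans []span
	for _, part := range strings.Split(loc, ",") {
		sp, err := parseSpan(part)
		if err != nil {
			return nil, err
		}
		spans = append(spans, sp)
	}
	return spans, nil
}
```

With this reading, the gene location yields a single span covering 155222 to 155765 with both ends partial, while the mRNA and CDS locations each yield two spans (155222 alone, then 155311 to 155765), which makes the difference between the three strings explicit.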
Status update on this? Does it still need fixing?
I don't think it has been fixed. It does need fixing.
I think the difficult part here is parsing out the join properly. Without keeping a map of locus_tags, I'm not sure you can even parse `<155222..>155765` properly, at all. It doesn't contain all the information necessary to get the sequence out. We could also just accept that it is fucked up, and not try to fix it at all. I kinda like that solution. Here is what SnapGene displays:
I personally think this is a fine solution so long as we note it somewhere. We should probably have a note somewhere in the file of all the location exception cases we find.
Probably not. I think the time to fix this would be after the merge of ioToBio.
This issue has had no activity in the past 2 months. Marking as stale.