Latest JHU RefSeqv110 + Liftoff v5 gff3 does not pass gff3 validation
nrockweiler opened this issue · 3 comments
nrockweiler commented
Hi,
I believe there are a fair number of gff3 validation issues with the recent update of JHU RefSeqv110 + Liftoff v5 in commit dafcf67.
I've been using the GenomeTools gff3validator tool to find these issues. Below is a summary of the issues:
- 1115 records have an odd "The" key in the attributes column, e.g.,
$ grep -n -m 1 "The=" chm13v2.0_RefSeq_Liftoff_v5.gff3
80912:chr1 Liftoff CDS 25137221 25137356 . + 0 Parent=NM_001282867.1;db_xref=GeneID:6007;exception=annotated by transcript or proteomic data;gbkey=CDS;gene=RHD;inference=similar to AA sequence (same species):RefSeq:NP_001269796.1;note=isoform 3 is encoded by transcript variant 3;The=RefSeq protein has 1 substitution compared to this genomic sequence;product=blood group Rh(D) polypeptide isoform 3;protein_id=NP_001269796.1;exon_number=4;extra_copy_number=0
- MIR3690_1 is a PAR gene and is on both chrX and chrY. To follow the convention for other PAR genes, I think the copy on chrX should be renamed MIR3690:
$ grep -w "ID=MIR3690_1" chm13v2.0_RefSeq_Liftoff_v5.gff3 | cut -f 1
chrX
chrY
- There is more than 1 ID element on line 3999636 (the IDs are NM_001320962.1 and TSPY10P):
$ grep -n -w "ID=NM_001320962.1;ID=TSPY10P" chm13v2.0_RefSeq_Liftoff_v5.gff3
3999636:chrY Liftoff transcript 9795914 9798710 . + . ID=NM_001320962.1;ID=TSPY10P;Dbxref=GeneID:100289087%2CGenbank:NM_001320962.1%2CHGNC:HGNC:37473;Name=NM_001320962.1;gbkey=mRNA;gene=TSPY10P;product=testis specific protein Y-linked 10%252C transcript variant 2;transcript_id=NM_001320962.1;matches_ref_protein=False;valid_ORF=False;inframe_stop_codon=True;extra_copy_number=0
- 80 records have a malformed key-value pair in the attributes column; the "key" is called IDNM* and there is no value. I think this is supposed to be ID=NM*, e.g.:
$ grep -m 1 -P "\tIDNM" chm13v2.0_RefSeq_Liftoff_v5.gff3
chrY Liftoff exon 9795914 9796445 . + . IDNM_001320962.1-1;ID=NM_001320962.1;Dbxref=GeneID:100289087%2CGenbank:NM_001320962.1%2CHGNC:HGNC:37473;gbkey=mRNA;gene=TSPY10P;product=testis specific protein Y-linked 10%252C transcript variant 2;transcript_id=NM_001320962.1;extra_copy_number=0
- While it didn't come up as a validation issue, I saw a lot of text where I expected ASCII characters but found what look like hex encodings, e.g., GeneID:100289087%2C, testis specific protein Y-linked 10%252C transcript variant, etc. Maybe this has something to do with the mention of "correct[ing the] special character issues from the original file" in the README?
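On the last point, %2C is the percent-encoding of a comma, which GFF3 uses to escape commas inside attribute values, and %25 is the encoding of the percent sign itself. A value such as 10%252C therefore looks like a comma that was escaped twice. The short Python sketch below shows the decoding on the two strings quoted above. The helper name looks_double_encoded is hypothetical and the check is only a heuristic; whether the double encoding was intended is a question for the maintainers.

```python
import re
from urllib.parse import unquote

# Two attribute values copied from the records quoted above.
once = "GeneID:100289087%2CGenbank:NM_001320962.1"
twice = "testis specific protein Y-linked 10%252C transcript variant 2"

# A single decoding pass turns %2C into a comma ...
print(unquote(once))
# ... but only turns %252C back into %2C.
print(unquote(twice))
# A second pass is needed to recover the comma.
print(unquote(unquote(twice)))


def looks_double_encoded(value):
    """Heuristic: an encoded percent sign followed by two hex digits."""
    return re.search(r"%25[0-9A-Fa-f]{2}", value) is not None
```

A count of attribute values for which looks_double_encoded returns True would give a rough idea of how many records are affected.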
Thank you!
Nicole
diekhans commented
The UCSC browser GFF3 parser can't parse this either; it is invalid.
arangrhie commented
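Hello @nrockweiler, thanks for reporting this. We fixed all formatting issues and updated to v5.1 (https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/annotation/chm13v2.0_RefSeq_Liftoff_v5.1.gff3.gz).
@diekhans confirmed that the updated v5.1 passes both the UCSC browser GFF3 parser and GenomeTools gff3validator.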
Let us know in case there are any other issues!
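For anyone who wants to spot-check a later release for the specific problems reported in this thread, here is a minimal Python sketch. It is not how GenomeTools gff3validator or the UCSC parser work, and the names check_attributes and check_gff3 are made up for illustration. It flags attribute pairs that lack a key or a value (the IDNM* case), a tag repeated on one line (the double ID case), and uppercase tags outside the set that the GFF3 specification reserves (the The= case).

```python
import collections

# Attribute tags defined by the GFF3 specification. Other tags that start
# with an uppercase letter are reserved by the spec for future use.
RESERVED_TAGS = {
    "ID", "Name", "Alias", "Parent", "Target", "Gap", "Derives_from",
    "Note", "Dbxref", "Ontology_term", "Is_circular",
}


def check_attributes(attributes):
    """Return a list of problems found in one GFF3 attributes column."""
    problems = []
    seen = collections.Counter()
    for pair in attributes.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, sep, value = pair.partition("=")
        if not sep or not tag or not value:
            problems.append(f"malformed pair: {pair!r}")
            continue
        seen[tag] += 1
        if tag[0].isupper() and tag not in RESERVED_TAGS:
            problems.append(f"unreserved uppercase tag: {tag!r}")
    for tag, count in seen.items():
        if count > 1:
            problems.append(f"tag {tag!r} appears {count} times")
    return problems


def check_gff3(lines):
    """Yield (line_number, problem) for each feature line in an iterable."""
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            yield number, f"expected 9 tab-separated columns, got {len(columns)}"
            continue
        for problem in check_attributes(columns[8]):
            yield number, problem
```

Passing an open file handle to check_gff3 yields (line number, problem) pairs. The sketch does not look for an ID shared across chromosomes, so the MIR3690_1 case would need a separate pass.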
Best,
Arang
nrockweiler commented
Wonderful! Thank you so much.