Latest JHU RefSeqv110 + Liftoff v5 gff3 does not pass gff3 validation

Question

nrockweiler opened this issue a year ago · 3 comments

Hi,

I believe there are a fair number of GFF3 validation issues with the recent update of JHU RefSeqv110 + Liftoff v5 in commit dafcf67.

I've been using the GenomeTools gff3validator tool to find these issues. Below is a summary of the issues:

1115 records have an odd "The" key in the attributes column, e.g.,

$ grep -n -m 1 "The=" chm13v2.0_RefSeq_Liftoff_v5.gff3
80912:chr1	Liftoff	CDS	25137221	25137356	.	+	0	Parent=NM_001282867.1;db_xref=GeneID:6007;exception=annotated by transcript or proteomic data;gbkey=CDS;gene=RHD;inference=similar to AA sequence (same species):RefSeq:NP_001269796.1;note=isoform 3 is encoded by transcript variant 3;The=RefSeq protein has 1 substitution compared to this genomic sequence;product=blood group Rh(D) polypeptide isoform 3;protein_id=NP_001269796.1;exon_number=4;extra_copy_number=0
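One way to surface unexpected keys such as "The" without knowing them in advance is to tally every attribute key in column 9 and look at the rarest ones. The following is a minimal sketch, not part of the original report; the function name `attribute_key_counts` and the abbreviated example records are hypothetical, and the parsing assumes plain tab-separated GFF3 with `;`-separated attributes.

```python
from collections import Counter


def attribute_key_counts(lines):
    """Count how often each attribute key appears in column 9 of GFF3 lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a 9-column feature line
        for pair in fields[8].split(";"):
            key = pair.split("=", 1)[0]
            if key:
                counts[key] += 1
    return counts


# Hypothetical records, abbreviated from the example in this report.
example = [
    "chr1\tLiftoff\tCDS\t25137221\t25137356\t.\t+\t0\t"
    "Parent=NM_001282867.1;gene=RHD;"
    "The=RefSeq protein has 1 substitution compared to this genomic sequence",
    "chr1\tLiftoff\tCDS\t25137221\t25137356\t.\t+\t0\t"
    "Parent=NM_001282867.1;gene=RHD",
]

# Print keys from rarest to most common; rare keys are worth a second look.
for key, n in sorted(attribute_key_counts(example).items(), key=lambda kv: kv[1]):
    print(key, n)
```

To run it on the real file, the list could be replaced by an open file handle, e.g. `attribute_key_counts(open("chm13v2.0_RefSeq_Liftoff_v5.gff3"))`.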

MIR3690_1 is a PAR gene and is on both chrX and chrY. To follow the convention for other PAR genes, I think the copy on chrX should be renamed MIR3690.

$ grep -w "ID=MIR3690_1" chm13v2.0_RefSeq_Liftoff_v5.gff3 | cut -f 1
chrX
chrY
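The same check can be generalized to every ID in the file rather than one gene at a time. This is a rough sketch under the assumption that any ID value seen on more than one sequence deserves a look; the function name `ids_on_multiple_seqids` and the coordinates in the example are made up for illustration.

```python
from collections import defaultdict


def ids_on_multiple_seqids(lines):
    """Return {ID: sorted seqids} for IDs that occur on more than one seqid."""
    seqids_by_id = defaultdict(set)
    for line in lines:
        if line.startswith("#"):
            continue  # skip directives and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a 9-column feature line
        for pair in fields[8].split(";"):
            key, sep, value = pair.partition("=")
            if key == "ID" and sep:
                seqids_by_id[value].add(fields[0])
    return {i: sorted(s) for i, s in seqids_by_id.items() if len(s) > 1}


# Hypothetical records; the coordinates are placeholders, not real positions.
example = [
    "chrX\tLiftoff\tgene\t100\t200\t.\t+\t.\tID=MIR3690_1;gene=MIR3690",
    "chrY\tLiftoff\tgene\t100\t200\t.\t+\t.\tID=MIR3690_1;gene=MIR3690",
    "chr1\tLiftoff\tgene\t300\t400\t.\t+\t.\tID=RHD;gene=RHD",
]
print(ids_on_multiple_seqids(example))
```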

There is more than 1 ID element on line 3999636 (the IDs are NM_001320962.1 and TSPY10P):

$ grep -n -w "ID=NM_001320962.1;ID=TSPY10P" chm13v2.0_RefSeq_Liftoff_v5.gff3
3999636:chrY	Liftoff	transcript	9795914	9798710	.	+	.	ID=NM_001320962.1;ID=TSPY10P;Dbxref=GeneID:100289087%2CGenbank:NM_001320962.1%2CHGNC:HGNC:37473;Name=NM_001320962.1;gbkey=mRNA;gene=TSPY10P;product=testis specific protein Y-linked 10%252C transcript variant 2;transcript_id=NM_001320962.1;matches_ref_protein=False;valid_ORF=False;inframe_stop_codon=True;extra_copy_number=0
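To look for other records with a repeated key, rather than this one specific pair of IDs, the attributes column can be checked for any key that occurs more than once. A small sketch, with `repeated_attribute_keys` as a hypothetical helper name and an abbreviated copy of the attributes above as input:

```python
from collections import Counter


def repeated_attribute_keys(attributes):
    """Return the sorted keys that appear more than once in a column 9 string."""
    keys = [pair.split("=", 1)[0] for pair in attributes.split(";") if pair]
    return sorted(k for k, n in Counter(keys).items() if n > 1)


# Abbreviated from the record on line 3999636.
attrs = "ID=NM_001320962.1;ID=TSPY10P;Name=NM_001320962.1;gene=TSPY10P"
print(repeated_attribute_keys(attrs))
```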

80 records have a malformed key-value pair in the attributes column; the "key" is called IDNM* and there is no value. I think this is supposed to be ID=NM*, e.g.:

$ grep -m 1 -P "\tIDNM" chm13v2.0_RefSeq_Liftoff_v5.gff3
chrY	Liftoff	exon	9795914	9796445	.	+	.	IDNM_001320962.1-1;ID=NM_001320962.1;Dbxref=GeneID:100289087%2CGenbank:NM_001320962.1%2CHGNC:HGNC:37473;gbkey=mRNA;gene=TSPY10P;product=testis specific protein Y-linked 10%252C transcript variant 2;transcript_id=NM_001320962.1;extra_copy_number=0
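A more general version of this search flags any attribute that is not a complete key=value pair, whatever its prefix. The sketch below is only an illustration; `malformed_pairs` is a hypothetical name, and treating an empty key or empty value as malformed is an assumption of this example rather than something the validator output states.

```python
def malformed_pairs(attributes):
    """Return the attribute entries that are not of the form key=value."""
    bad = []
    for pair in attributes.split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, sep, value = pair.partition("=")
        if not sep or not key or not value:
            bad.append(pair)
    return bad


# Abbreviated from the exon record shown above.
attrs = "IDNM_001320962.1-1;ID=NM_001320962.1;gbkey=mRNA;gene=TSPY10P"
print(malformed_pairs(attrs))
```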

While it didn't come up as a validation issue, I saw a lot of text where I expected plain ASCII characters but found what look like hex (percent) encodings, e.g., GeneID:100289087%2C and testis specific protein Y-linked 10%252C transcript variant. Maybe this has something to do with the README's mention of "correct[ing the] special character issues from the original file"?
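For what it's worth, `%2C` is the percent-encoding of a comma, and `%252C` is what results if `%2C` is encoded a second time (`%25` is the encoding of `%`). Whether that is what happened to this file is a guess, but the decoding behavior itself is easy to check with Python's standard `urllib.parse.unquote`, using the two strings quoted above:

```python
from urllib.parse import unquote

dbxref = "GeneID:100289087%2CGenbank:NM_001320962.1"
product = "testis specific protein Y-linked 10%252C transcript variant 2"

# One decoding pass turns %2C into a comma.
dbxref_decoded = unquote(dbxref)

# One pass on %252C only recovers %2C; a second pass is needed for the comma,
# which suggests the value was encoded twice.
product_once = unquote(product)
product_twice = unquote(product_once)

print(dbxref_decoded)
print(product_once)
print(product_twice)
```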

Thank you!
Nicole

Answer 1 · 2023-05-12T21:31:51.000Z

The UCSC browser GFF3 parser can't parse this either; it is invalid.

Answer 2 · 2023-07-06T18:05:00.000Z

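Hello @nrockweiler, thanks for reporting this. We fixed all formatting issues and updated to v5.1 (https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/annotation/chm13v2.0_RefSeq_Liftoff_v5.1.gff3.gz).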

@diekhans confirmed that the updated v5.1 passes both the UCSC browser GFF3 parser and GenomeTools gff3validator.

Let us know in case there are any other issues!

Best,
Arang

Answer 3 · 2023-07-07T13:14:02.000Z

Wonderful! Thank you so much.
