marbl/CHM13

Latest JHU RefSeqv110 + Liftoff v5 gff3 does not pass gff3 validation

nrockweiler opened this issue · 3 comments

Hi,

I believe there are a fair number of gff3 validation issues in the recent update of JHU RefSeqv110 + Liftoff v5 in commit dafcf67.

I've been using the GenomeTools gff3validator tool to find these issues. Below is a summary of the issues:

  • 1115 records have an odd "The" key in the attributes column, e.g.,
$ grep -n -m 1 "The=" chm13v2.0_RefSeq_Liftoff_v5.gff3
80912:chr1	Liftoff	CDS	25137221	25137356	.	+	0	Parent=NM_001282867.1;db_xref=GeneID:6007;exception=annotated by transcript or proteomic data;gbkey=CDS;gene=RHD;inference=similar to AA sequence (same species):RefSeq:NP_001269796.1;note=isoform 3 is encoded by transcript variant 3;The=RefSeq protein has 1 substitution compared to this genomic sequence;product=blood group Rh(D) polypeptide isoform 3;protein_id=NP_001269796.1;exon_number=4;extra_copy_number=0
  • MIR3690_1 is a PAR gene and is on both chrX and chrY. To follow the convention for other PAR genes, I think the copy on chrX should be renamed MIR3690
$ grep -w "ID=MIR3690_1" chm13v2.0_RefSeq_Liftoff_v5.gff3 | cut -f 1
chrX
chrY
  • There is more than 1 ID element on line 3999636 (the IDs are NM_001320962.1 and TSPY10P):
$ grep -n -w "ID=NM_001320962.1;ID=TSPY10P" chm13v2.0_RefSeq_Liftoff_v5.gff3
3999636:chrY	Liftoff	transcript	9795914	9798710	.	+	.	ID=NM_001320962.1;ID=TSPY10P;Dbxref=GeneID:100289087%2CGenbank:NM_001320962.1%2CHGNC:HGNC:37473;Name=NM_001320962.1;gbkey=mRNA;gene=TSPY10P;product=testis specific protein Y-linked 10%252C transcript variant 2;transcript_id=NM_001320962.1;matches_ref_protein=False;valid_ORF=False;inframe_stop_codon=True;extra_copy_number=0
  • 80 records have a malformed key-value pair in the attributes column; the "key" is called IDNM* and there is no value. I think this is supposed to be ID=NM*, e.g.:
$ grep -m 1 -P "\tIDNM" chm13v2.0_RefSeq_Liftoff_v5.gff3
chrY	Liftoff	exon	9795914	9796445	.	+	.	IDNM_001320962.1-1;ID=NM_001320962.1;Dbxref=GeneID:100289087%2CGenbank:NM_001320962.1%2CHGNC:HGNC:37473;gbkey=mRNA;gene=TSPY10P;product=testis specific protein Y-linked 10%252C transcript variant 2;transcript_id=NM_001320962.1;extra_copy_number=0
  • While it didn't come up as a validation issue, I saw a lot of text that looks percent-encoded (hex) where I expected plain ASCII characters, e.g., GeneID:100289087%2C, testis specific protein Y-linked 10%252C transcript variant etc. Maybe this has something to do with the mention of correct[ing the] special character issues from the original file in the README?
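
For anyone who wants to re-check a file for the attribute-column problems listed above, here is a minimal Python sketch. It is not what GenomeTools gff3validator does internally; the function names `check_attributes` and `scan_gff3` are hypothetical, and treating every repeated key (not only ID) as a problem is my assumption. It flags fields with no "=" (like IDNM*), keys that appear more than once (like the two IDs), and values that look percent-encoded twice (like %252C). It cannot detect the stray "The" key, since that is a syntactically valid key.

```python
import re
from collections import Counter


def check_attributes(attr_column):
    """Return a list of problem descriptions for one GFF3 attributes column."""
    problems = []
    keys = []
    for field in attr_column.strip().split(";"):
        if not field:
            continue  # tolerate a trailing ';'
        if "=" not in field:
            # e.g. 'IDNM_001320962.1-1' where 'ID=NM_...' was probably intended
            problems.append(f"no '=' in attribute: {field!r}")
            continue
        key, _value = field.split("=", 1)
        keys.append(key)
    # Assumption: any key repeated within one record is reported, not only ID.
    for key, count in Counter(keys).items():
        if count > 1:
            problems.append(f"key {key!r} appears {count} times")
    # '%25' is an encoded '%', so '%252C' looks like ',' encoded twice.
    if re.search(r"%25[0-9A-Fa-f]{2}", attr_column):
        problems.append("possible double percent-encoding (%25XX)")
    return problems


def scan_gff3(path):
    """Yield (line_number, problem) pairs for every 9-column record in a file."""
    with open(path) as handle:
        for lineno, line in enumerate(handle, start=1):
            if line.startswith("#"):
                continue  # skip directives and comments
            columns = line.rstrip("\n").split("\t")
            if len(columns) != 9:
                continue  # column-count checks are out of scope for this sketch
            for problem in check_attributes(columns[8]):
                yield lineno, problem
```

Calling `scan_gff3("chm13v2.0_RefSeq_Liftoff_v5.gff3")` and printing each pair would give line numbers comparable to the `grep -n` output above.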

Thank you!
Nicole

The UCSC browser GFF3 parser can't parse this either; it is invalid.

Hello @nrockweiler, thanks for reporting this. We fixed all formatting issues and updated to v5.1.

@diekhans confirmed that the updated v5.1 passes both the UCSC browser GFF3 parser and GenomeTools gff3validator.

Let us know in case there are any other issues!

Best,
Arang