GTF format of the original GFF
sum732 opened this issue · 7 comments
Hello,
Some tools, such as Pigeon/SQANTI3 for bulk RNA ISO-Seq, require a GTF file.
Can you please provide a GTF version of chm13.draft_v2.0.gene_annotation.gff3? Original link:
https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/annotation/chm13.draft_v2.0.gene_annotation.gff3
I checked other GTF versions, such as those from https://projects.ensembl.org/hprc/
and the UCSC Table Browser. None of them has the same depth of information as the original GFF3. For example, I cannot find the following entry in any of the GTF files from other sources:
chr1 CAT gene 97934895 97937928 . + . source_gene_common_name=MSTRG.282;source_gene=None;gene_biotype=StringTie;gene_id=CHM13_G0002360;gene_name=MSTRG.282;transcript_modes=exRef;ID=CHM13_G0002360;Name=MSTRG.282;source_transcript=N/A;alternative_source_transcripts=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;paralogy=N/A;unfiltered_paralogy=N/A;alignment_id=N/A;frameshift=N/A;exon_anotation_support=N/A;intron_annotation_support=N/A;transcript_class=N/A;valid_start=N/A;valid_stop=N/A;proper_orf=N/A;extra_paralog=False
I tried to convert the GFF3 to GTF using agat and sorted the result, but Pigeon does not accept it. I tried a few other options, but none of them work.
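For reference, here is a minimal Python sketch of the kind of conversion being asked for: it rewrites the ninth column of one GFF3 line from `key=value;` pairs into GTF-style `key "value";` pairs while keeping every attribute. The function names are hypothetical, this is not what agat does internally, and it is an assumption (not confirmed here) that Pigeon needs `gene_id` and `transcript_id` listed first. It also assumes tab-separated columns and does not decode URL-escaped values.

```python
def parse_gff3_attributes(column):
    """Split a GFF3 attribute column ('k=v;k=v') into a dict, keeping order."""
    attrs = {}
    for field in column.strip().split(";"):
        if not field:
            continue  # tolerate a trailing ';'
        key, _, value = field.partition("=")
        attrs[key.strip()] = value.strip()
    return attrs


def to_gtf_attributes(attrs):
    """Render attributes GTF-style, with gene_id/transcript_id first if present."""
    leading = [k for k in ("gene_id", "transcript_id") if k in attrs]
    rest = [k for k in attrs if k not in ("gene_id", "transcript_id")]
    return " ".join(f'{k} "{attrs[k]}";' for k in leading + rest)


def gff3_line_to_gtf(line):
    """Convert one tab-separated GFF3 feature line to a GTF-style line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns")
    cols[8] = to_gtf_attributes(parse_gff3_attributes(cols[8]))
    return "\t".join(cols)
```

A line-by-line rewrite like this only changes the attribute syntax; it does not add or reorder gene, transcript, and exon records, which a downstream tool may also require.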
It would be great to have a GTF version of the file chm13.draft_v2.0.gene_annotation.gff3.
Many Thanks
SM
Hi @diekhans,
Thanks for replying.
Indeed, that is one option, but most of the details are ignored. Example here
Also, the original GFF3 contains the following entries; please note the START and END of the first three and the next two:
chr1 Liftoff transcript 146568094 146569221 . - . gene_name=NBPF14;source_gene=ENSG00000270629.6;gene_biotype=protein_coding;transcript_biotype=protein_coding;source_transcript=ENST00000619423.4;Name=NBPF14;source_gene_common_name=NBPF14;extra_paralog=False;gene_id=LOFF_G0000157;Parent=LOFF_G0000157;transcript_id=LOFF_T0000224;transcript_name=NBPF14-1;ID=LOFF_T0000224;alignment_id=N/A;alternative_source_transcripts=N/A;paralogy=N/A;unfiltered_paralogy=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;frameshift=N/A;exon_annotation_support=N/A;intron_annotation_support=N/A;transcript_class=ortholog;transcript_modes=Liftoff;valid_start=N/A;valid_stop=N/A;proper_orf=N/A
chr1 Liftoff exon 146568094 146569221 . - . gene_name=NBPF14;source_gene=ENSG00000270629.6;gene_biotype=protein_coding;transcript_biotype=protein_coding;source_transcript=ENST00000619423.4;Name=NBPF14;source_gene_common_name=NBPF14;extra_paralog=False;gene_id=LOFF_G0000157;transcript_id=LOFF_T0000224;transcript_name=NBPF14-0;Parent=LOFF_T0000224;ID=exon:LOFF_T0000224:0;alignment_id=N/A;alternative_source_transcripts=N/A;paralogy=N/A;unfiltered_paralogy=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;frameshift=N/A;exon_annotation_support=N/A;intron_annotation_support=N/A;transcript_class=ortholog;transcript_modes=Liftoff;valid_start=N/A;valid_stop=N/A;proper_orf=N/A
chr1 Liftoff three_prime_UTR 146568094 146569221 . - . gene_name=NBPF14;source_gene=ENSG00000270629.6;gene_biotype=protein_coding;transcript_biotype=protein_coding;source_transcript=ENST00000619423.4;Name=NBPF14;source_gene_common_name=NBPF14;extra_paralog=False;gene_id=LOFF_G0000157;transcript_id=LOFF_T0000224;transcript_name=NBPF14-0;Parent=LOFF_T0000224;ID=three_prime_UTR:LOFF_T0000224;alignment_id=N/A;alternative_source_transcripts=N/A;paralogy=N/A;unfiltered_paralogy=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;frameshift=N/A;exon_annotation_support=N/A;intron_annotation_support=N/A;transcript_class=ortholog;transcript_modes=Liftoff;valid_start=N/A;valid_stop=N/A;proper_orf=N/A
chr16 Liftoff transcript 13397099 13397303 . - . gene_name=LINC02851;source_gene=ENSG00000229611.2;gene_biotype=lncRNA;transcript_biotype=lncRNA;source_transcript=ENST00000664463.1;Name=LINC02851;source_gene_common_name=LINC02851;extra_paralog=False;gene_id=LOFF_G0001003;Parent=LOFF_G0001003;transcript_id=LOFF_T0001232;transcript_name=LINC02851-1;ID=LOFF_T0001232;alignment_id=N/A;alternative_source_transcripts=N/A;paralogy=N/A;unfiltered_paralogy=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;frameshift=N/A;exon_annotation_support=N/A;intron_annotation_support=N/A;transcript_class=ortholog;transcript_modes=Liftoff;valid_start=N/A;valid_stop=N/A;proper_orf=N/A
chr16 Liftoff exon 13397099 13397303 . - . gene_name=LINC02851;source_gene=ENSG00000229611.2;gene_biotype=lncRNA;transcript_biotype=lncRNA;source_transcript=ENST00000664463.1;Name=LINC02851;source_gene_common_name=LINC02851;extra_paralog=False;gene_id=LOFF_G0001003;transcript_id=LOFF_T0001232;transcript_name=LINC02851-0;Parent=LOFF_T0001232;ID=exon:LOFF_T0001232:0;alignment_id=N/A;alternative_source_transcripts=N/A;paralogy=N/A;unfiltered_paralogy=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;frameshift=N/A;exon_annotation_support=N/A;intron_annotation_support=N/A;transcript_class=ortholog;transcript_modes=Liftoff;valid_start=N/A;valid_stop=N/A;proper_orf=N/A
Should they be merged? If so, what biotype should be given?
There may be other cases that need further attention, hence the request to the original authors to generate a GTF version as well.
Best Regards
SM
I don't understand the issue with your example. They are on different chromosomes and transcripts of different genes. Can you explain in more detail?
Hi @diekhans, thanks for replying. The examples above are two different cases of the same issue.
Let's take the first. The features have the same start and end; should they be collapsed? If so, what should the biotype be?
chr1 Liftoff transcript 146568094 146569221 . - . gene_name=NBPF14;source_gene=ENSG00000270629.6;gene_biotype=protein_coding;transcript_biotype=protein_coding;source_transcript=ENST00000619423.4;Name=NBPF14;source_gene_common_name=NBPF14;extra_paralog=False;gene_id=LOFF_G0000157;Parent=LOFF_G0000157;transcript_id=LOFF_T0000224;transcript_name=NBPF14-1;ID=LOFF_T0000224;alignment_id=N/A;alternative_source_transcripts=N/A;paralogy=N/A;unfiltered_paralogy=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;frameshift=N/A;exon_annotation_support=N/A;intron_annotation_support=N/A;transcript_class=ortholog;transcript_modes=Liftoff;valid_start=N/A;valid_stop=N/A;proper_orf=N/A
chr1 Liftoff exon 146568094 146569221 . - . gene_name=NBPF14;source_gene=ENSG00000270629.6;gene_biotype=protein_coding;transcript_biotype=protein_coding;source_transcript=ENST00000619423.4;Name=NBPF14;source_gene_common_name=NBPF14;extra_paralog=False;gene_id=LOFF_G0000157;transcript_id=LOFF_T0000224;transcript_name=NBPF14-0;Parent=LOFF_T0000224;ID=exon:LOFF_T0000224:0;alignment_id=N/A;alternative_source_transcripts=N/A;paralogy=N/A;unfiltered_paralogy=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;frameshift=N/A;exon_annotation_support=N/A;intron_annotation_support=N/A;transcript_class=ortholog;transcript_modes=Liftoff;valid_start=N/A;valid_stop=N/A;proper_orf=N/A
chr1 Liftoff three_prime_UTR 146568094 146569221 . - . gene_name=NBPF14;source_gene=ENSG00000270629.6;gene_biotype=protein_coding;transcript_biotype=protein_coding;source_transcript=ENST00000619423.4;Name=NBPF14;source_gene_common_name=NBPF14;extra_paralog=False;gene_id=LOFF_G0000157;transcript_id=LOFF_T0000224;transcript_name=NBPF14-0;Parent=LOFF_T0000224;ID=three_prime_UTR:LOFF_T0000224;alignment_id=N/A;alternative_source_transcripts=N/A;paralogy=N/A;unfiltered_paralogy=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;frameshift=N/A;exon_annotation_support=N/A;intron_annotation_support=N/A;transcript_class=ortholog;transcript_modes=Liftoff;valid_start=N/A;valid_stop=N/A;proper_orf=N/A
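To see how often this pattern occurs, a small Python sketch such as the following could list every transcript that has more than one feature with identical coordinates. The function name is hypothetical, and it assumes tab-separated GFF3 lines that carry a `transcript_id` attribute, as the entries above do. Note that in GFF3 a transcript, its exon, and its UTR are distinct feature types linked through `Parent`, so a single-exon transcript shares its coordinates with its only exon.

```python
from collections import defaultdict


def features_sharing_coordinates(lines):
    """Return {(seqid, start, end, strand, transcript_id): [feature types]}
    for every transcript with more than one feature at the same coordinates."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF3 directives/comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        transcript_id = attrs.get("transcript_id")
        if transcript_id is None:
            continue  # e.g. gene-level records
        key = (cols[0], int(cols[3]), int(cols[4]), cols[6], transcript_id)
        groups[key].append(cols[2])
    return {key: types for key, types in groups.items() if len(types) > 1}
```

Applied to the three chr1 lines above, this would group `transcript`, `exon`, and `three_prime_UTR` under `LOFF_T0000224`, which shows they describe one transcript rather than three competing records.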
I see, thanks!