marbl/CHM13

GTF format of the original GFF

sum732 opened this issue · 7 comments

sum732 commented

Hello,

Some tools, such as Pigeon/SQANTI3 for bulk RNA ISO-Seq, require a GTF file.

Can you please provide a GTF version of chm13.draft_v2.0.gene_annotation.gff3? Original link:
https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/annotation/chm13.draft_v2.0.gene_annotation.gff3

I checked other GTF versions, such as those from https://projects.ensembl.org/hprc/ and the UCSC Table Browser. None of them has the same level of detail that is present in the original GFF3. For example, I cannot find the following entry in any of the GTF files from other sources:
chr1 CAT gene 97934895 97937928 . + . source_gene_common_name=MSTRG.282;source_gene=None;gene_biotype=StringTie;gene_id=CHM13_G0002360;gene_name=MSTRG.282;transcript_modes=exRef;ID=CHM13_G0002360;Name=MSTRG.282;source_transcript=N/A;alternative_source_transcripts=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;paralogy=N/A;unfiltered_paralogy=N/A;alignment_id=N/A;frameshift=N/A;exon_anotation_support=N/A;intron_annotation_support=N/A;transcript_class=N/A;valid_start=N/A;valid_stop=N/A;proper_orf=N/A;extra_paralog=False

I tried converting the GFF3 to GTF using agat and sorting the result, but Pigeon does not accept it. I tried a few other options, but none of them work.
It would be great to have a GTF version of the file chm13.draft_v2.0.gene_annotation.gff3.
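
For readers attempting the conversion themselves, the following is a minimal Python sketch of one way to rewrite the ninth column from GFF3 style (`key=value;`) to GTF style (`key "value";`) while keeping every attribute. The function names are hypothetical, the input is assumed to be tab-separated with nine columns, and feature types and coordinates are passed through unchanged. It is not the method used by agat or by the annotation authors, and whether Pigeon or SQANTI3 accept the output has not been tested.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9: str) -> dict:
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        # GFF3 percent-encodes reserved characters; decode them for GTF.
        attrs[key.strip()] = unquote(value.strip())
    return attrs


def format_gtf_attributes(attrs: dict) -> str:
    """Render attributes GTF-style, placing gene_id and transcript_id first."""
    leading = [k for k in ("gene_id", "transcript_id") if k in attrs]
    rest = [k for k in attrs if k not in leading]
    return " ".join(f'{k} "{attrs[k]}";' for k in leading + rest)


def gff3_line_to_gtf(line: str) -> str:
    """Convert one tab-separated GFF3 feature line; only column 9 changes."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    cols[8] = format_gtf_attributes(parse_gff3_attributes(cols[8]))
    return "\t".join(cols)


# Abbreviated version of the gene entry quoted above (assumed tab-separated).
example = "\t".join([
    "chr1", "CAT", "gene", "97934895", "97937928", ".", "+", ".",
    "gene_biotype=StringTie;gene_id=CHM13_G0002360;"
    "gene_name=MSTRG.282;extra_paralog=False",
])
print(gff3_line_to_gtf(example))
```

Because all keys are carried over, fields such as `gene_biotype` and `extra_paralog` survive the conversion; a downstream tool may still reject the file for other reasons, for example feature types it does not recognize.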

Many Thanks
SM

sum732 commented

Hi @diekhans,
Thanks for replying.

Indeed, it is one of the options, but most of the details are ignored. Example here

Also, in the original GFF3 there are the following entries; please notice the START and END of the first 3 and the next 2:

chr1    Liftoff transcript      146568094       146569221       .       -       .       gene_name=NBPF14;source_gene=ENSG00000270629.6;gene_biotype=protein_coding;transcript_biotype=protein_coding;source_transcript=ENST00000619423.4;Name=NBPF14;source_gene_common_name=NBPF14;extra_paralog=False;gene_id=LOFF_G0000157;Parent=LOFF_G0000157;transcript_id=LOFF_T0000224;transcript_name=NBPF14-1;ID=LOFF_T0000224;alignment_id=N/A;alternative_source_transcripts=N/A;paralogy=N/A;unfiltered_paralogy=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;frameshift=N/A;exon_annotation_support=N/A;intron_annotation_support=N/A;transcript_class=ortholog;transcript_modes=Liftoff;valid_start=N/A;valid_stop=N/A;proper_orf=N/A
chr1    Liftoff exon    146568094       146569221       .       -       .       gene_name=NBPF14;source_gene=ENSG00000270629.6;gene_biotype=protein_coding;transcript_biotype=protein_coding;source_transcript=ENST00000619423.4;Name=NBPF14;source_gene_common_name=NBPF14;extra_paralog=False;gene_id=LOFF_G0000157;transcript_id=LOFF_T0000224;transcript_name=NBPF14-0;Parent=LOFF_T0000224;ID=exon:LOFF_T0000224:0;alignment_id=N/A;alternative_source_transcripts=N/A;paralogy=N/A;unfiltered_paralogy=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;frameshift=N/A;exon_annotation_support=N/A;intron_annotation_support=N/A;transcript_class=ortholog;transcript_modes=Liftoff;valid_start=N/A;valid_stop=N/A;proper_orf=N/A
chr1    Liftoff three_prime_UTR 146568094       146569221       .       -       .       gene_name=NBPF14;source_gene=ENSG00000270629.6;gene_biotype=protein_coding;transcript_biotype=protein_coding;source_transcript=ENST00000619423.4;Name=NBPF14;source_gene_common_name=NBPF14;extra_paralog=False;gene_id=LOFF_G0000157;transcript_id=LOFF_T0000224;transcript_name=NBPF14-0;Parent=LOFF_T0000224;ID=three_prime_UTR:LOFF_T0000224;alignment_id=N/A;alternative_source_transcripts=N/A;paralogy=N/A;unfiltered_paralogy=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;frameshift=N/A;exon_annotation_support=N/A;intron_annotation_support=N/A;transcript_class=ortholog;transcript_modes=Liftoff;valid_start=N/A;valid_stop=N/A;proper_orf=N/A

chr16   Liftoff transcript      13397099        13397303        .       -       .       gene_name=LINC02851;source_gene=ENSG00000229611.2;gene_biotype=lncRNA;transcript_biotype=lncRNA;source_transcript=ENST00000664463.1;Name=LINC02851;source_gene_common_name=LINC02851;extra_paralog=False;gene_id=LOFF_G0001003;Parent=LOFF_G0001003;transcript_id=LOFF_T0001232;transcript_name=LINC02851-1;ID=LOFF_T0001232;alignment_id=N/A;alternative_source_transcripts=N/A;paralogy=N/A;unfiltered_paralogy=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;frameshift=N/A;exon_annotation_support=N/A;intron_annotation_support=N/A;transcript_class=ortholog;transcript_modes=Liftoff;valid_start=N/A;valid_stop=N/A;proper_orf=N/A
chr16   Liftoff exon    13397099        13397303        .       -       .       gene_name=LINC02851;source_gene=ENSG00000229611.2;gene_biotype=lncRNA;transcript_biotype=lncRNA;source_transcript=ENST00000664463.1;Name=LINC02851;source_gene_common_name=LINC02851;extra_paralog=False;gene_id=LOFF_G0001003;transcript_id=LOFF_T0001232;transcript_name=LINC02851-0;Parent=LOFF_T0001232;ID=exon:LOFF_T0001232:0;alignment_id=N/A;alternative_source_transcripts=N/A;paralogy=N/A;unfiltered_paralogy=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;frameshift=N/A;exon_annotation_support=N/A;intron_annotation_support=N/A;transcript_class=ortholog;transcript_modes=Liftoff;valid_start=N/A;valid_stop=N/A;proper_orf=N/A

Should they be merged? If so, what biotype should be given?
What else could be there that needs further attention? Hence the request to the original authors to generate a GTF version as well.

Best Regards
SM

sum732 commented

I don't understand the issue with your example. They are on different chromosomes and transcripts of different genes. Can you explain in more detail?

Hi @diekhans, thanks for replying. The examples above are two different things with the same issue.
Let's take the first. Same start and end: should these be collapsed? And if so, what should the biotype be?

chr1    Liftoff transcript      146568094       146569221       .       -       .       gene_name=NBPF14;source_gene=ENSG00000270629.6;gene_biotype=protein_coding;transcript_biotype=protein_coding;source_transcript=ENST00000619423.4;Name=NBPF14;source_gene_common_name=NBPF14;extra_paralog=False;gene_id=LOFF_G0000157;Parent=LOFF_G0000157;transcript_id=LOFF_T0000224;transcript_name=NBPF14-1;ID=LOFF_T0000224;alignment_id=N/A;alternative_source_transcripts=N/A;paralogy=N/A;unfiltered_paralogy=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;frameshift=N/A;exon_annotation_support=N/A;intron_annotation_support=N/A;transcript_class=ortholog;transcript_modes=Liftoff;valid_start=N/A;valid_stop=N/A;proper_orf=N/A
chr1    Liftoff exon    146568094       146569221       .       -       .       gene_name=NBPF14;source_gene=ENSG00000270629.6;gene_biotype=protein_coding;transcript_biotype=protein_coding;source_transcript=ENST00000619423.4;Name=NBPF14;source_gene_common_name=NBPF14;extra_paralog=False;gene_id=LOFF_G0000157;transcript_id=LOFF_T0000224;transcript_name=NBPF14-0;Parent=LOFF_T0000224;ID=exon:LOFF_T0000224:0;alignment_id=N/A;alternative_source_transcripts=N/A;paralogy=N/A;unfiltered_paralogy=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;frameshift=N/A;exon_annotation_support=N/A;intron_annotation_support=N/A;transcript_class=ortholog;transcript_modes=Liftoff;valid_start=N/A;valid_stop=N/A;proper_orf=N/A
chr1    Liftoff three_prime_UTR 146568094       146569221       .       -       .       gene_name=NBPF14;source_gene=ENSG00000270629.6;gene_biotype=protein_coding;transcript_biotype=protein_coding;source_transcript=ENST00000619423.4;Name=NBPF14;source_gene_common_name=NBPF14;extra_paralog=False;gene_id=LOFF_G0000157;transcript_id=LOFF_T0000224;transcript_name=NBPF14-0;Parent=LOFF_T0000224;ID=three_prime_UTR:LOFF_T0000224;alignment_id=N/A;alternative_source_transcripts=N/A;paralogy=N/A;unfiltered_paralogy=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;frameshift=N/A;exon_annotation_support=N/A;intron_annotation_support=N/A;transcript_class=ortholog;transcript_modes=Liftoff;valid_start=N/A;valid_stop=N/A;proper_orf=N/A
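
One way to inspect records like these is to group features by coordinates and compare their type, `ID`, and `Parent` attributes. The Python sketch below does that for abbreviated copies of the three NBPF14 lines; the helper names are hypothetical and the input is assumed to be tab-separated. In these records the exon and three_prime_UTR lines name the transcript as their `Parent`, which is consistent with one single-exon transcript described at three levels rather than three duplicate entries. That reading is an inference from the attributes shown, not a statement from the annotation authors.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Turn a GFF3 attribute column into a dict of key -> value."""
    return dict(f.split("=", 1) for f in column9.strip().split(";") if "=" in f)


def features_by_span(lines):
    """Map (seqid, start, end, strand) -> list of (type, ID, Parent)."""
    spans = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF3 directives/comments
        cols = line.rstrip("\n").split("\t")
        attrs = parse_attributes(cols[8])
        key = (cols[0], int(cols[3]), int(cols[4]), cols[6])
        spans[key].append((cols[2], attrs.get("ID"), attrs.get("Parent")))
    return dict(spans)


def make_line(feature_type, attributes):
    """Build a tab-separated record with the NBPF14 coordinates from the thread."""
    return "\t".join(["chr1", "Liftoff", feature_type,
                      "146568094", "146569221", ".", "-", ".", attributes])


# Attribute columns abbreviated to the fields relevant to the hierarchy.
records = [
    make_line("transcript",
              "gene_id=LOFF_G0000157;Parent=LOFF_G0000157;ID=LOFF_T0000224"),
    make_line("exon",
              "gene_id=LOFF_G0000157;Parent=LOFF_T0000224;ID=exon:LOFF_T0000224:0"),
    make_line("three_prime_UTR",
              "gene_id=LOFF_G0000157;Parent=LOFF_T0000224;ID=three_prime_UTR:LOFF_T0000224"),
]

for span, features in features_by_span(records).items():
    print(span)
    for feature_type, feature_id, parent in features:
        print(f"  {feature_type}: ID={feature_id} Parent={parent}")
```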

sum732 commented

I see, thanks!