cannot get accurate protein sequences from the gff file

Question

cannot get accurate protein sequences from the gff file

ATPs opened this issue 3 years ago · 2 comments

I tried to extract the CDS sequences from the GFF file:

gffread -g chm13.draft_v1.1.fasta -x cds.fa chm13.draft_v1.1.gene_annotation.v4.gff3

However, when I translate the CDS sequences to proteins, the open reading frame is incorrect for quite a few of them. Is there a way to download the predicted protein sequences?
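For readers who want to see which records are affected, here is a minimal sketch of an ORF sanity check over the extracted `cds.fa`. It is not part of the original post: the function names `read_fasta` and `orf_problems` are hypothetical, and the check assumes the standard genetic code (so mitochondrial genes and selenoproteins may be flagged even when they are fine).

```python
# Sketch only: flags CDS records that are not complete ORFs under the
# standard genetic code. Helper names are illustrative, not from any library.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def orf_problems(cds):
    """Return the reasons why `cds` is not a complete ORF (empty list if none)."""
    seq = cds.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length not a multiple of 3")
    if not seq.startswith("ATG"):
        problems.append("no ATG start codon")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no terminal stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("in-frame internal stop codon")
    return problems
```

To use it, open `cds.fa`, pass the file handle to `read_fasta`, and call `orf_problems` on each sequence; records that return a non-empty list are the ones whose translation will look wrong.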

Answer 1 · 2021-12-07T05:34:46.000Z

Hi @ATPs,

I created a file with the predicted protein sequences here that you can use: http://courtyard.gi.ucsc.edu/~mhauknes/T2T/chm13.draft_v1.1.gene_annotation.protein.fasta

Answer 2 · 2021-12-07T21:31:56.000Z

These incorrect open reading frames are to be expected from the GENCODE annotation; they are not errors. For example, many GENCODE transcripts carry tags such as cds_end_NF and cds_start_NF (CDS end or start not found), which mark fragments that were annotated (probably from ESTs) but lack sufficient evidence. These are propagated down into our gene annotations. If you want only transcripts with full, proper ORFs, you can ignore any transcript tagged proper_orf=False in the gff3.
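The filtering suggested above could be sketched roughly as follows. This is an illustration, not the annotation team's own tooling: the function names are hypothetical, and it assumes that `proper_orf=False` appears in column 9 of the transcript line and that child features (exons, CDS) point to their transcript through the standard GFF3 `Parent` attribute.

```python
# Sketch only: drop transcripts tagged proper_orf=False, plus their children.
# Assumes the tag sits on the transcript line and children use Parent=<ID>.

def parse_attributes(column9):
    """Parse a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def drop_improper_orfs(gff_lines):
    """Return the GFF3 lines with proper_orf=False features and children removed."""
    lines = list(gff_lines)

    # First pass: collect IDs of features explicitly tagged proper_orf=False.
    rejected = set()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue
        attrs = parse_attributes(cols[8])
        if attrs.get("proper_orf") == "False" and "ID" in attrs:
            rejected.add(attrs["ID"])

    # Second pass: keep comments and every feature not tied to a rejected ID.
    kept = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if not line.startswith("#") and len(cols) == 9:
            attrs = parse_attributes(cols[8])
            parents = attrs.get("Parent", "").split(",")
            if (attrs.get("proper_orf") == "False"
                    or attrs.get("ID") in rejected
                    or any(parent in rejected for parent in parents)):
                continue
        kept.append(line)
    return kept
```

Writing the returned lines to a new gff3 and rerunning the gffread command from the question on that file should then yield only CDS records from transcripts marked as having a proper ORF. Note that gene lines are kept even if all of their transcripts were removed.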