Problem using the provided GFF3 annotations
mvelinder opened this issue · 10 comments
I'm having trouble using the provided GFF3 annotations from https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/annotation/chm13.draft_v1.1.gene_annotation.v4.gff3.gz
Here's what I'm doing:
bcftools csq $VCF -f chm13.draft_v1.1.fasta -g chm13.draft_v1.1.gene_annotation.v4.gff3.gz -O z -o $VCF.csq.vcf.gz
And here's my ouput:
Parsing chm13.draft_v1.1.gene_annotation.v4.gff3.gz ...
ignored: chr1 CAT gene 14253 21099 . + . source_gene_common_name=AC114498.1;source_gene=ENSG00000235146.2;gene_biotype=lncRNA;gene_alternate_contigs=chr6:172104635-172111468;gene_id=CHM13_G0000001;gene_name=AC114498.1;transcript_modes=transMap;ID=CHM13_G0000001;Name=AC114498.1;source_transcript=N/A;alternative_source_transcripts=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;paralogy=N/A;unfiltered_paralogy=N/A;alignment_id=N/A;frameshift=N/A;exon_anotation_support=N/A;intron_annotation_support=N/A;transcript_class=N/A;valid_start=N/A;valid_stop=N/A;proper_orf=N/A;extra_paralog=False
ignored: chr1 CAT transcript 14253 21099 8940 + . source_transcript=ENST00000423796.1;source_transcript_name=AC114498.1-201;source_gene=ENSG00000235146.2;transcript_modes=transMap;gene_biotype=lncRNA;transcript_biotype=lncRNA;alignment_id=ENST00000423796.1-1;frameshift=nan;exon_annotation_support=1,1;intron_annotation_support=1;transcript_class=ortholog;valid_start=True;valid_stop=True;adj_start=nan;adj_stop=nan;proper_orf=True;level=2;transcript_support_level=5;tag=not_best_in_genome_evidence,basic;havana_gene=OTTHUMG00000002329.1;havana_transcript=OTTHUMT00000006707.1;paralogy=nan;unfiltered_paralogy=ENST00000423796.1-2;gene_alternate_contigs=chr6:172104635-172111468;source_gene_common_name=AC114498.1;transcript_id=CHM13_T0000001;gene_id=CHM13_G0000001;Parent=CHM13_G0000001;transcript_name=AC114498.1-201;ID=CHM13_T0000001;Name=AC114498.1;gene_name=CHM13_G0000001;alternative_source_transcripts=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;extra_paralog=False
[csq.c:715 gff_id_parse] Could not parse the line, "Parent=transcript:" not present: chr1 CAT exon 14253 14325 . + . source_transcript=ENST00000423796.1;source_transcript_name=AC114498.1-201;source_gene=ENSG00000235146.2;transcript_modes=transMap;gene_biotype=lncRNA;transcript_biotype=lncRNA;alignment_id=ENST00000423796.1-1;exon_annotation_support=1,1;intron_annotation_support=1;transcript_class=ortholog;valid_start=True;valid_stop=True;adj_start=nan;adj_stop=nan;proper_orf=True;level=2;transcript_support_level=5;tag=not_best_in_genome_evidence,basic;havana_gene=OTTHUMG00000002329.1;havana_transcript=OTTHUMT00000006707.1;paralogy=nan;unfiltered_paralogy=ENST00000423796.1-2;gene_alternate_contigs=chr6:172104635-172111468;source_gene_common_name=AC114498.1;transcript_id=CHM13_T0000001;gene_id=CHM13_G0000001;Parent=CHM13_T0000001;transcript_name=AC114498.1-201;ID=exon:CHM13_T0000001:0;Name=AC114498.1;rna_support=N/A;reference_support=True;gene_name=CHM13_G0000001;alternative_source_transcripts=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;frameshift=N/A;extra_paralog=False
Any guidance would be much appreciated! Thanks!
This appears to be a bug in bcftools. The record it is complaining about has Parent=CHM13_T0000001 and there is a transcript record ID=CHM13_T0000001
I have adjusted the gff3 with the following horrible code (to include "transcript:", "gene:", prefixes for Parent and ID:
zcat chm13.draft_v1.1.gene_annotation.v4.gff3.gz \
| gawk 'BEGIN{IGNORECASE=0} $0 ~/^#/ || ($3 != "gene" && $3 != "transcript" ) { $0=gensub(/Parent=(.+?_T[^;]+)/, "Parent=transcript:\\1", "g", $0); print; next} $0 !~/^#/ { gsub(/ID=/, "ID="$3":"); $0=gensub(/Parent=(.+?_G[^;]+)/, "Parent=gene:\\1", "g", $0); print }' \
| bgzip -c > chm13.draft_v1.1.gene_annotation.v4.fix.gff3.gz
but then I still get:
Error: GFF3 assumption failed for transcript CHM13_T0000003, CDS=111940: phase!=len%3 (phase=2, len=379). Use the --force option to proceed anyway (at your own risk).
from bcftools csq. Then ensembl gffs have an ensembl_phase=\d
.
Is it safe to use --force? What did you use to annotate VCFs called against this reference?
thanks.
Using @brentp 's fix I get the following. Looks like it's just ignoring everything? I get no file written to disk either
bcftools csq -f chm13.draft_v1.1.fasta -g chm13.draft_v1.1.gene_annotation.v4.fix.gff3.gz $VCF -O z -o $VCF.csq.gz
Example tail
of the output:
ignored: chr21 CAT intron 3130264 3130424 . + . source_transcript=ENST00000651813.1;source_transcript_name=FP236383.3-202;source_gene=ENSG00000280441.3;transcript_modes=transMap;gene_biotype=lncRNA;transcript_biotype=lncRNA;alignment_id=ENST00000651813.1-8;exon_annotation_support=1,1,1,1,1;intron_annotation_support=1,1,1,1;transcript_class=ortholog;valid_start=True;valid_stop=True;adj_start=nan;adj_stop=nan;proper_orf=True;level=2;tag=basic;havana_gene=OTTHUMG00000189419.3;havana_transcript=OTTHUMT00000504484.1;paralogy=nan;unfiltered_paralogy=ENST00000651813.1-5;gene_alternate_contigs=chr13:9846569-9851263,chr13:9891327-9896019,chr13:9936209-9940881,chr13:9985468-9990129,chr14:2095590-2099024,chr15:2501991-2530372,chr21:6313707-6318384,chr22:5713991-5718656;possible_split_gene_locations=chr21:6313707-6318384,chr22:5713991-5718656,chr13:9846569-9851263,chr13:9891327-9896019,chr13:9936209-9940881,chr13:9985468-9990129,chr14:2095590-2099024;source_gene_common_name=FP236383.3;transcript_id=CHM13_T0140977;gene_id=CHM13_G0035621;Parent=transcript:CHM13_T0140977;transcript_name=FP236383.3-202;ID=intron:CHM13_T0140977:1;Name=FP236383.3;rna_support=N/A;reference_support=True;gene_name=CHM13_G0035621;alternative_source_transcripts=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;frameshift=N/A;extra_paralog=False;nucmer_liftover=complete;nucmer_liftover_ed=11
ignored: chr21 CAT intron 3130517 3130744 . + . source_transcript=ENST00000651813.1;source_transcript_name=FP236383.3-202;source_gene=ENSG00000280441.3;transcript_modes=transMap;gene_biotype=lncRNA;transcript_biotype=lncRNA;alignment_id=ENST00000651813.1-8;exon_annotation_support=1,1,1,1,1;intron_annotation_support=1,1,1,1;transcript_class=ortholog;valid_start=True;valid_stop=True;adj_start=nan;adj_stop=nan;proper_orf=True;level=2;tag=basic;havana_gene=OTTHUMG00000189419.3;havana_transcript=OTTHUMT00000504484.1;paralogy=nan;unfiltered_paralogy=ENST00000651813.1-5;gene_alternate_contigs=chr13:9846569-9851263,chr13:9891327-9896019,chr13:9936209-9940881,chr13:9985468-9990129,chr14:2095590-2099024,chr15:2501991-2530372,chr21:6313707-6318384,chr22:5713991-5718656;possible_split_gene_locations=chr21:6313707-6318384,chr22:5713991-5718656,chr13:9846569-9851263,chr13:9891327-9896019,chr13:9936209-9940881,chr13:9985468-9990129,chr14:2095590-2099024;source_gene_common_name=FP236383.3;transcript_id=CHM13_T0140977;gene_id=CHM13_G0035621;Parent=transcript:CHM13_T0140977;transcript_name=FP236383.3-202;ID=intron:CHM13_T0140977:2;Name=FP236383.3;rna_support=N/A;reference_support=True;gene_name=CHM13_G0035621;alternative_source_transcripts=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;frameshift=N/A;extra_paralog=False;nucmer_liftover=3130750;nucmer_liftover_ed=3130523
ignored: chr22 CAT intron 5731187 5749914 . - . source_transcript=ENST00000623236.1;source_transcript_name=CR392039.4-201;source_gene=ENSG00000279501.1;transcript_modes=transMap;gene_biotype=lncRNA;transcript_biotype=lncRNA;alignment_id=ENST00000623236.1-1;exon_annotation_support=1,1;intron_annotation_support=1;transcript_class=ortholog;valid_start=True;valid_stop=True;adj_start=nan;adj_stop=nan;proper_orf=True;level=2;transcript_support_level=5;tag=basic;havana_gene=OTTHUMG00000189478.1;havana_transcript=OTTHUMT00000479772.1;paralogy=nan;unfiltered_paralogy=ENST00000623236.1-0,ENST00000623236.1-2,ENST00000623236.1-3,ENST00000623236.1-4;source_gene_common_name=CR392039.4;transcript_id=CHM13_T0143797;gene_id=CHM13_G0036414;Parent=transcript:CHM13_T0143797;transcript_name=CR392039.4-201;ID=intron:CHM13_T0143797:0;Name=CR392039.4;rna_support=N/A;reference_support=True;gene_name=CHM13_G0036414;alternative_source_transcripts=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;frameshift=N/A;extra_paralog=False;nucmer_liftover=complete;nucmer_liftover_ed=3
@mvelinder I saw that as well, but it's ignoring all introns, I think that's OK since those can be inferred.
It seems the phase error is the one that remains.
@diekhans To Brent's question, if this is a bcftools problem, what is the recommended way to annotate variant calls on CHM13?
Thanks for the guidance!
Is there any updates on this issue?
Thank you very much!
DK
bcftools recommendation is to modify:
https://github.com/samtools/bcftools/blob/develop/misc/gff2gff.py
to support other GFF3 files. Converting the biotypes should be straightforward, as mostly the GENCODE/Ensembl ones are used. For the additional ones added by CAT, they can be mapped to Ensembl ones
@diekhans To Brent's question, if this is a bcftools problem, what is the recommended way to annotate variant calls on CHM13?
I am not familiar with the current variant calling tools, so I can't make a recommendation. However, it seems that modifying gff2gff is the recommended way to use bcftools.
I got this to work using VEP instead of bcftools csq
vep -gff /path/to/chm13.gff.gz
https://uswest.ensembl.org/info/docs/tools/vep/script/vep_cache.html#gff