Usage with Ensembl GTF or converted refFlat

Question

Usage with Ensembl GTF or converted refFlat

harish0201 opened this issue 5 years ago · 8 comments

Hi David,

Hope you are doing well!

I have a question which has left me a bit stumped. We are studying the Indian variety of rice whose annotations are available in ENSEMBL.

I have provided the GTF, as converting to refFlat wasn't really working, however it seems that the number of features that are being read are a bit too much?

There are 255844 genes and 2140 chromosomes read from ../ref/Oryza_indica.ASM465v1.43.gtf (ftp://ftp.ensemblgenomes.org/pub/release-43/plants/gtf/oryza_indica/Oryza_indica.ASM465v1.43.chr.gtf.gz)

I believe that there should be 40320-ish genes.

I'm converting it to refFlat using gtfToGenePred, where in it only seems to detect 2262 genes. I'm a bit confused about this disparity. I'm using the following to convert:

/apps/kent_utils/gtfToGenePred -genePredExt -geneNameAsName2 Oryza_indica.ASM465v1.43.gtf reflat.txt

This seems to throw an error:
Couldn't recognize annotation file format with strand_definition = 3 @ annotation.h line 112

which is kinda confusing. Do you have any ideas?

Answer 1 · 2019-08-20T09:50:53.000Z

Oh well, it seems that some post-processing is needed on the gtfToGenePred generated refFlat.

The steps are as follows:

Generate the "gene-pred" formatted refFlat:
gtfToGenePred -genePredExt -geneNameAsName2 genome.gtf reflat.txt
generate a gene-transcript map

awk -F'\t' '{if($3=="transcript")print $NF}' genome.gtf | cut -d';' -f1-2| sed 's/gene_id "\|transcript_id "\|"//g' | sed 's/; /\t/g' | sort -Vk2 > gene_trans

merge the gene-transcript map to the "gene-pred" refFlat:
join -1 2 -2 1 -t$'\t' gent_trans sorted.reflat.txt | awk -F'\t' 'BEGIN{OFS="\t"}{print $2,$1,$3,$4,$5,$6,$7,$8,$9,$10,$11}' > final_flat.txt

This should work ideally for GTFs obtained from ENSEMBL, but please do let me know in case this generally doesn't work.

Edit: The modified refFlat does allow defiant to recognize all the genes.

Answer 2 · 2019-08-20T20:37:15.000Z

Hi @harish0201 thanks for pointing all of this out! let me know if there is anything that I can do to help

Answer 3 · 2022-04-11T11:10:31.000Z

I don't know if this is related, but when I run 2 sets of triplicates defiant runs without problems.
But when I add the -a Homo_sapiens.GRCh38.Ensembl90.gtf option, it crashes.

Reading gene annotation ./Homo_sapiens.GRCh38.Ensembl90.gtf
There are 258612 genes and 47 chromosomes read from ./Homo_sapiens.GRCh38.Ensembl90.gtf.
There are 2 sets to be read
There will be 1 runs.
Run information will be in defiant_run_info.txt

***WARNING***
There is/are 0 label(s), yet 2 sets, just a warning that you may not have correctly labeled data.
***WARNING***

Now reading file ./SRR949210.readset_sorted.dedup.map.input into memory...
Now reading file ./SRR949211.readset_sorted.dedup.map.input into memory...
Now reading file ./SRR949212.readset_sorted.dedup.map.input into memory...
Now reading file ./SRR949213.readset_sorted.dedup.map.input into memory...
Now reading file ./SRR949214.readset_sorted.dedup.map.input into memory...
Now reading file ./SRR949215.readset_sorted.dedup.map.input into memory...
Details of the data have been written to defiant_run_info.txt.
There will be 1 runs split among 1 workers.

------
Beginning analysis.

Worker 1/1 is analyzing run 1/1:        sets 1 vs. 2    with cutoffs    c = 10  CpN = 5 d = 1   p = 0.05        P = 10.000000                      S = 0
Now analyzing chromosome 1, at iteration loop 1/25
Segmentation fault

Answer 4 · 2022-04-11T15:13:00.000Z

Hi @marcovth

please try the following code on your gtf to translate to a different format, running the following script gtf2refFlat.pl thus:
perl gtf2reflat.pl Homo_sapiens.GRCh38.Ensembl90.gtf

#!/usr/bin/env perl

use strict;
use warnings FATAL => 'all';
use feature 'say';
use autodie ':all';
use Scalar::Util 'looks_like_number';

open my $gtf, '>', $ARGV[0];
open my $refFlat, '>', "$ARGV[0].refFlat";
while (<$gtf>) {
	next if /^#/; # skip comment lines
	my @l = split /\t/;
	my $chr = $l[0];
	my $start = $l[3];
	my $end = $l[4];
	foreach my $n ($start, $end) {
		if (not looks_like_number($n)) {
			die "$n isn't a number.";
		}
	}
	my $sense = $l[6];
	if ($sense !~ m/^[+-]$/) {
		die "$sense is nother \"+\" nor \"-\".";
	}
	
	my $gene_id;
	if ($l[8] =~ m/gene_id "([^"]+)/) {
		$gene_id = $1
	} else {
		die "Can't find a gene in $_";
	}
	my $name;
	if ($l[8] =~ m/gene_name "([^"]+)/) {
		$name = $1
	} else {
		$name = $gene_id
	}
#	say $l[8]; exit;
	say $refFlat join ("\t", $name, $gene_id, $chr, $sense, $start, $end);
}

please let me know if this works.

Answer 5 · 2022-04-11T19:54:41.000Z

Please change open my $gtf, '>', $ARGV[0]; for open my $gtf, '<', $ARGV[0];

It works better, but still a crash ...


Now analyzing chromosome 1, at iteration loop 1/25
Now analyzing chromosome 10, at iteration loop 2/25
Now analyzing chromosome 11, at iteration loop 3/25
Now analyzing chromosome 12, at iteration loop 4/25
Now analyzing chromosome 13, at iteration loop 5/25
Now analyzing chromosome 14, at iteration loop 6/25
Now analyzing chromosome 15, at iteration loop 7/25
Now analyzing chromosome 16, at iteration loop 8/25
Now analyzing chromosome 17, at iteration loop 9/25
Now analyzing chromosome 18, at iteration loop 10/25
Now analyzing chromosome 19, at iteration loop 11/25
Now analyzing chromosome 2, at iteration loop 12/25
Now analyzing chromosome 20, at iteration loop 13/25
Now analyzing chromosome 21, at iteration loop 14/25
Now analyzing chromosome 22, at iteration loop 15/25
Now analyzing chromosome 3, at iteration loop 16/25
Now analyzing chromosome 4, at iteration loop 17/25
Now analyzing chromosome 5, at iteration loop 18/25
Now analyzing chromosome 6, at iteration loop 19/25
Now analyzing chromosome 7, at iteration loop 20/25
Now analyzing chromosome 8, at iteration loop 21/25
Now analyzing chromosome 9, at iteration loop 22/25
Now analyzing chromosome M, at iteration loop 23/25
Now analyzing chromosome X, at iteration loop 24/25
Now analyzing chromosome Y, at iteration loop 25/25
Answers written to set1_vs_set2_c10_CpN5_d1_p0.05_P10.tsv
A gene list, with intergenic genes identified with promoter cutoff = 10000 nucleotides has been written to set1_vs_set2_c10_CpN5_d1_p0.05_P10_methylation_up_gene_list.txt
A gene list, with intergenic genes identified with promoter cutoff = 10000 nucleotides has been written to set1_vs_set2_c10_CpN5_d1_p0.05_P10_methylation_down_gene_list.txt
        7042602/13828751 = 50.9% of nucleotides met minimum coverage and were present with minimum coverage = 10 in every sample.
**Segmentation fault**

Answer 6 · 2022-04-11T20:19:08.000Z

Would you be able to print out/attach the output from running with -debug?

the output appears to be being made, are the files that you get in output usable?

Answer 7 · 2022-04-12T12:47:18.000Z

Here are the last lines ...

just finished chromosome X @ line 2077
Now analyzing chromosome Y, at iteration loop 25/25
@line 1491 => minimum nucleotide (index 10259590) for set 0-replicate 0 is 2782960 @ chromosome_loop 24 = Y
@line 1491 => minimum nucleotide (index 10258963) for set 0-replicate 1 is 2782960 @ chromosome_loop 24 = Y
@line 1491 => minimum nucleotide (index 11300572) for set 0-replicate 2 is 2782659 @ chromosome_loop 24 = Y
@line 1491 => minimum nucleotide (index 10630680) for set 1-replicate 0 is 2785510 @ chromosome_loop 24 = Y
@line 1491 => minimum nucleotide (index 10518286) for set 1-replicate 1 is 2782960 @ chromosome_loop 24 = Y
@line 1491 => minimum nucleotide (index 10579986) for set 1-replicate 2 is 2782960 @ chromosome_loop 24 = Y
@line 1568, max_gene_search_index = 4294967295
@ line 1586, GENES_chromosome_change_index = 24 -> searching 121966-122692 indices
@ line 1587 -> 1st nucleotide = 2752083; max nucleotide in chromosome = 0
@line 1596
just finished chromosome Y @ line 2077
Answers written to set1_vs_set2_c10_CpN5_d1_p0.05_P10.tsv
A gene list, with intergenic genes identified with promoter cutoff = 10000 nucleotides has been written to set1_vs_set2_c10_CpN5_d1_p0.05_P10_methylation_up_gene_list.txt
A gene list, with intergenic genes identified with promoter cutoff = 10000 nucleotides has been written to set1_vs_set2_c10_CpN5_d1_p0.05_P10_methylation_down_gene_list.txt
        7042602/13828751 = 50.9% of nucleotides met minimum coverage and were present with minimum coverage = 10 in every sample.
Segmentation fault

Answer 8 · 2022-04-12T13:17:18.000Z

I'm glad that you're getting the output files at least.

It's difficult for me to debug this without having the exact data set that you're using, I cannot reproduce your error.

Is there some way that you could give a very small sample of your data, a minimally reproducible example so that I could debug?