Usage with Ensembl GTF or converted refFlat
harish0201 opened this issue · 8 comments
Hi David,
Hope you are doing well!
I have a question which has left me a bit stumped. We are studying the Indian variety of rice whose annotations are available in ENSEMBL.
I have provided the GTF, as converting to refFlat wasn't really working, however it seems that the number of features that are being read are a bit too much?
There are 255844 genes and 2140 chromosomes read from ../ref/Oryza_indica.ASM465v1.43.gtf
(ftp://ftp.ensemblgenomes.org/pub/release-43/plants/gtf/oryza_indica/Oryza_indica.ASM465v1.43.chr.gtf.gz)
I believe that there should be 40320-ish genes.
I'm converting it to refFlat using gtfToGenePred, where in it only seems to detect 2262 genes. I'm a bit confused about this disparity. I'm using the following to convert:
/apps/kent_utils/gtfToGenePred -genePredExt -geneNameAsName2 Oryza_indica.ASM465v1.43.gtf reflat.txt
This seems to throw an error:
Couldn't recognize annotation file format with strand_definition = 3 @ annotation.h line 112
which is kinda confusing. Do you have any ideas?
Oh well, it seems that some post-processing is needed on the gtfToGenePred generated refFlat.
The steps are as follows:
-
Generate the "gene-pred" formatted refFlat:
gtfToGenePred -genePredExt -geneNameAsName2 genome.gtf reflat.txt
-
generate a gene-transcript map
awk -F'\t' '{if($3=="transcript")print $NF}' genome.gtf | cut -d';' -f1-2| sed 's/gene_id "\|transcript_id "\|"//g' | sed 's/; /\t/g' | sort -Vk2 > gene_trans
- merge the gene-transcript map to the "gene-pred" refFlat:
join -1 2 -2 1 -t$'\t' gent_trans sorted.reflat.txt | awk -F'\t' 'BEGIN{OFS="\t"}{print $2,$1,$3,$4,$5,$6,$7,$8,$9,$10,$11}' > final_flat.txt
This should work ideally for GTFs obtained from ENSEMBL, but please do let me know in case this generally doesn't work.
Edit: The modified refFlat does allow defiant to recognize all the genes.
Hi @harish0201 thanks for pointing all of this out! let me know if there is anything that I can do to help
I don't know if this is related, but when I run 2 sets of triplicates defiant runs without problems.
But when I add the -a Homo_sapiens.GRCh38.Ensembl90.gtf option, it crashes.
Reading gene annotation ./Homo_sapiens.GRCh38.Ensembl90.gtf
There are 258612 genes and 47 chromosomes read from ./Homo_sapiens.GRCh38.Ensembl90.gtf.
There are 2 sets to be read
There will be 1 runs.
Run information will be in defiant_run_info.txt
***WARNING***
There is/are 0 label(s), yet 2 sets, just a warning that you may not have correctly labeled data.
***WARNING***
Now reading file ./SRR949210.readset_sorted.dedup.map.input into memory...
Now reading file ./SRR949211.readset_sorted.dedup.map.input into memory...
Now reading file ./SRR949212.readset_sorted.dedup.map.input into memory...
Now reading file ./SRR949213.readset_sorted.dedup.map.input into memory...
Now reading file ./SRR949214.readset_sorted.dedup.map.input into memory...
Now reading file ./SRR949215.readset_sorted.dedup.map.input into memory...
Details of the data have been written to defiant_run_info.txt.
There will be 1 runs split among 1 workers.
------
Beginning analysis.
Worker 1/1 is analyzing run 1/1: sets 1 vs. 2 with cutoffs c = 10 CpN = 5 d = 1 p = 0.05 P = 10.000000 S = 0
Now analyzing chromosome 1, at iteration loop 1/25
Segmentation fault
Hi @marcovth
please try the following code on your gtf to translate to a different format, running the following script gtf2refFlat.pl
thus:
perl gtf2reflat.pl Homo_sapiens.GRCh38.Ensembl90.gtf
#!/usr/bin/env perl
use strict;
use warnings FATAL => 'all';
use feature 'say';
use autodie ':all';
use Scalar::Util 'looks_like_number';
open my $gtf, '>', $ARGV[0];
open my $refFlat, '>', "$ARGV[0].refFlat";
while (<$gtf>) {
next if /^#/; # skip comment lines
my @l = split /\t/;
my $chr = $l[0];
my $start = $l[3];
my $end = $l[4];
foreach my $n ($start, $end) {
if (not looks_like_number($n)) {
die "$n isn't a number.";
}
}
my $sense = $l[6];
if ($sense !~ m/^[+-]$/) {
die "$sense is nother \"+\" nor \"-\".";
}
my $gene_id;
if ($l[8] =~ m/gene_id "([^"]+)/) {
$gene_id = $1
} else {
die "Can't find a gene in $_";
}
my $name;
if ($l[8] =~ m/gene_name "([^"]+)/) {
$name = $1
} else {
$name = $gene_id
}
# say $l[8]; exit;
say $refFlat join ("\t", $name, $gene_id, $chr, $sense, $start, $end);
}
please let me know if this works.
Please change open my $gtf, '>', $ARGV[0]; for open my $gtf, '<', $ARGV[0];
It works better, but still a crash ...
Now analyzing chromosome 1, at iteration loop 1/25
Now analyzing chromosome 10, at iteration loop 2/25
Now analyzing chromosome 11, at iteration loop 3/25
Now analyzing chromosome 12, at iteration loop 4/25
Now analyzing chromosome 13, at iteration loop 5/25
Now analyzing chromosome 14, at iteration loop 6/25
Now analyzing chromosome 15, at iteration loop 7/25
Now analyzing chromosome 16, at iteration loop 8/25
Now analyzing chromosome 17, at iteration loop 9/25
Now analyzing chromosome 18, at iteration loop 10/25
Now analyzing chromosome 19, at iteration loop 11/25
Now analyzing chromosome 2, at iteration loop 12/25
Now analyzing chromosome 20, at iteration loop 13/25
Now analyzing chromosome 21, at iteration loop 14/25
Now analyzing chromosome 22, at iteration loop 15/25
Now analyzing chromosome 3, at iteration loop 16/25
Now analyzing chromosome 4, at iteration loop 17/25
Now analyzing chromosome 5, at iteration loop 18/25
Now analyzing chromosome 6, at iteration loop 19/25
Now analyzing chromosome 7, at iteration loop 20/25
Now analyzing chromosome 8, at iteration loop 21/25
Now analyzing chromosome 9, at iteration loop 22/25
Now analyzing chromosome M, at iteration loop 23/25
Now analyzing chromosome X, at iteration loop 24/25
Now analyzing chromosome Y, at iteration loop 25/25
Answers written to set1_vs_set2_c10_CpN5_d1_p0.05_P10.tsv
A gene list, with intergenic genes identified with promoter cutoff = 10000 nucleotides has been written to set1_vs_set2_c10_CpN5_d1_p0.05_P10_methylation_up_gene_list.txt
A gene list, with intergenic genes identified with promoter cutoff = 10000 nucleotides has been written to set1_vs_set2_c10_CpN5_d1_p0.05_P10_methylation_down_gene_list.txt
7042602/13828751 = 50.9% of nucleotides met minimum coverage and were present with minimum coverage = 10 in every sample.
**Segmentation fault**
Would you be able to print out/attach the output from running with -debug
?
the output appears to be being made, are the files that you get in output usable?
Here are the last lines ...
just finished chromosome X @ line 2077
Now analyzing chromosome Y, at iteration loop 25/25
@line 1491 => minimum nucleotide (index 10259590) for set 0-replicate 0 is 2782960 @ chromosome_loop 24 = Y
@line 1491 => minimum nucleotide (index 10258963) for set 0-replicate 1 is 2782960 @ chromosome_loop 24 = Y
@line 1491 => minimum nucleotide (index 11300572) for set 0-replicate 2 is 2782659 @ chromosome_loop 24 = Y
@line 1491 => minimum nucleotide (index 10630680) for set 1-replicate 0 is 2785510 @ chromosome_loop 24 = Y
@line 1491 => minimum nucleotide (index 10518286) for set 1-replicate 1 is 2782960 @ chromosome_loop 24 = Y
@line 1491 => minimum nucleotide (index 10579986) for set 1-replicate 2 is 2782960 @ chromosome_loop 24 = Y
@line 1568, max_gene_search_index = 4294967295
@ line 1586, GENES_chromosome_change_index = 24 -> searching 121966-122692 indices
@ line 1587 -> 1st nucleotide = 2752083; max nucleotide in chromosome = 0
@line 1596
just finished chromosome Y @ line 2077
Answers written to set1_vs_set2_c10_CpN5_d1_p0.05_P10.tsv
A gene list, with intergenic genes identified with promoter cutoff = 10000 nucleotides has been written to set1_vs_set2_c10_CpN5_d1_p0.05_P10_methylation_up_gene_list.txt
A gene list, with intergenic genes identified with promoter cutoff = 10000 nucleotides has been written to set1_vs_set2_c10_CpN5_d1_p0.05_P10_methylation_down_gene_list.txt
7042602/13828751 = 50.9% of nucleotides met minimum coverage and were present with minimum coverage = 10 in every sample.
Segmentation fault
I'm glad that you're getting the output files at least.
It's difficult for me to debug this without having the exact data set that you're using, I cannot reproduce your error.
Is there some way that you could give a very small sample of your data, a minimally reproducible example so that I could debug?