hhg7/defiant

Usage with Ensembl GTF or converted refFlat

harish0201 opened this issue · 8 comments

Hi David,

Hope you are doing well!

I have a question which has left me a bit stumped. We are studying the Indian variety of rice whose annotations are available in ENSEMBL.

I have provided the GTF, as converting to refFlat wasn't really working, however it seems that the number of features that are being read are a bit too much?

There are 255844 genes and 2140 chromosomes read from ../ref/Oryza_indica.ASM465v1.43.gtf (ftp://ftp.ensemblgenomes.org/pub/release-43/plants/gtf/oryza_indica/Oryza_indica.ASM465v1.43.chr.gtf.gz)

I believe that there should be 40320-ish genes.

I'm converting it to refFlat using gtfToGenePred, where in it only seems to detect 2262 genes. I'm a bit confused about this disparity. I'm using the following to convert:

/apps/kent_utils/gtfToGenePred -genePredExt -geneNameAsName2 Oryza_indica.ASM465v1.43.gtf reflat.txt

This seems to throw an error:
Couldn't recognize annotation file format with strand_definition = 3 @ annotation.h line 112

which is kinda confusing. Do you have any ideas?

Oh well, it seems that some post-processing is needed on the gtfToGenePred generated refFlat.

The steps are as follows:

  1. Generate the "gene-pred" formatted refFlat:
    gtfToGenePred -genePredExt -geneNameAsName2 genome.gtf reflat.txt

  2. generate a gene-transcript map

awk -F'\t' '{if($3=="transcript")print $NF}' genome.gtf | cut -d';' -f1-2| sed 's/gene_id "\|transcript_id "\|"//g' | sed 's/; /\t/g' | sort -Vk2 > gene_trans

  1. merge the gene-transcript map to the "gene-pred" refFlat:
    join -1 2 -2 1 -t$'\t' gent_trans sorted.reflat.txt | awk -F'\t' 'BEGIN{OFS="\t"}{print $2,$1,$3,$4,$5,$6,$7,$8,$9,$10,$11}' > final_flat.txt

This should work ideally for GTFs obtained from ENSEMBL, but please do let me know in case this generally doesn't work.

Edit: The modified refFlat does allow defiant to recognize all the genes.

hhg7 commented

Hi @harish0201 thanks for pointing all of this out! let me know if there is anything that I can do to help

I don't know if this is related, but when I run 2 sets of triplicates defiant runs without problems.
But when I add the -a Homo_sapiens.GRCh38.Ensembl90.gtf option, it crashes.

Reading gene annotation ./Homo_sapiens.GRCh38.Ensembl90.gtf
There are 258612 genes and 47 chromosomes read from ./Homo_sapiens.GRCh38.Ensembl90.gtf.
There are 2 sets to be read
There will be 1 runs.
Run information will be in defiant_run_info.txt

***WARNING***
There is/are 0 label(s), yet 2 sets, just a warning that you may not have correctly labeled data.
***WARNING***

Now reading file ./SRR949210.readset_sorted.dedup.map.input into memory...
Now reading file ./SRR949211.readset_sorted.dedup.map.input into memory...
Now reading file ./SRR949212.readset_sorted.dedup.map.input into memory...
Now reading file ./SRR949213.readset_sorted.dedup.map.input into memory...
Now reading file ./SRR949214.readset_sorted.dedup.map.input into memory...
Now reading file ./SRR949215.readset_sorted.dedup.map.input into memory...
Details of the data have been written to defiant_run_info.txt.
There will be 1 runs split among 1 workers.

------
Beginning analysis.

Worker 1/1 is analyzing run 1/1:        sets 1 vs. 2    with cutoffs    c = 10  CpN = 5 d = 1   p = 0.05        P = 10.000000                      S = 0
Now analyzing chromosome 1, at iteration loop 1/25
Segmentation fault
hhg7 commented

Hi @marcovth

please try the following code on your gtf to translate to a different format, running the following script gtf2refFlat.pl thus:
perl gtf2reflat.pl Homo_sapiens.GRCh38.Ensembl90.gtf

#!/usr/bin/env perl

use strict;
use warnings FATAL => 'all';
use feature 'say';
use autodie ':all';
use Scalar::Util 'looks_like_number';

open my $gtf, '>', $ARGV[0];
open my $refFlat, '>', "$ARGV[0].refFlat";
while (<$gtf>) {
	next if /^#/; # skip comment lines
	my @l = split /\t/;
	my $chr = $l[0];
	my $start = $l[3];
	my $end = $l[4];
	foreach my $n ($start, $end) {
		if (not looks_like_number($n)) {
			die "$n isn't a number.";
		}
	}
	my $sense = $l[6];
	if ($sense !~ m/^[+-]$/) {
		die "$sense is nother \"+\" nor \"-\".";
	}
	
	my $gene_id;
	if ($l[8] =~ m/gene_id "([^"]+)/) {
		$gene_id = $1
	} else {
		die "Can't find a gene in $_";
	}
	my $name;
	if ($l[8] =~ m/gene_name "([^"]+)/) {
		$name = $1
	} else {
		$name = $gene_id
	}
#	say $l[8]; exit;
	say $refFlat join ("\t", $name, $gene_id, $chr, $sense, $start, $end);
}

please let me know if this works.

Please change open my $gtf, '>', $ARGV[0]; for open my $gtf, '<', $ARGV[0];

It works better, but still a crash ...


Now analyzing chromosome 1, at iteration loop 1/25
Now analyzing chromosome 10, at iteration loop 2/25
Now analyzing chromosome 11, at iteration loop 3/25
Now analyzing chromosome 12, at iteration loop 4/25
Now analyzing chromosome 13, at iteration loop 5/25
Now analyzing chromosome 14, at iteration loop 6/25
Now analyzing chromosome 15, at iteration loop 7/25
Now analyzing chromosome 16, at iteration loop 8/25
Now analyzing chromosome 17, at iteration loop 9/25
Now analyzing chromosome 18, at iteration loop 10/25
Now analyzing chromosome 19, at iteration loop 11/25
Now analyzing chromosome 2, at iteration loop 12/25
Now analyzing chromosome 20, at iteration loop 13/25
Now analyzing chromosome 21, at iteration loop 14/25
Now analyzing chromosome 22, at iteration loop 15/25
Now analyzing chromosome 3, at iteration loop 16/25
Now analyzing chromosome 4, at iteration loop 17/25
Now analyzing chromosome 5, at iteration loop 18/25
Now analyzing chromosome 6, at iteration loop 19/25
Now analyzing chromosome 7, at iteration loop 20/25
Now analyzing chromosome 8, at iteration loop 21/25
Now analyzing chromosome 9, at iteration loop 22/25
Now analyzing chromosome M, at iteration loop 23/25
Now analyzing chromosome X, at iteration loop 24/25
Now analyzing chromosome Y, at iteration loop 25/25
Answers written to set1_vs_set2_c10_CpN5_d1_p0.05_P10.tsv
A gene list, with intergenic genes identified with promoter cutoff = 10000 nucleotides has been written to set1_vs_set2_c10_CpN5_d1_p0.05_P10_methylation_up_gene_list.txt
A gene list, with intergenic genes identified with promoter cutoff = 10000 nucleotides has been written to set1_vs_set2_c10_CpN5_d1_p0.05_P10_methylation_down_gene_list.txt
        7042602/13828751 = 50.9% of nucleotides met minimum coverage and were present with minimum coverage = 10 in every sample.
**Segmentation fault**
hhg7 commented

Would you be able to print out/attach the output from running with -debug?

the output appears to be being made, are the files that you get in output usable?

Here are the last lines ...

just finished chromosome X @ line 2077
Now analyzing chromosome Y, at iteration loop 25/25
@line 1491 => minimum nucleotide (index 10259590) for set 0-replicate 0 is 2782960 @ chromosome_loop 24 = Y
@line 1491 => minimum nucleotide (index 10258963) for set 0-replicate 1 is 2782960 @ chromosome_loop 24 = Y
@line 1491 => minimum nucleotide (index 11300572) for set 0-replicate 2 is 2782659 @ chromosome_loop 24 = Y
@line 1491 => minimum nucleotide (index 10630680) for set 1-replicate 0 is 2785510 @ chromosome_loop 24 = Y
@line 1491 => minimum nucleotide (index 10518286) for set 1-replicate 1 is 2782960 @ chromosome_loop 24 = Y
@line 1491 => minimum nucleotide (index 10579986) for set 1-replicate 2 is 2782960 @ chromosome_loop 24 = Y
@line 1568, max_gene_search_index = 4294967295
@ line 1586, GENES_chromosome_change_index = 24 -> searching 121966-122692 indices
@ line 1587 -> 1st nucleotide = 2752083; max nucleotide in chromosome = 0
@line 1596
just finished chromosome Y @ line 2077
Answers written to set1_vs_set2_c10_CpN5_d1_p0.05_P10.tsv
A gene list, with intergenic genes identified with promoter cutoff = 10000 nucleotides has been written to set1_vs_set2_c10_CpN5_d1_p0.05_P10_methylation_up_gene_list.txt
A gene list, with intergenic genes identified with promoter cutoff = 10000 nucleotides has been written to set1_vs_set2_c10_CpN5_d1_p0.05_P10_methylation_down_gene_list.txt
        7042602/13828751 = 50.9% of nucleotides met minimum coverage and were present with minimum coverage = 10 in every sample.
Segmentation fault
hhg7 commented

I'm glad that you're getting the output files at least.

It's difficult for me to debug this without having the exact data set that you're using, I cannot reproduce your error.

Is there some way that you could give a very small sample of your data, a minimally reproducible example so that I could debug?