Empty output from from create_motif_top3_ru_forward.pl step

Question

Empty output from from create_motif_top3_ru_forward.pl step

michael-kotliar opened this issue 5 years ago · 6 comments

Hello,
I was wondering if this part of the code from create_motif_top3_ru_forward.pl is correct

@hat=();
$i=$start-5;
while ($i<$end){
	push @hat, $feature[$i];
	$i++;
}
@hat=sort{$b<=>$a}@hat;
print NEW "$hat[0]\t$hat[1]\t$hat[2]\n";

It produces an empty file with only TAB symbols.
There isn't any guarantee that @feature will have value on i position.

Answer 1 · 2019-11-19T17:03:01.000Z

Hi Michael,

Thanks for your question. I think this code is correct - it extracts the top 3 motif scan hits out of 200 hits within a 200-bp genomic interval. For example,
chr21 600 800
in this interval, there are 200 @feature values for all these i=600-800 positions.
The genomic interval file in the bedgraph format can be downloaded from here: test_regions.blacklistfiltered.bed

My suggestion is that checking outputs of the previous step. For example, I run the following command on my machine:
python ./preprocess/sequence/scan_motif_by_chr.py chr21 TAF1 ./data/motif/ ./data/hg_genome/ ./data/seq_forward/
It generates a file ./data/seq_forward/TAF1_chr21. This file has 48,129,875 lines and each line has a real value.
I this if this file is empty, then you may observe an empty file with only TAB symbols.

Let me know if you need more information.

Thanks,
Hongyang

Answer 2 · 2019-11-19T20:16:20.000Z

No. It should be something like this - I used TAF1_chr21 as an example:

head TAF1_chr21 
0.75397
0.75397
0.75397
0.75397
0.75397
0.75397
0.75397
0.75397
0.75397
0.75397

If it is empty in your case, then something wrong with this step:
python ./preprocess/sequence/scan_motif_by_chr.py chr21 TAF1 ./data/motif/ ./data/hg_genome/ ./data/seq_forward/
You may check if the input files for this step is correct and complete:
(i) motif PWM
We downloaded them from e.g. hocomoco11
(ii) hg19 sequences
The hg19 file can be downloaded from here. I separated it into individual files named as 'chr1', 'chr2', ... , 'chrX', so that you can run the motif scan code in parallel for multiple chromosomes.

Answer 3 · 2019-11-19T20:25:49.000Z

Sorry, I accidentally deleted my previous post. Do I need to keep all sequences for chr in one line or it can be multiline? And also, I think you removed header >chrN from all the files, correct?
That's what I have in the downloaded hg19 genome file.

>chr10
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
...

The length of the line is 50. You do something like max_length=length-20 in your scan_motif_by_chr.py so as output from this script I can get only 30 lines of values.

Answer 4 · 2019-11-19T20:32:17.000Z

Yes, you are correct - I removed the headers and only one line per chr (instead of multiple lines).
Thank you for pointing this out.

Answer 5 · 2019-11-19T20:33:26.000Z

Cool! Thank you, I'll try to run it with corrected genome fasta files

Answer 6 · 2019-11-19T20:40:12.000Z

You may try something like this:
perl -pe 'chomp unless /^>/' input > output

For example
input (multiple lines):

chr10
NNNNNNNN
NNNNNNNN
NNNNNNNN
NNNNNNNN
NNNNNNNN

output (two lines):

chr10
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN