Empty output from from create_motif_top3_ru_forward.pl step
michael-kotliar opened this issue · 6 comments
Hello,
I was wondering if this part of the code from create_motif_top3_ru_forward.pl is correct
@hat=();
$i=$start-5;
while ($i<$end){
push @hat, $feature[$i];
$i++;
}
@hat=sort{$b<=>$a}@hat;
print NEW "$hat[0]\t$hat[1]\t$hat[2]\n";
It produces an empty file with only TAB symbols.
There isn't any guarantee that @feature
will have value on i
position.
Hi Michael,
Thanks for your question. I think this code is correct - it extracts the top 3 motif scan hits out of 200 hits within a 200-bp genomic interval. For example,
chr21 600 800
in this interval, there are 200 @feature
values for all these i=600-800
positions.
The genomic interval file in the bedgraph format can be downloaded from here: test_regions.blacklistfiltered.bed
My suggestion is that checking outputs of the previous step. For example, I run the following command on my machine:
python ./preprocess/sequence/scan_motif_by_chr.py chr21 TAF1 ./data/motif/ ./data/hg_genome/ ./data/seq_forward/
It generates a file ./data/seq_forward/TAF1_chr21
. This file has 48,129,875 lines and each line has a real value.
I this if this file is empty, then you may observe an empty file with only TAB symbols.
Let me know if you need more information.
Thanks,
Hongyang
No. It should be something like this - I used TAF1_chr21 as an example:
head TAF1_chr21
0.75397
0.75397
0.75397
0.75397
0.75397
0.75397
0.75397
0.75397
0.75397
0.75397
If it is empty in your case, then something wrong with this step:
python ./preprocess/sequence/scan_motif_by_chr.py chr21 TAF1 ./data/motif/ ./data/hg_genome/ ./data/seq_forward/
You may check if the input files for this step is correct and complete:
(i) motif PWM
We downloaded them from e.g. hocomoco11
(ii) hg19 sequences
The hg19 file can be downloaded from here. I separated it into individual files named as 'chr1', 'chr2', ... , 'chrX', so that you can run the motif scan code in parallel for multiple chromosomes.
Sorry, I accidentally deleted my previous post. Do I need to keep all sequences for chr in one line or it can be multiline? And also, I think you removed header >chrN
from all the files, correct?
That's what I have in the downloaded hg19 genome file.
>chr10
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
...
The length of the line is 50. You do something like max_length=length-20
in your scan_motif_by_chr.py
so as output from this script I can get only 30 lines of values.
Yes, you are correct - I removed the headers and only one line per chr (instead of multiple lines).
Thank you for pointing this out.
Cool! Thank you, I'll try to run it with corrected genome fasta files
You may try something like this:
perl -pe 'chomp unless /^>/' input > output
For example
input (multiple lines):
chr10
NNNNNNNN
NNNNNNNN
NNNNNNNN
NNNNNNNN
NNNNNNNN
output (two lines):
chr10
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN