script cannot parse all chromosome IDs
kaltinel opened this issue · 3 comments
Hi,
I am trying to run Pausepred with a ribosome profiling dataset, and I am having a problem of only receiving an output consisting of chrosome X,Y and MT.
I am using an Ensembl fasta file ( genome ), as I did my mapping to the genome.
My command like is as follows:
perl offline_pausepred.pl sorted.bam 1000 10 Homo_sapiens.GRCh38.dna.primary_assembly.fa 29,30,31,32,33,34 10 30 30 15,15,15,15,15,15
It properly runs and then my output has only 3 chromosomes (I tried with multiple files)
cat output.txt | awk -F ',' '{print $1}' | tail -n +2 | sort | uniq -c
1 gene_name
3373 MT
24791 X
131 Y
I tried to add 'chr'
after the'>'
in fasta file, by thinking that seeing a number (for ex. >1)
as chromosome information would cause an issue. However adding 'chr' string after'>'
did not help
Therefore I would appreciate the help.
Thanks!
Hi,
Thank you for your quick reply. I appreciate it.
Yes, I do use the same genome file I used in STAR alignment. It is the same file..
I am not sure what could be the issue.
I experienced the same issue with reads mapped to Ensembl reference sequences. I could solve the problem by adding a 'chr' to the chromosome sequence IDs in the bam file and the reference sequence fasta. Since Ensemble only uses integers to name the chromosome sequences I am quite sure this is a parsing error, occuring when the bam file is read in. Unfortunately, I don't code perl so I can't create a fix.