romikasaini/Pausepred_offline

script cannot parse all chromosome IDs

kaltinel opened this issue · 3 comments

Hi,

I am trying to run Pausepred with a ribosome profiling dataset, and I am having a problem of only receiving an output consisting of chrosome X,Y and MT.
I am using an Ensembl fasta file ( genome ), as I did my mapping to the genome.
My command like is as follows:

perl offline_pausepred.pl sorted.bam 1000 10 Homo_sapiens.GRCh38.dna.primary_assembly.fa 29,30,31,32,33,34 10 30 30 15,15,15,15,15,15 

It properly runs and then my output has only 3 chromosomes (I tried with multiple files)

cat output.txt | awk -F ',' '{print $1}' | tail -n +2 | sort | uniq -c
      1 gene_name
   3373 MT
  24791 X
    131 Y

I tried to add 'chr' after the'>'in fasta file, by thinking that seeing a number (for ex. >1) as chromosome information would cause an issue. However adding 'chr' string after'>'did not help

Therefore I would appreciate the help.

Thanks!

Hi,
Thank you for your quick reply. I appreciate it.

Yes, I do use the same genome file I used in STAR alignment. It is the same file..
I am not sure what could be the issue.

I experienced the same issue with reads mapped to Ensembl reference sequences. I could solve the problem by adding a 'chr' to the chromosome sequence IDs in the bam file and the reference sequence fasta. Since Ensemble only uses integers to name the chromosome sequences I am quite sure this is a parsing error, occuring when the bam file is read in. Unfortunately, I don't code perl so I can't create a fix.