script cannot parse all chromosome IDs

Question

kaltinel opened this issue 4 years ago · 3 comments

Hi,

I am trying to run Pausepred on a ribosome profiling dataset, but the output only contains chromosomes X, Y and MT.
I am using an Ensembl genome FASTA file, the same one I mapped the reads to.
My command line is as follows:

perl offline_pausepred.pl sorted.bam 1000 10 Homo_sapiens.GRCh38.dna.primary_assembly.fa 29,30,31,32,33,34 10 30 30 15,15,15,15,15,15

It runs without errors, but the output contains only 3 chromosomes (I tried with multiple files):

cat output.txt | awk -F ',' '{print $1}' | tail -n +2 | sort | uniq -c
      1 gene_name
   3373 MT
  24791 X
    131 Y

I tried adding 'chr' after the '>' in the FASTA file, thinking that a bare number as the chromosome name (e.g. >1) might be causing the issue. However, adding the 'chr' string after the '>' did not help.

Therefore I would appreciate the help.

Thanks!

Answer 1 · 2020-07-14T07:43:20.000Z

Hi, please make sure that the chromosome names are the same in both your alignment file (.bam) and your reference genome file (.fa). Are you running the Pausepred script with the same reference genome file that you used to generate the alignment file?
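
One way to check this is to list the sequence names on each side and compare them. The following is only a sketch, not part of Pausepred: the function names are made up for illustration, and it assumes the BAM header has been dumped to a text file first with `samtools view -H`.

```shell
# Print sequence names from a FASTA file: the text after '>' up to the first whitespace.
fasta_names() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | LC_ALL=C sort -u
}

# Print sequence names from SAM header text on stdin (SN: field of @SQ lines).
sam_header_names() {
    awk -F '\t' '$1 == "@SQ" { for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) print substr($i, 4) }' |
        LC_ALL=C sort -u
}

# List names found on only one side.
# $1 = FASTA file, $2 = text file holding the SAM/BAM header.
# Column 1 = FASTA only, column 2 (tab-indented) = header only.
name_mismatches() {
    fa_list=$(mktemp) || return 1
    sam_list=$(mktemp) || return 1
    fasta_names "$1" > "$fa_list"
    sam_header_names < "$2" > "$sam_list"
    # comm -3 suppresses the names common to both lists
    LC_ALL=C comm -3 "$fa_list" "$sam_list"
    rm -f "$fa_list" "$sam_list"
}

# Possible usage (header.sam is a hypothetical file name):
#   samtools view -H sorted.bam > header.sam
#   name_mismatches Homo_sapiens.GRCh38.dna.primary_assembly.fa header.sam
```

If this prints nothing, the two files use the same set of names, and the cause has to be somewhere else.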

--
Romika Kumari
Postdoctoral Researcher
Institute for Molecular Medicine Finland (FIMM)
LinkedIn: https://www.linkedin.com/in/romika-kumari/
ResearchGate: https://www.researchgate.net/profile/Romika_Kumari

Answer 2 · 2020-07-14T08:38:16.000Z

Hi,
Thank you for your quick reply. I appreciate it.

Yes, I am using the same genome file that I used for the STAR alignment.
I am not sure what the issue could be.

Answer 3 · 2021-11-08T15:20:35.000Z

I experienced the same issue with reads mapped to Ensembl reference sequences. I was able to solve the problem by adding 'chr' to the chromosome sequence IDs in both the BAM file and the reference FASTA. Since Ensembl names its chromosome sequences with bare integers, I am fairly sure this is a parsing error that occurs when the BAM file is read in. Unfortunately, I don't write Perl, so I can't provide a fix.
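
For anyone wanting to try this workaround, the renaming could be sketched as below. This is an illustration and not necessarily the exact commands used above: the function names and output file names are hypothetical, and the BAM step assumes samtools is installed. The prefix has to be added on both sides. Changing only the FASTA, as in the original post, leaves the names in the BAM untouched.

```shell
# Prefix every FASTA header with 'chr' (stdin -> stdout); sequence lines pass through unchanged.
add_chr_fasta() {
    sed 's/^>/>chr/'
}

# Prefix the SN: field of each @SQ line in SAM header text with 'chr' (stdin -> stdout).
# All other header lines are printed as they are.
add_chr_sam_header() {
    awk 'BEGIN { FS = OFS = "\t" }
         $1 == "@SQ" { for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) $i = "SN:chr" substr($i, 4) }
         { print }'
}

# Possible usage (genome.chr.fa and sorted.chr.bam are hypothetical output names):
#   add_chr_fasta < Homo_sapiens.GRCh38.dna.primary_assembly.fa > genome.chr.fa
#   samtools view -H sorted.bam | add_chr_sam_header | samtools reheader - sorted.bam > sorted.chr.bam
```

`samtools reheader` only swaps the header, which is enough here because BAM records refer to reference sequences by index rather than by name. Note that this simple prefixing turns MT into chrMT, not the UCSC-style chrM; that should not matter as long as the BAM and the FASTA are renamed the same way.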