Chromosome ID not present in genome fasta

Question

adriandrr opened this issue 2 years ago · 2 comments

Hey, I am currently trying to run your synteny plot pipeline and would be really glad if you could help me.

I am trying to plot the synteny between 12 de novo assembled whole bacterial genomes.

I mapped them pairwise with minimap2 (A+B, B+C, C+D and so on). I used a bash script for that, in case you want to have a look:

#!/bin/bash

fasta_files=(0.fa 1.fa 2.fa 3.fa 4.fa 5.fa 6.fa 7.fa 9.fa 10.fa 11.fa 12.fa 13.fa)

for ((i = 0; i < ${#fasta_files[@]} - 1; i++)); do
    current_file="${fasta_files[$i]}"
    next_file="${fasta_files[$i + 1]}"
    current_prefix="${current_file%.*}"
    next_prefix="${next_file%.*}"
    output_bam="${current_prefix}_${next_prefix}.bam"

    minimap2 -ax asm5 -t 4 --eqx "$current_file" "$next_file" | samtools sort -O BAM - > "$output_bam"
    samtools index "$output_bam"
done

After that I used a bash oneliner for-loop to produce the syri information:

for i in $(ls *bam -1v); do prefix="${i%.*}"; IFS="_" read -r fnum snum <<< "$prefix"; syri -c "$i" -r "$fnum.fa" -q "$snum.fa" -F B --prefix "$prefix"; done

I am unsure if there is a problem with the syri information since I am not very familiar with that. The first 5 lines of the first syri output "0_1syri.out" look like this:

Chr0  1     4344     -  -  -     -    -        NOTAL1  -     NOTAL  -
Chr0  4345  5189076  -  -  Chr0  1    5020588  SYN1    -     SYN    -
Chr0  4345  4896     -  -  Chr0  1    553      SYNAL1  SYN1  SYNAL  -
Chr0  4484  4484     G  T  Chr0  140  140      SNP543  SYN1  SNP    -
Chr0  4485  4485     A  T  Chr0  141  141      SNP544  SYN1  SNP    -

I then tried to run plotsr with this command:

plotsr --sr 0_1syri.out --sr 1_2syri.out --sr 2_3syri.out --sr 3_4syri.out --sr 4_5syri.out --sr 5_6syri.out --sr 6_7syri.out --sr 7_9syri.out --sr 9_10syri.out --sr 10_11syri.out --sr 11_12syri.out --sr 12_13syri.out --genomes ../../genomes2.txt -o output_plot.png

At first I wanted to use the FASTA files themselves as input, and the genomes2.txt file looked like this:

#file   name  tags
0.fa    0     lw:1.5
1.fa    1     lw:1.5
10.fa   10    lw:1.5
11.fa   11    lw:1.5
12.fa   12    lw:1.5
13.fa   13    lw:1.5
2.fa    2     lw:1.5
3.fa    3     lw:1.5
4.fa    4     lw:1.5
5.fa    5     lw:1.5
6.fa    6     lw:1.5
7.fa    7     lw:1.5
9.fa    9     lw:1.5

and I ran into the error:
ImportError: For chromosome ID: Chr0, length in genome fasta: genomes2.txt is less than the maximum coordinate in the structural annotation file: 1_2syri.out. Exiting.
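
The check behind this message compares coordinates in a syri.out file with a sequence length, so listing the largest coordinate per chromosome can help narrow it down. The following is only a minimal sketch: it assumes the whitespace-separated syri.out layout shown above (reference chromosome, start and end in columns 1 to 3; query chromosome, start and end in columns 6 to 8), and `max_coords` is a hypothetical helper name, not part of syri or plotsr.

```shell
#!/bin/bash
# Hypothetical helper: print the largest reference and query coordinate
# per chromosome found in a syri.out file.
# Assumed columns: 1 = ref chr, 2-3 = ref start/end,
#                  6 = qry chr, 7-8 = qry start/end.
max_coords() {
    awk '
        function upd(side, chr, a, b,   m, key) {
            # "-" marks a side without coordinates (e.g. NOTAL rows)
            if (chr == "-" || a == "-" || b == "-") return
            # start can exceed end for inverted regions, so take the larger
            m = (a + 0 > b + 0) ? a + 0 : b + 0
            key = side "\t" chr
            if (!(key in best) || m > best[key]) best[key] = m
        }
        { upd("ref", $1, $2, $3); upd("qry", $6, $7, $8) }
        END { for (k in best) print k "\t" best[k] }
    ' "$1" | sort
}

# Example: max_coords 1_2syri.out
```

Each output line gives the side (ref or qry), the chromosome ID and the largest coordinate, which can then be compared with the length of the corresponding sequence in the matching FASTA file.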

I didn't understand the error. The first FASTA file is the reference and therefore the longest, so I don't see how a coordinate could exceed the sequence length. I then saw that chromosome lengths can be used as input instead, so I calculated the sequence length for each FASTA file and produced a chrlen file for each. Of course, I also changed the input file names in genomes2.txt from .fa to .chrlen. The chrlen files look like this:

"0.chrlen":
Chr0    5199559

"1.chrlen":
Chr0    5020588

and so on...
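
For reference, a chrlen file in the two-column form shown above (sequence ID, then length) could be produced from a FASTA file roughly as follows. This is a sketch under assumptions: `fasta_to_chrlen` is a hypothetical helper name, and the sequence ID is taken as the header text up to the first whitespace.

```shell
#!/bin/bash
# Hypothetical helper: write "<sequence ID><TAB><length>" for every
# record in a FASTA file.
fasta_to_chrlen() {
    awk '
        /^>/ {
            # Emit the previous record before starting a new one
            if (id != "") print id "\t" len
            id = substr($1, 2)   # header up to the first whitespace, without ">"
            len = 0
            next
        }
        {
            gsub(/[ \t\r]/, "")  # ignore stray whitespace and Windows line endings
            len += length($0)
        }
        END { if (id != "") print id "\t" len }
    ' "$1"
}

# Example: fasta_to_chrlen 0.fa > 0.chrlen
```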

With that and the same plotsr command, I ran into the error:

ImportError: Chromosome ID: Chr0 in structural annotation file: 0_1syri.out not present in genome fasta: 0. Exiting

Could you explain to me what I am doing wrong?
Thanks!

P.S.: Thanks for reading this far. I think I found an error in the example described in the README file. The fonts chosen in the example files markers.bed and tracks.txt are Arial, which I think is not supported anymore (?). Anyway, I changed it to DejaVu Sans and it worked again. Thought you might want to know :)

Answer 1 · 2023-06-28T08:29:45.000Z

Update: I ordered the genomes.txt file numerically, so that it is

#file name tags
0.fa 0 lw:1.5
1.fa 1 lw:1.5
2.fa 2 lw:1.5
...

With the chrlen files I still run into the same error as before, but with the FASTA files it actually worked!

Answer 2 · 2023-06-28T10:46:58.000Z

For future reference: genomes.txt requires the genomes to be listed in the same order in which they are analysed. Also, to use chrlen files, add the ft:cl tag.
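
One way to keep the order consistent is to generate genomes.txt from the same list of genomes that drove the alignment loop, instead of from a directory listing (which sorts 10 before 2). A minimal sketch, with assumptions: `make_genomes_txt` is a hypothetical helper name, and combining several tags with a semicolon (as in `ft:cl;lw:1.5`) is assumed from the plotsr README and should be checked against the installed version.

```shell
#!/bin/bash
# Hypothetical helper: print a plotsr genomes file that lists the genomes
# in the order they were aligned.
# Arguments: file suffix (e.g. "fa" or "chrlen"), tag string,
#            then the genome names in analysis order.
make_genomes_txt() {
    local suffix="$1" tags="$2"
    shift 2
    printf '#file\tname\ttags\n'
    local name
    for name in "$@"; do
        printf '%s.%s\t%s\t%s\n' "$name" "$suffix" "$name" "$tags"
    done
}

# Example (tag separator ";" is an assumption):
# make_genomes_txt chrlen 'ft:cl;lw:1.5' 0 1 2 3 4 5 6 7 9 10 11 12 13 > genomes.txt
```

Passing the names explicitly means the file order follows the 0_1, 1_2, 2_3, ... chain of syri.out files regardless of how the shell sorts file names.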