Chromosome ID not present in genome fasta
adriandrr opened this issue · 2 comments
Hey, I am currently trying to run your synteny plot pipeline. I would be really glad if you could help me.
I am trying to plot the synteny between 12 de novo assembled bacterial whole genomes.
I aligned them pairwise with minimap2: A+B, B+C, C+D and so on. I used a bash script for that, if you want to have a look:
#!/bin/bash
# Align each assembly against the next one in the list (0 vs 1, 1 vs 2, ...).
fasta_files=(0.fa 1.fa 2.fa 3.fa 4.fa 5.fa 6.fa 7.fa 9.fa 10.fa 11.fa 12.fa 13.fa)
for ((i = 0; i < ${#fasta_files[@]} - 1; i++)); do
    current_file="${fasta_files[$i]}"
    next_file="${fasta_files[$i + 1]}"
    # Strip the extension: 0.fa -> 0
    current_prefix="${current_file%.*}"
    next_prefix="${next_file%.*}"
    output_bam="${current_prefix}_${next_prefix}.bam"
    minimap2 -ax asm5 -t 4 --eqx "$current_file" "$next_file" | samtools sort -O BAM - > "$output_bam"
    samtools index "$output_bam"
done
After that I used a bash one-liner to run syri on each BAM file:
for i in $(ls -1v *.bam); do prefix="${i%.*}"; IFS="_" read -r fnum snum <<< "$prefix"; syri -c "$i" -r "$fnum.fa" -q "$snum.fa" -F B --prefix "$prefix"; done
I am not sure whether there is a problem with the syri output, since I am not very familiar with it. The first 5 lines of the first syri output, "0_1syri.out", look like this:
Chr0 1 4344 - - - - - NOTAL1 - NOTAL -
Chr0 4345 5189076 - - Chr0 1 5020588 SYN1 - SYN -
Chr0 4345 4896 - - Chr0 1 553 SYNAL1 SYN1 SYNAL -
Chr0 4484 4484 G T Chr0 140 140 SNP543 SYN1 SNP -
Chr0 4485 4485 A T Chr0 141 141 SNP544 SYN1 SNP -
I then tried to run plotsr with this command:
plotsr --sr 0_1syri.out --sr 1_2syri.out --sr 2_3syri.out --sr 3_4syri.out --sr 4_5syri.out --sr 5_6syri.out --sr 6_7syri.out --sr 7_9syri.out --sr 9_10syri.out --sr 10_11syri.out --sr 11_12syri.out --sr 12_13syri.out --genomes ../../genomes2.txt -o output_plot.png
At first I used the FASTA files themselves as input, with a genomes2.txt file that looked like this:
#file name tags
0.fa 0 lw:1.5
1.fa 1 lw:1.5
10.fa 10 lw:1.5
11.fa 11 lw:1.5
12.fa 12 lw:1.5
13.fa 13 lw:1.5
2.fa 2 lw:1.5
3.fa 3 lw:1.5
4.fa 4 lw:1.5
5.fa 5 lw:1.5
6.fa 6 lw:1.5
7.fa 7 lw:1.5
9.fa 9 lw:1.5
and I ran into the error:
ImportError: For chromosome ID: Chr0, length in genome fasta: genomes2.txt is less than the maximum coordinate in the structural annotation file: 1_2syri.out. Exiting.
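One way to see what this check compares is to pull the largest coordinate per chromosome out of a syri.out file and set it beside the sequence lengths. The following is a minimal sketch, assuming columns 1-3 hold the reference chromosome, start and end, and columns 6-8 the query equivalents, as in the syri.out lines shown above. `syri_max_coords` is a hypothetical helper name, not part of syri or plotsr.

```shell
#!/bin/bash
# Hypothetical helper: report the largest reference and query coordinate
# per chromosome in a syri.out file.
# Assumption: columns 1-3 = reference chr/start/end, columns 6-8 = query chr/start/end.
syri_max_coords() {
    awk '
        # Keep v as the running maximum for chr, ignoring "-" placeholders.
        function upd(arr, chr, v) {
            if (v ~ /^[0-9]+$/ && v + 0 > arr[chr] + 0) arr[chr] = v + 0
        }
        {
            upd(ref, $1, $2); upd(ref, $1, $3)
            upd(qry, $6, $7); upd(qry, $6, $8)
        }
        END {
            for (c in ref) print "ref", c, ref[c]
            for (c in qry) print "qry", c, qry[c]
        }
    ' "$1"
}
```

Running `syri_max_coords 1_2syri.out` prints one line per chromosome and role. Each maximum should not exceed the length of the matching sequence in whichever genome plotsr pairs with that file.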
I didn't understand the error. The first fasta file is the reference and therefore the longest, so I don't see how a coordinate could exceed the chromosome length. Anyway, I saw that it is possible to use chromosome lengths as input instead. So I calculated the sequence length for each fasta file and wrote it to a chrlen file. Of course I also renamed the input files in genomes2.txt from .fa to .chrlen. The chrlen files look like this:
"0.chrlen":
Chr0 5199559
"1.chrlen":
Chr0 5020588
and so on...
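For reference, a chrlen file like these can be derived from a FASTA file with a short awk pass. This is a minimal sketch, assuming plotsr expects two tab-separated columns (chromosome ID, then length) and that the ID is the header text up to the first whitespace. `fasta_to_chrlen` is a hypothetical helper name, not a plotsr command.

```shell
#!/bin/bash
# Hypothetical helper: print "<sequence ID><TAB><length>" for every record in a FASTA file.
# Assumption: the chromosome ID is the header up to the first whitespace, without ">".
fasta_to_chrlen() {
    awk '
        /^>/ {
            # A new header: flush the previous record, then start counting again.
            if (id != "") print id "\t" len
            id = substr($1, 2)
            len = 0
            next
        }
        {
            # Sequence line: drop stray whitespace and carriage returns, then add its length.
            gsub(/[ \t\r]/, "")
            len += length($0)
        }
        END { if (id != "") print id "\t" len }
    ' "$1"
}
```

For example, `fasta_to_chrlen 0.fa > 0.chrlen` would write one line per sequence in 0.fa.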
With that and the same plotsr command, I ran into this error:
ImportError: Chromosome ID: Chr0 in structural annotation file: 0_1syri.out not present in genome fasta: 0. Exiting
Could you explain to me what I am doing wrong?
Thanks!
P.S.: Thanks for reading this far. I think I found an error in the example described in the README file. The font set in the example files markers.bed and tracks.txt is Arial. I think this is not supported anymore (?). Anyway, I changed it to DejaVu Sans and it worked again. Thought you might want to know :)
Update: I ordered the genomes.txt file numerically, so that it is:
#file name tags
0.fa 0 lw:1.5
1.fa 1 lw:1.5
2.fa 2 lw:1.5
...
With the chrlen files I still run into the same error as before, but with the fasta files it actually worked!!
For future reference: genomes.txt must list the genomes in the same order in which they were analysed. Also, to use chrlen files, add the ft:cl tag.
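Both points can be handled by generating genomes.txt from the same ordered list that drove the alignments, rather than from a directory listing (which sorts 10 before 2). The following is a minimal sketch, assuming the file is tab-separated and that multiple tags are joined with a semicolon. `make_genomes_txt` is a hypothetical helper name, and the genome IDs are the ones used in this thread.

```shell
#!/bin/bash
# Hypothetical helper: write a plotsr genomes file in the order the genomes are given.
# Usage: make_genomes_txt <fa|chrlen> <id> [<id> ...]
# Assumptions: columns are tab-separated; tags are separated by ";".
make_genomes_txt() {
    local ext="$1" id tags
    shift
    # chrlen files need the ft:cl tag; fasta files use the default file type.
    if [ "$ext" = "chrlen" ]; then
        tags="ft:cl;lw:1.5"
    else
        tags="lw:1.5"
    fi
    printf '#file\tname\ttags\n'
    for id in "$@"; do
        printf '%s.%s\t%s\t%s\n' "$id" "$ext" "$id" "$tags"
    done
}
```

For example, `make_genomes_txt chrlen 0 1 2 3 4 5 6 7 9 10 11 12 13 > genomes.txt` lists the genomes in the same order as the chain of --sr files passed to plotsr.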