Incorrect number of genomes detected in overview of output
An incorrect number of genomes seems to be captured in the index HTML output. Thus, this affects one of the pie charts that is generated. In one case, when using one assembled genome with 11 BGCs, the overview in the index.html file incorrectly said that 11 genomes were used.
In another case, I used 15 assembled genomes, but the overview page in the index.html file reported 118. The input is set up as indicated in the tutorial.
This issue has not seen activity for 14 days and has been marked as stale. Please comment with additional information if this issue is still relevant.
Hi. The number of genomes for that figure is calculated from the header of the gbk files (from the Organism property, if I remember correctly), so the number can be incorrect depending on how these gbk files were produced.
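As a quick way to check this, the sketch below lists the distinct ORGANISM values found in a folder of gbk files, with the number of files carrying each. It only relies on the standard GenBank `ORGANISM` header line; the function name `list_organisms` is made up for illustration, and this is not how BiG-SCAPE itself reads the files.

```shell
#!/bin/bash
# Hypothetical helper (not part of BiG-SCAPE): summarise the ORGANISM header
# of every .gbk file below a directory.
list_organisms() {
    local dir="$1"
    # -h hides file names, -m1 keeps only the first ORGANISM line of each file.
    # GenBank headers look like: "  ORGANISM  Streptomyces sp."
    find "$dir" \( -type f -o -type l \) -name "*.gbk" \
        -exec grep -h -m1 '^ *ORGANISM' {} + \
        | sed -E 's/^ *ORGANISM +//' \
        | sort | uniq -c
}
```

Running `list_organisms path_to_input_dir` prints one line per distinct organism name. If the number of lines is far from the number of genomes you expect, the headers are a likely cause of the wrong count.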
Hi,
If the genome name is not in the organism property of the gbk, the considered name will be the name of the gbk file (without "cluster" or "region").
This happens here:
Lines 3271 to 3277 in 97d616c
If you are working with 3 clusters from the same genome (contig_1.region001.gbk, contig_2.region001.gbk, contig_3.region001.gbk), the script will consider that there are 3 genomes... (please correct me if I am wrong)
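To preview how many genomes would be inferred from file names alone, something like the sketch below can help. It assumes the genome name is everything before the first `.region` or `.cluster` in the file name, which is only my reading of the behaviour described above; the real BiG-SCAPE logic may differ, and `count_name_derived_genomes` is a made-up name.

```shell
#!/bin/bash
# Hypothetical helper: count distinct "genome" names derived from gbk file names.
# ASSUMPTION: the name is the part before the first ".region" or ".cluster".
count_name_derived_genomes() {
    local dir="$1"
    # Regular files and symbolic links are both considered
    find "$dir" \( -type f -o -type l \) -name "*.gbk" \
        | sed -E 's#.*/##; s/\.gbk$//; s/\.(region|cluster).*$//' \
        | sort -u | wc -l | tr -d ' '
}
```

Under that assumption, a folder with the three contig_N.region001.gbk files above gives 3, while files whose names all start with the same genome name followed by `.region001` give 1.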
I didn't have time to read the entire code, but I think you could adjust the names of your input files before running bigscape, i.e. to include the genome name (genome1.region001.contig_1.gbk, genome1.region001.contig_2.gbk, genome1.region001.contig_3.gbk).
Here is a bash script to include the genome name in the cluster.gbk files and create a symbolic link in the directory where the script is executed (input directory of bigscape):
#!/bin/bash

# Directory where the genome folders are located
genomes_dir="path_to_antiSMASH_output/"

# Loop through all genome folders
for genome_dir in "$genomes_dir"/*; do
    # Skip anything that is not a folder
    [ -d "$genome_dir" ] || continue

    # Extract the genome name from the folder
    genome=$(basename "$genome_dir")

    # Find all gbk files containing "region" in their name inside the genome folder
    find "$genome_dir" -type f -name "*region*.gbk" | while IFS= read -r gbk_file; do
        # Extract the file name without extension
        filename=$(basename "$gbk_file" .gbk)

        # Extract the region number from the file (grep -P needs GNU grep)
        region_number=$(echo "$filename" | grep -oP 'region\d+')

        # Extract the contig number from the file (assumes names like contig_1)
        contig_number=$(echo "$filename" | grep -oP 'contig_\d+')

        # Skip files that do not follow the expected naming
        if [ -z "$region_number" ] || [ -z "$contig_number" ]; then
            echo "Skipping $gbk_file: unexpected file name" >&2
            continue
        fi

        # Create the new file name
        new_filename="${genome}.${region_number}.${contig_number}.gbk"

        # Create the symbolic link with the new name in the current directory
        ln -s "$gbk_file" "./$new_filename"
    done
done
best,
Felipe