medema-group/BiG-SCAPE

Incorrect number of genomes detected in overview of output


The index HTML output appears to report an incorrect number of genomes, which in turn affects one of the generated pie charts. In one case, I used a single assembled genome with 11 BGCs, and the overview in index.html reported that 11 genomes were used.
In another case, I used 15 assembled genomes, but the overview page in index.html reported 118. The input was prepared as described in the tutorial.


Hi. The number of genomes for that figure is calculated from the header of the gbk files (from the Organism property, if I remember correctly), so the count can be wrong depending on how these gbk files were produced.
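To see what the Organism property looks like in a given set of input files, a small script along the following lines may help. It is only a sketch: it reads the first `ORGANISM` line of each GenBank header as plain text and does not reproduce how BiG-SCAPE itself parses the files. The function names `read_organism` and `count_organisms` are made up for this example.

```python
from collections import Counter
from typing import Iterable, Optional


def read_organism(gbk_path: str) -> Optional[str]:
    """Return the text on the first ORGANISM line of a GenBank file, or None.

    Plain-text sketch: only the first ORGANISM line is read, so a name that
    wraps onto a second line would be truncated.
    """
    with open(gbk_path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            stripped = line.strip()
            if stripped.startswith("ORGANISM"):
                value = stripped[len("ORGANISM"):].strip()
                return value or None
            if stripped.startswith("FEATURES"):
                break  # the header section is over, stop reading
    return None


def count_organisms(gbk_paths: Iterable[str]) -> Counter:
    """Count how many files carry each ORGANISM value (None = not found)."""
    return Counter(read_organism(path) for path in gbk_paths)
```

Calling `count_organisms` on the list of input gbk paths gives one entry per distinct organism string. If most files land under `None`, or under many different strings, the header is probably not a reliable genome label for that data set.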

Hi,

If the organism property of the gbk file is missing or empty, the identifier falls back to the name of the gbk file, truncated at ".cluster" or ".region".

This happens here:

BiG-SCAPE/bigscape.py

Lines 3271 to 3277 in 97d616c

# get identifier info
identifier = ""
if len(bgc_info[bgc].organism) > 1:
    identifier = bgc_info[bgc].organism
else : # use original genome file name (i.e. exclude "..clusterXXX from antiSMASH run")
    file_name_base = os.path.splitext(os.path.basename(genbankDict[bgc][0]))[0]
    identifier = file_name_base.rsplit(".cluster",1)[0].rsplit(".region", 1)[0]

If you are working with 3 clusters from the same genome (contig_1.region001.gbk, contig_2.region001.gbk, contig_3.region001.gbk), each file name is truncated at ".region", which leaves three different identifiers (contig_1, contig_2, contig_3), so the script will count 3 genomes (please correct me if I am wrong).

I didn't have time to read the entire code, but I think you could rename your input files before running BiG-SCAPE so that each name starts with the genome name (genome1.region001.contig_1.gbk, genome1.region001.contig_2.gbk, genome1.region001.contig_3.gbk).
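The effect of the two naming schemes can be checked with a short stand-alone sketch. It copies only the file-name fallback from the excerpt quoted above and assumes the organism property is empty for every file. The helper names `fallback_identifier` and `count_genomes` are invented for this illustration and are not part of BiG-SCAPE.

```python
import os
from typing import Iterable


def fallback_identifier(gbk_path: str) -> str:
    """Mirror the file-name fallback from the quoted excerpt."""
    file_name_base = os.path.splitext(os.path.basename(gbk_path))[0]
    return file_name_base.rsplit(".cluster", 1)[0].rsplit(".region", 1)[0]


def count_genomes(gbk_paths: Iterable[str]) -> int:
    """Number of distinct identifiers, i.e. what would be counted as genomes."""
    return len({fallback_identifier(path) for path in gbk_paths})


# File names as in the example: the contig name comes before ".region"
per_contig = [
    "contig_1.region001.gbk",
    "contig_2.region001.gbk",
    "contig_3.region001.gbk",
]

# Suggested renaming: the genome name comes before ".region"
per_genome = [
    "genome1.region001.contig_1.gbk",
    "genome1.region001.contig_2.gbk",
    "genome1.region001.contig_3.gbk",
]

print(count_genomes(per_contig))  # 3
print(count_genomes(per_genome))  # 1
```

Everything after the last ".region" is discarded, so any part of the name that has to stay unique per file (such as the contig) can safely go after the region number.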

Here is a bash script that creates a symbolic link for each region gbk file, with the genome name included in the link name. The links are created in the directory where the script is executed (the input directory for BiG-SCAPE):

#!/bin/bash

# Directory where the genome folders are located
genomes_dir="path_to_antiSMASH_output/"

# Loop through all genome folders
for genome_dir in "$genomes_dir"/*; do
    # Skip anything that is not a directory
    [ -d "$genome_dir" ] || continue
    # Extract the genome name from the folder
    genome=$(basename "$genome_dir")

    # Find all gbk files containing "region" in their name inside the genome folder
    find "$genome_dir" -type f -name "*region*.gbk" | while read -r gbk_file; do
        # Extract the file name without extension
        filename=$(basename "$gbk_file" .gbk)
        # Extract the region number from the file name (grep -P requires GNU grep)
        region_number=$(echo "$filename" | grep -oP 'region\d+')

        # Extract the contig number from the file
        contig_number=$(echo "$filename" | grep -oP 'contig_\d+')

        # Create the new file name
        new_filename="${genome}.${region_number}.${contig_number}.gbk"

        # Create the symbolic link with the new name in the current directory
        ln -s "$gbk_file" "./$new_filename"
    done
done

best,
Felipe
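On systems without GNU grep (the `-P` option used above is a GNU extension), the same renaming could be done in Python. The following is a rough equivalent of the bash script, not a tested replacement. It assumes the same layout (one folder per genome, region files named like contig_1.region001.gbk). The names `new_link_name` and `link_regions` are hypothetical. Unlike the bash version, it leaves out the contig part when the file name has none, and it skips a link whose name already exists instead of failing.

```python
import re
from pathlib import Path
from typing import List, Optional

REGION_RE = re.compile(r"region\d+")
CONTIG_RE = re.compile(r"contig_\d+")


def new_link_name(genome: str, gbk_name: str) -> Optional[str]:
    """Build '<genome>.<regionNNN>[.<contig_N>].gbk', or None if no region number."""
    stem = Path(gbk_name).stem
    region = REGION_RE.search(stem)
    if region is None:
        return None
    parts = [genome, region.group(0)]
    contig = CONTIG_RE.search(stem)
    if contig is not None:
        parts.append(contig.group(0))
    return ".".join(parts) + ".gbk"


def link_regions(genomes_dir, out_dir=".") -> List[Path]:
    """Create one symbolic link per region gbk file and return the new links."""
    out = Path(out_dir)
    created = []
    for genome_dir in sorted(Path(genomes_dir).iterdir()):
        if not genome_dir.is_dir():
            continue
        # The genome name is taken from the folder name
        for gbk in sorted(genome_dir.rglob("*region*.gbk")):
            if not gbk.is_file():
                continue
            name = new_link_name(genome_dir.name, gbk.name)
            if name is None:
                continue
            dest = out / name
            if dest.exists() or dest.is_symlink():
                continue  # do not overwrite an existing file or link
            dest.symlink_to(gbk.resolve())
            created.append(dest)
    return created
```

For example, `link_regions("path_to_antiSMASH_output", ".")` run from the BiG-SCAPE input directory would create the links there. Because two source files can map to the same link name when neither has a contig part, it is worth comparing the number of returned links with the number of region files.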