B-UMMI/chewBBACA

Problem: Unable to get paralogous_counts.tsv and run Paralog detection using step-by-step tutorial

luobosi opened this issue · 3 comments

Hello developers, the problem I encountered is described as follows:

I'm trying to use (https:// zenodo.org/record/7129300#.Yzb26_fTUUE) the dataset and command line code for the relevant step-by-step tutorial provided at this link (https://chewbbaca.readthedocs.io/en/latest/user/tutorials/chewie_step_by_step.html). When running to the second step: Allele calling, I found that the output directory corresponding to the path /chewbbaca/test/chewBBACA_tutorial/results32_wgMLST/results_20221122T214136/ only contains the following files: logging_info.txt, RepeatedLoci.txt, results_alleles.tsv, results_contigsInfo.tsv , results_statistics.tsv; that is to say, I did not find the paralogous_counts.tsv and paralogous_loci.tsv files that prompts in the output tutorial. Therefore, in the third step of Paralog detection, an error message prompts me that paralogous_counts.tsv does not exist.

And when I checked the expected results already in the data set, I also found that there were only the above five files in the path /chewbbaca/test/chewBBACA_tutorial/expected_results/Allele_calling/results32_wgMLST/results_20210429T125638/, no paralogous_counts.tsv, paralogous_loci .tsv. I checked the results_alleles.tsv files in the two directories, and found that the loci number corresponding to each genome was 3128, and in the log of Allele calling operation, I saw that there were 20 paralogs in the last step. But I cannot get the specific paralogous_counts.tsv file, nor can I perform the Paralog detection step. I tried to skip this step and directly use results_alleles.tsv as the input content of the subsequent steps, and found that except for some data differences, the code and results can run normally.

So I want to know, how can I get the paralogous_counts.tsv, paralogous_loci.tsv files and run the Paralog detection step successfully? And if I skip this step and continue to run the rest steps, whether the cgMLST scheme is convincing and based enough (I want to construct a cgMLST scheme for a certain species to analyze the population structure and phylogenetic relationship of the corresponding species, and try to use scheme to set thresholds for different hierarchical divisions within species)?

The software version I use is chewBBACA 2.8.5
Here are all dependencies content in my conda environment:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
biopython                 1.80            py311hd4cff14_0    conda-forge
blast                     2.13.0               hf3cf87c_0    bioconda
brotlipy                  0.7.0           py311hd4cff14_1005    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.18.1               h7f98852_0    conda-forge
ca-certificates           2022.9.24            ha878542_0    conda-forge
certifi                   2022.9.24          pyhd8ed1ab_0    conda-forge
cffi                      1.15.1          py311h409f033_2    conda-forge
charset-normalizer        2.1.1              pyhd8ed1ab_0    conda-forge
chewbbaca                 2.8.5              pyhdfd78af_0    bioconda
clustalw                  2.1                  h9f5acd7_7    bioconda
cryptography              38.0.3          py311h42a1071_0    conda-forge
curl                      7.86.0               h2283fc2_1    conda-forge
entrez-direct             16.2                 he881be0_1    bioconda
gettext                   0.21.1               h27087fc_0    conda-forge
html5lib                  1.1                pyh9f0ad1d_0    conda-forge
idna                      3.4                pyhd8ed1ab_0    conda-forge
importlib-metadata        5.0.0              pyha770c72_1    conda-forge
isodate                   0.6.1              pyhd8ed1ab_0    conda-forge
keyutils                  1.6.1                h166bdaf_0    conda-forge
krb5                      1.19.3               h08a2579_0    conda-forge
ld_impl_linux-64          2.39                 hcc3a1bd_1    conda-forge
libblas                   3.9.0           16_linux64_openblas    conda-forge
libcblas                  3.9.0           16_linux64_openblas    conda-forge
libcurl                   7.86.0               h2283fc2_1    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 12.2.0              h65d4601_19    conda-forge
libgfortran-ng            12.2.0              h69a702a_19    conda-forge
libgfortran5              12.2.0              h337968e_19    conda-forge
libgomp                   12.2.0              h65d4601_19    conda-forge
libidn2                   2.3.4                h166bdaf_0    conda-forge
liblapack                 3.9.0           16_linux64_openblas    conda-forge
libnghttp2                1.47.0               hff17c54_1    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libopenblas               0.3.21          pthreads_h78a6416_3    conda-forge
libsqlite                 3.40.0               h753d276_0    conda-forge
libssh2                   1.10.0               hf14f497_3    conda-forge
libstdcxx-ng              12.2.0              h46fd767_19    conda-forge
libunistring              0.9.10               h7f98852_0    conda-forge
libuuid                   2.32.1            h7f98852_1000    conda-forge
libzlib                   1.2.13               h166bdaf_4    conda-forge
mafft                     7.508                hec16e2b_0    bioconda
ncurses                   6.3                  h27087fc_1    conda-forge
numpy                     1.23.5          py311h7d28db0_0    conda-forge
openssl                   3.0.7                h166bdaf_0    conda-forge
pandas                    1.5.1           py311h8b32b4d_1    conda-forge
pcre                      8.45                 h9c3ff4c_0    conda-forge
perl                      5.32.1          2_h7f98852_perl5    conda-forge
perl-archive-tar          2.40            pl5321hdfd78af_0    bioconda
perl-carp                 1.38            pl5321hdfd78af_4    bioconda
perl-common-sense         3.75            pl5321hdfd78af_0    bioconda
perl-compress-raw-bzip2   2.201           pl5321h87f3376_1    bioconda
perl-compress-raw-zlib    2.105           pl5321h87f3376_0    bioconda
perl-encode               3.19            pl5321hec16e2b_1    bioconda
perl-exporter             5.72            pl5321hdfd78af_2    bioconda
perl-exporter-tiny        1.002002        pl5321hdfd78af_0    bioconda
perl-extutils-makemaker   7.64            pl5321hd8ed1ab_0    conda-forge
perl-io-compress          2.201           pl5321h87f3376_0    bioconda
perl-io-zlib              1.11            pl5321hdfd78af_0    bioconda
perl-json                 4.10            pl5321hdfd78af_0    bioconda
perl-json-xs              2.34            pl5321h9f5acd7_5    bioconda
perl-list-moreutils       0.430           pl5321hdfd78af_0    bioconda
perl-list-moreutils-xs    0.430           pl5321hec16e2b_1    bioconda
perl-parent               0.236           pl5321hdfd78af_2    bioconda
perl-pathtools            3.75            pl5321hec16e2b_3    bioconda
perl-scalar-list-utils    1.62            pl5321hec16e2b_1    bioconda
perl-types-serialiser     1.01            pl5321hdfd78af_0    bioconda
pip                       22.3.1             pyhd8ed1ab_0    conda-forge
plotly                    5.11.0             pyhd8ed1ab_0    conda-forge
prodigal                  2.6.3                hec16e2b_4    bioconda
pycparser                 2.21               pyhd8ed1ab_0    conda-forge
pyopenssl                 22.1.0             pyhd8ed1ab_0    conda-forge
pyparsing                 3.0.9              pyhd8ed1ab_0    conda-forge
pysocks                   1.7.1              pyha2e5f31_6    conda-forge
python                    3.11.0          ha86cf86_0_cpython    conda-forge
python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
python_abi                3.11                    3_cp311    conda-forge
pytz                      2022.6             pyhd8ed1ab_0    conda-forge
rdflib                    6.2.0              pyhd8ed1ab_0    conda-forge
readline                  8.1.2                h0f457ee_0    conda-forge
requests                  2.28.1             pyhd8ed1ab_1    conda-forge
scipy                     1.9.3           py311h69910c8_2    conda-forge
setuptools                65.5.1             pyhd8ed1ab_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
sparqlwrapper             2.0.0              pyha770c72_0    conda-forge
tenacity                  8.1.0              pyhd8ed1ab_0    conda-forge
tk                        8.6.12               h27826a3_0    conda-forge
tzdata                    2022f                h191b570_0    conda-forge
urllib3                   1.26.11            pyhd8ed1ab_0    conda-forge
webencodings              0.5.1                      py_1    conda-forge
wget                      1.20.3               ha35d2d1_1    conda-forge
wheel                     0.38.4             pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
zipp                      3.10.0             pyhd8ed1ab_0    conda-forge
zlib                      1.2.13               h166bdaf_4    conda-forge

This is the command I used, as in the tutorial:

chewBBACA.py CreateSchema -i genomes/complete_genomes/ -o tutorial_schema --ptf Streptococcus_agalactiae.trn --cpu 18
chewBBACA.py AlleleCall -i genomes/complete_genomes/ -g tutorial_schema/schema_seed -o results32_wgMLST --cpu 18 --ptf Streptococcus_agalactiae.trn  
chewBBACA.py RemoveGenes -i results32_wgMLST/results_20210429T125638/results_alleles.tsv -g results32_wgMLST/results_20210429T125638/paralogous_counts.tsv -o results32_wgMLST/results_20210429T125638/results_alleles_NoParalogs.tsv

This is the output log of my running Allele calling step:

chewBBACA version: 2.8.5
Authors: Mickael Silva, Pedro Cerqueira, Rafael Mamede
Github: https://github.com/B-UMMI/chewBBACA
Wiki: https://github.com/B-UMMI/chewBBACA/wiki
Tutorial: https://github.com/B-UMMI/chewBBACA_tutorial
Contacts: imm-bioinfo@medicina.ulisboa.pt

==========================
  chewBBACA - AlleleCall
==========================
Started at: 2022-11-22T21:27:10

Checking if genome files exist...
Checking if gene files exist...

Starting Prodigal at: 21:27:52-22/11/2022
Training file: Streptococcus_agalactiae.trn
 [====================] 100%
Finishing Prodigal at: 21:27:57-22/11/2022

Checking if Prodigal created all the necessary files...
All files were created.

Translating genomes...
Creating Blast databases for all genomes...

Starting Allele Calling at: 21:27:59-22/11/2022
Processing GCA-000007265-protein1197.fasta. Start 21:40:12-22/11/2022 Locus 3127 of 3128. Done 99%.
Finished Allele Calling at: 21:40:21-22/11/2022

Wrapping up the results...
##################################################
 32 genomes used for 3128 loci

 Used a BSR of: 0.6

 56488 exact matches found out of 100096

 56.43 percent of exact matches
##################################################

Writing output files...

------------------------------------------------------------------------------------------
Genome                                      EXC    INF    LNF   PLOT   NIPH    ALM    ASM
------------------------------------------------------------------------------------------
GCA_000007265.1_ASM726v1_genomic.fna       1885     0    1204     0     16      4     19
GCA_000012705.1_ASM1270v1_genomic.fna      1857     0    1242     0      6      5     18
GCA_000196055.1_ASM19605v1_genomic.fna     1830     0    1227     0     57      2     12
GCA_000299135.1_ASM29913v1_genomic.fna     1776     0    1324     0      7      4     17
GCA_000302475.2_ASM30247v2_genomic.fna     1551     0    1515     0      6      2     54
GCA_000427035.1_09mas018883_genomic.fna    1878     0    1223     0      8      6     13
GCA_000427055.1_ILRI112_genomic.fna        1667     0    1384     0     11      1     65
GCA_000427075.1_ILRI005_genomic.fna        1816     0    1272     0     15      2     23
GCA_000599965.1_ASM59996v1_genomic.fna     1518     0    1542     0      5      2     61
GCA_000636115.1_ASM63611v1_genomic.fna     1515     0    1545     0      5      2     61
GCA_000689235.1_GBCO_p1_genomic.fna        1770     0    1324     0     12      2     20
GCA_000730215.2_ASM73021v2_genomic.fna     1957     0    1142     0      9      5     15
GCA_000730255.1_ASM73025v1_genomic.fna     1803     0    1291     0     10      3     21
GCA_000782855.1_ASM78285v1_genomic.fna     1806     0    1297     0      7      5     13
GCA_000831105.1_ASM83110v1_genomic.fna     1841     0    1244     0     20      8     15
GCA_000831125.1_ASM83112v1_genomic.fna     1846     0    1244     0     21      5     12
GCA_000831145.1_ASM83114v1_genomic.fna     1858     0    1229     0     20      8     13
GCA_000967445.1_ASM96744v1_genomic.fna     1523     0    1539     0      5      2     59
GCA_001026925.1_ASM102692v1_genomic.fna    1833     0    1266     0      7      5     17
GCA_001190805.1_ASM119080v1_genomic.fna    1777     0    1322     0      7      4     18
GCA_001190825.1_ASM119082v1_genomic.fna    1765     0    1333     0      7      4     19
GCA_001190845.1_ASM119084v1_genomic.fna    1773     0    1326     0      7      4     18
GCA_001190865.1_ASM119086v1_genomic.fna    1519     0    1541     0      7      2     59
GCA_001190885.1_ASM119088v1_genomic.fna    1840     0    1250     0     10      5     23
GCA_001266635.1_ASM126663v1_genomic.fna    1782     0    1314     0     13      5     14
GCA_001275545.2_ASM127554v2_genomic.fna    1845     0    1239     0     24      5     15
GCA_001448985.1_ASM144898v1_genomic.fna    1885     0    1208     0     14      7     14
GCA_001552035.1_ASM155203v1_genomic.fna    1750     0    1340     0     12      2     24
GCA_001592385.1_ASM159238v1_genomic.fna    1852     0    1250     0      6      5     15
GCA_001592425.1_ASM159242v1_genomic.fna    1790     0    1313     0      6      5     14
GCA_001655175.1_ASM165517v1_genomic.fna    1530     0    1532     0      6      2     58
GCA_001683515.1_ASM168351v1_genomic.fna    1850     0    1253     0      6      5     14
------------------------------------------------------------------------------------------

Checking the existence of paralog genes...
Detected number of paralog loci: 20

Finished at: 2022-11-22T21:41:39
Took  14m 29s.

This is the error when I run Paralog detection:

chewBBACA version: 2.8.5
Authors: Mickael Silva, Pedro Cerqueira, Rafael Mamede
Github: https://github.com/B-UMMI/chewBBACA
Wiki: https://github.com/B-UMMI/chewBBACA/wiki
Tutorial: https://github.com/B-UMMI/chewBBACA_tutorial
Contacts: imm-bioinfo@medicina.ulisboa.pt

===========================
  chewBBACA - RemoveGenes
===========================
Started at: 2022-11-22T16:09:08

Traceback (most recent call last):
  File "/beegfs/home/wtq/miniconda3/envs/chew/bin/chewBBACA.py", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/beegfs/home/wtq/miniconda3/envs/chew/lib/python3.11/site-packages/CHEWBBACA/chewBBACA.py", line 1480, in main
    functions_info[process][1]()
  File "/beegfs/home/wtq/miniconda3/envs/chew/lib/python3.11/site-packages/CHEWBBACA/utils/process_datetime.py", line 149, in wrapper
    func(*args, **kwargs)
  File "/beegfs/home/wtq/miniconda3/envs/chew/lib/python3.11/site-packages/CHEWBBACA/chewBBACA.py", line 810, in remove_genes
    RemoveGenes.main(**vars(args))
  File "/beegfs/home/wtq/miniconda3/envs/chew/lib/python3.11/site-packages/CHEWBBACA/utils/RemoveGenes.py", line 52, in main
    with open(genes_list, 'r') as infile:
         ^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'results32_wgMLST/results_20221122T160319/paralogous_counts.tsv'

Looking forward to your suggestions and responses and thank you for it.
PS: My English is not particularly good, so apologies for the relatively long-winded problem description.

Hello @luobosi,

Thank you for your interest in chewBBACA. The documentation and the tutorial files are being updated for the release of version 3 and some parts might not match what's created by version 2.8.5. I apologize for the inconsistencies.
chewBBACA v2.8.5 writes the list of paralogous loci to the RepeatedLoci.txt file. You can provide that file to the RemoveGenes module to remove the list/columns of paralogous loci from the allele call results. The subsequent analyses will complete successfully if you do not remove the paralogous loci, but it is advisable to remove those loci to avoid some ambiguous classifications that might influence the downstream analysis (paralogous loci are detected through the identification of distinct coding sequences in each input genome that match multiple loci).
The documentation and the files in Zenodo will be updated soon.
Let me know if you have further questions.

Rafael

Hello @luobosi,

We've updated the documentation and the files in Zenodo. The latest tutorial dataset should include all the necessary files. The tutorial instructions have also been updated (we'll keep updating them to add more detail, but it currently includes all the main steps). Let us know if you find any issues or if you have any suggestions.

Rafael

Closing this issue due to inactivity. @luobosi, let us know if you need more information about this or other subjects.