B-UMMI/chewBBACA

Error when calling chewBBACA AlleleCall with --cds-input flag

TorHou opened this issue · 4 comments

This was only tested on version 3.1.0.
As the title suggests, when I want to skip the gene prediction with the --cds-input flag I run into this reproducible error.

Traceback (most recent call last):
  File "/home/houwaartt/.conda/envs/chewbacca/bin/chewBBACA.py", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/houwaartt/.conda/envs/chewbacca/lib/python3.11/site-packages/CHEWBBACA/chewBBACA.py", line 1584, in main
    functions_info[process][1]()
  File "/home/houwaartt/.conda/envs/chewbacca/lib/python3.11/site-packages/CHEWBBACA/utils/process_datetime.py", line 146, in wrapper
    func(*args, **kwargs)
  File "/home/houwaartt/.conda/envs/chewbacca/lib/python3.11/site-packages/CHEWBBACA/chewBBACA.py", line 514, in allele_call
    AlleleCall.main(genome_list, loci_list, args.schema_directory,
  File "/home/houwaartt/.conda/envs/chewbacca/lib/python3.11/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 2671, in main
    results = allele_calling(input_files, schema_directory, temp_directory,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/houwaartt/.conda/envs/chewbacca/lib/python3.11/site-packages/CHEWBBACA/AlleleCall/AlleleCall.py", line 2004, in allele_calling
    dna_dedup_results = cf.exclude_duplicates(cds_files, dna_dedup_dir,
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/houwaartt/.conda/envs/chewbacca/lib/python3.11/site-packages/CHEWBBACA/utils/core_functions.py", line 245, in exclude_duplicates
    cds_file = fo.concatenate_files(dedup_files, cds_file)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/houwaartt/.conda/envs/chewbacca/lib/python3.11/site-packages/CHEWBBACA/utils/file_operations.py", line 390, in concatenate_files
    with open(file, 'r') as infile:
         ^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'chewie_res_cds/results_20230309T133842/temp/3_cds_preprocess/cds_deduplication/distinct_cds_1.fasta'

I've tracked it down to this issue:
The program relies on a certain format in the fasta-ids. sequence_manipulation.py expects the ids to be <int>-protein<int>.

You get this format when the preceding Prodigal step is performed. When you leave it out with --cds, the fasta records can have very different id formats and the determine_distinct() function silently fails.

Greetings @TorHou,

Thank you for reporting this issue. The --cds-input option is indeed only working correctly if the headers of the sequences include protein and the unique genome identifier (obtained by splitting the basename of the input file based on . and keeping the first element from the returned list). It should work regardless of the sequence header format.
I've started working on a fix for this issue. chewBBACA will determine the unique genome identifier for each input/genome and rename the sequence headers to match the format chewBBACA expects, <genomeID>-protein<cdsID>. Suppose a file is named GCF_000007125.1_ASM712v1_cds_from_genomic.fna. In that case, its unique genome identifier will be GCF_000007125, and the sequence headers will be renamed to GCF_000007125-protein1, GCF_000007125-protein2, ..., GCF_000007125-proteinN (FASTA files with renamed headers will be stored in the temporary directory created by chewBBACA). The sequence headers will be renamed sequentially, so the integer after protein will indicate the order of the sequences in the original FASTA file. The fix will be included in release v3.1.2 and will be applied to the CreateSchema and AlleleCall modules. I'll let you know when it's available (I expect it to be soon).

Best regards,

Rafael

Hello @TorHou,

The latest release, chewBBACA v3.1.2, includes a fix for the issue you reported. You can install it through pip (awaiting approval in Bioconda).
Please let us know if you test it and if it solves the issue.

Rafael

It resolved the issue. Thank you very much!