CAMI-challenge/CAMISIM

CAMISIM encounters an error when the --hmm_model parameter is specified for pbsim2

Closed issue · 3 comments

Hi,
I modified the script "readsimulationwrapper.py" to use pbsim2 to simulate PacBio CLR reads.

[Screenshot of the modified code in readsimulationwrapper.py]

When I set the "--sample-fastq" parameter to simulate PacBio HiFi reads, it works well. But when I set the "--hmm_model" parameter, it encounters an error.

2024-01-10 19:32:12 INFO: [MetagenomeSimulationPipeline] Anonymize Data
2024-01-10 19:32:12 INFO: [FastaAnonymizer] Shuffle and anonymize '/tmp/tmpwe1mm89g/2024.01.10_19.31.13_sample_0/reads'
2024-01-10 19:32:12 INFO: [MetadataReader 32438872918] Reading file: '/home/work/wenhai/simulate_genome_data/PanTax/long_read/30_species/sim-30species-CLR/internal/genome_locations.tsv'
2024-01-10 19:32:14 INFO: [MetadataReader 75956233560] Reading file: '/home/work/wenhai/simulate_genome_data/PanTax/long_read/30_species/sim-30species-CLR/internal/meta_data.tsv'
2024-01-10 19:32:14 INFO: [MetadataReader 68827563347] Reading file: '/tmp/tmpwe1mm89g/tmpzpt6zdww'
2024-01-10 19:32:14 ERROR: [MetagenomeSimulationPipeline] Column '1' not found! in line 117
2024-01-10 19:32:14 INFO: [MetagenomeSimulationPipeline] Metagenome simulation aborted

How can I deal with this?

Thank you in advance.

Hi, without knowing exactly what you changed, this is very hard to tell. Is the screenshot above the only place in the code you changed? Unfortunately, the PBsim version for which CAMISIM was implemented is quite old (not even PBsim2), and PBsim's output may have changed in the meantime, so the subsequent scripts are likely to fail. The error you encounter stems from the anonymisation step, but my guess is that something is already wrong with the data before that (e.g. there is no data in /tmp/tmpwe1mm89g/2024.01.10_19.31.13_sample_0/reads).
You can try running CAMISIM with the -debug option to get more details on why CAMISIM crashed, but unfortunately changing only the snippet above is unlikely to be enough for PBsim2.
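
The guess that the reads directory holds no data can be checked directly while the temporary folder still exists. Below is a minimal sketch, not part of CAMISIM: the helper names and the list of file extensions are assumptions made for illustration.

```python
from pathlib import Path


def summarize_read_files(reads_dir, extensions=(".fq", ".fastq")):
    """Return {file name: size in bytes} for read files found directly in reads_dir.

    The extensions are an assumption; adjust them to whatever the simulator writes.
    A missing directory yields an empty dictionary.
    """
    directory = Path(reads_dir)
    if not directory.is_dir():
        return {}
    return {
        path.name: path.stat().st_size
        for path in sorted(directory.iterdir())
        if path.is_file() and path.suffix in extensions
    }


def has_simulated_reads(reads_dir):
    """Return True if at least one read file in reads_dir is non-empty."""
    return any(size > 0 for size in summarize_read_files(reads_dir).values())
```

Calling `has_simulated_reads("/tmp/tmpwe1mm89g/2024.01.10_19.31.13_sample_0/reads")` would then indicate whether the read simulator produced anything for the anonymisation step to work on.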

Yes, that is the only place in the code I changed. I ran CAMISIM with -debug, and it returned the following traceback.

2024-01-11 09:56:36 INFO: [MetagenomeSimulationPipeline] Anonymize Data
2024-01-11 09:56:36 DEBUG: [MetagenomeSimulationPipeline] /tmp/tmps6g1lcba/tmpvchkrqdh
2024-01-11 09:56:36 INFO: [FastaAnonymizer] Shuffle and anonymize '/tmp/tmps6g1lcba/2024.01.11_09.55.39_sample_0/reads'
2024-01-11 09:56:36 DEBUG: [FastaAnonymizer] get_seeded_random() { seed="$1"; openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt < /dev/zero 2>/dev/null; }; python3 '/home/wenhai/application/camisim/CAMISIM/fastastreamer.py' -input '/tmp/tmps6g1lcba/2024.01.11_09.55.39_sample_0/reads' -format 'fastq' -ext 'fq' -s | shuf -z --random-source=<(get_seeded_random 1890693497027095203) | tr -d '\000' | python3 '/home/wenhai/application/camisim/CAMISIM/anonymizer.py' -prefix 'S0R' -format 'fastq' -map '/tmp/tmps6g1lcba/tmp4j_w4a4u' -out '/tmp/tmps6g1lcba/tmpvic263sa' -s
2024-01-11 09:56:37 INFO: [MetadataReader 32438872918] Reading file: '/home/work/wenhai/simulate_genome_data/PanTax/long_read/30_species/sim-30species-CLR/internal/genome_locations.tsv'
2024-01-11 09:56:38 INFO: [MetadataReader 75956233560] Reading file: '/home/work/wenhai/simulate_genome_data/PanTax/long_read/30_species/sim-30species-CLR/internal/meta_data.tsv'
2024-01-11 09:56:38 INFO: [MetadataReader 68827563347] Reading file: '/tmp/tmps6g1lcba/tmp4j_w4a4u'
2024-01-11 09:56:38 DEBUG: [MetagenomeSimulationPipeline] 
Traceback (most recent call last):
  File "/home/wenhai/application/camisim/CAMISIM/metagenomesimulation.py", line 117, in run_pipeline
    self._anonymize_data(list_of_output_gsa, file_path_output_gsa_pooled)
  File "/home/wenhai/application/camisim/CAMISIM/metagenomesimulation.py", line 650, in _anonymize_data
    gs_mapping.gs_read_mapping(
  File "/home/wenhai/application/camisim/CAMISIM/scripts/GoldStandardFileFormat/goldstandardfileformat.py", line 368, in gs_read_mapping
    dict_anonymous_to_read_id = self.get_dict_anonymous_to_original_id(file_path_id_map)
  File "/home/wenhai/application/camisim/CAMISIM/scripts/GoldStandardFileFormat/goldstandardfileformat.py", line 142, in get_dict_anonymous_to_original_id
    dict_mapping = table.get_map(1, 0)
  File "/home/wenhai/application/camisim/CAMISIM/scripts/MetaDataTable/metadatatable.py", line 612, in get_map
    assert self.has_column(key_column_name), "Column '{}' not found!".format(key_column_name)
AssertionError: Column '1' not found!


2024-01-11 09:56:38 ERROR: [MetagenomeSimulationPipeline] Column '1' not found! in line 117
2024-01-11 09:56:38 INFO: [MetagenomeSimulationPipeline] Metagenome simulation aborted
2024-01-11 09:56:38 INFO: [MetagenomeSimulationPipeline] Temporary data stored at:
/tmp/tmps6g1lcba

Adding PBSim2 to CAMISIM is on my todo list (#131), but it has not been completed yet. Your error most likely occurs because of changes in PBSim2 compared to PBSim1, probably in its output (e.g. file names or formats). Unfortunately, just changing the command-line options and the executable will not suffice to integrate PBSim2 into CAMISIM.
My assumption is that there are probably more errors "hidden" in the log; for example, PBSim2 may not have simulated any reads at all, leaving CAMISIM with no files to anonymize. If you post the complete log, I might be able to confirm this.
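
The traceback fails in `table.get_map(1, 0)` while reading the ID map written by the anonymizer (its `-map` argument), which would be consistent with a map file that is empty or has fewer than two columns. A small sketch for inspecting such a file follows; it assumes the map is tab-separated, and the function names are hypothetical rather than part of CAMISIM.

```python
def column_counts(file_path, separator="\t"):
    """Return the set of column counts seen over the non-empty lines of a text file."""
    counts = set()
    with open(file_path, "r") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                counts.add(len(line.split(separator)))
    return counts


def describe_id_map(file_path):
    """Classify an ID map file as 'empty', 'malformed' or 'ok' (assumes two columns per line)."""
    counts = column_counts(file_path)
    if not counts:
        return "empty"      # nothing was anonymized, so no column 1 can exist
    if min(counts) < 2:
        return "malformed"  # at least one line lacks a second column
    return "ok"
```

Since the debug run keeps its temporary data, `describe_id_map("/tmp/tmps6g1lcba/tmp4j_w4a4u")` could be run on the map file named in the log. A result of "empty" would support the idea that no reads were simulated.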
Note, however, that while I plan to implement PBSim2, I cannot give a concrete timeline for when this will happen. If the error is not easily fixed and you don't want to wait, you would have to implement PBSim2 yourself (i.e. create a new PBSim2 class inheriting from ReadSimulationWrapper and implement the required methods). I would love a PR for this.
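
As a rough illustration of what such a subclass could look like, here is a sketch only. The `ReadSimulationWrapper` below is a stand-in defined for this example, not CAMISIM's real base class, and the class name `ReadSimulationPBsim2` and method `build_command` are hypothetical. The only PBSIM2 options used are `--hmm_model` (from this issue), `--depth` and `--prefix`.

```python
class ReadSimulationWrapper:
    """Stand-in base class for this sketch; CAMISIM's real class has a larger interface."""

    def __init__(self, file_path_executable):
        self._file_path_executable = file_path_executable


class ReadSimulationPBsim2(ReadSimulationWrapper):
    """Hypothetical wrapper that builds a PBSIM2 command line for a single genome."""

    def __init__(self, file_path_executable, file_path_hmm_model):
        super().__init__(file_path_executable)
        self._file_path_hmm_model = file_path_hmm_model

    def build_command(self, file_path_genome, depth, output_prefix):
        """Return the argument list, suitable for subprocess.run(..., check=True)."""
        return [
            self._file_path_executable,
            "--depth", str(depth),
            "--hmm_model", self._file_path_hmm_model,
            "--prefix", output_prefix,
            file_path_genome,
        ]
```

Building the command is only one piece. A real implementation would also have to locate and convert whatever files PBSim2 writes so that the later anonymisation and gold-standard steps can read them.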