metagenome-atlas/atlas

Error in rule get_all_modules

Closed this issue · 12 comments

  • I checked and didn't find a related issue, e.g. while typing the title
  • **I got an error in the following rule(s):**
  • I checked the log files indicated in the error message (and the cluster logs if submitted to a cluster)

Here is the relevant log output:

2023-05-13 04:39:11 Uncaught exception: Traceback (most recent call last):
  File "/projects/com_perkinsd/common/qc-antibiotics-atlas/.snakemake/scripts/tmp5iy1gxw3.DRAM_get_all_modules.py", line 58, in <module>
    module_steps_form = pd.read_csv(
  File "/projects/com_perkinsd/common/databases/conda_envs/a779e7ab5b5ee88b6a071a9705d2d44a_/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/projects/com_perkinsd/common/databases/conda_envs/a779e7ab5b5ee88b6a071a9705d2d44a_/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/projects/com_perkinsd/common/databases/conda_envs/a779e7ab5b5ee88b6a071a9705d2d44a_/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/projects/com_perkinsd/common/databases/conda_envs/a779e7ab5b5ee88b6a071a9705d2d44a_/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 605, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/projects/com_perkinsd/common/databases/conda_envs/a779e7ab5b5ee88b6a071a9705d2d44a_/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1442, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/projects/com_perkinsd/common/databases/conda_envs/a779e7ab5b5ee88b6a071a9705d2d44a_/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1735, in _make_engine
    self.handles = get_handle(
  File "/projects/com_perkinsd/common/databases/conda_envs/a779e7ab5b5ee88b6a071a9705d2d44a_/lib/python3.10/site-packages/pandas/io/common.py", line 713, in get_handle
    ioargs = _get_filepath_or_buffer(
  File "/projects/com_perkinsd/common/databases/conda_envs/a779e7ab5b5ee88b6a071a9705d2d44a_/lib/python3.10/site-packages/pandas/io/common.py", line 451, in _get_filepath_or_buffer
    raise ValueError(msg)
ValueError: Invalid file path or buffer object type: <class 'NoneType'>




**Atlas version**

Additional context

For context, I ran the pipeline on a completely unrelated dataset and got the same errors, as well as those in issues #653 and #654.

I probably have an error in my code. Could you please run atlas on the test data? It worked on my side: https://zenodo.org/record/3992790/files/test_reads.tar.gz

Could you also check your `genomes/annotations/dram/annotations.tsv`?

Here is the head of `genomes/annotations/dram/annotations.tsv`:

	gene_position	rank	strandedness	end_position	start_position	fasta	scaffold	heme_regulatory_motif_count
MAG18_MAG18_1_1	1	E	1	204	1	MAG18	MAG18_1	0
MAG18_MAG18_1_2	2	E	1	902	207	MAG18	MAG18_1	0
MAG18_MAG18_1_3	3	E	-1	3135	1042	MAG18	MAG18_1	0
MAG18_MAG18_1_4	4	E	-1	3659	3393	MAG18	MAG18_1	0
MAG18_MAG18_1_5	5	E	-1	3999	3811	MAG18	MAG18_1	0
MAG18_MAG18_10_1	1	E	1	659	30	MAG18	MAG18_10	0
MAG18_MAG18_10_2	2	E	1	1319	885	MAG18	MAG18_10	0
MAG18_MAG18_10_3	3	E	1	1996	1316	MAG18	MAG18_10	0
MAG18_MAG18_10_4	4	E	1	3796	2396	MAG18	MAG18_10	0

I received the same DRAM errors as before. However, there were no errors on the genecatalog side of things. I'm going to try re-downloading the DRAM database.

I'm seeing this on atlas v2.15.0 as well. I think it may be caused by DRAM not receiving its configuration file, which specifies all the resources it requires. I tried setting `DRAM_CONFIG_LOCATION` to `DRAM/DRAM.config` under the `database_dir` set in my atlas config file, and that bypassed the first error reported here (`ValueError: Invalid file path or buffer object type: <class 'NoneType'>`).
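For anyone who wants to try the same workaround, here is a minimal sketch. It assumes the DRAM config sits at `DRAM/DRAM.config` under the atlas `database_dir`, as described above. The helper name `dram_config_env` is hypothetical and not part of atlas or DRAM; it only builds an environment mapping with `DRAM_CONFIG_LOCATION` set.

```python
import os
from pathlib import Path


def dram_config_env(database_dir, base_env=None):
    """Return a copy of the environment with DRAM_CONFIG_LOCATION set.

    Hypothetical helper: assumes the config file is DRAM/DRAM.config
    under database_dir, which is the layout described in this thread.
    """
    config_path = Path(database_dir) / "DRAM" / "DRAM.config"
    # Fail early with a clear message instead of a NoneType error later on.
    if not config_path.is_file():
        raise FileNotFoundError(f"DRAM config not found: {config_path}")
    # Copy the environment so the caller's mapping is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["DRAM_CONFIG_LOCATION"] = str(config_path)
    return env
```

The returned mapping can be passed as `env=` to `subprocess.run` when launching atlas, or the same variable can simply be exported in the shell before running the pipeline.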

Now I instead run into

2023-05-21 08:23:55 Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2023-05-21 08:24:19 Uncaught exception: Traceback (most recent call last):
  File "/crex/proj/snic2020-5-486/nobackup/SMS-23-6668-micegut/resources/conda_envs/9f41e817c598c12d8afe52ac2a7750e1_/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3652, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 147, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 176, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7080, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7088, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'ko_id'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/crex/proj/snic2020-5-486/nobackup/SMS-23-6668-micegut/atlas/.snakemake/scripts/tmpfnhn5496.DRAM_get_all_modules.py", line 67, in <module>
    module_coverage_frame = make_module_coverage_frame(
  File "/crex/proj/snic2020-5-486/nobackup/SMS-23-6668-micegut/resources/conda_envs/9f41e817c598c12d8afe52ac2a7750e1_/lib/python3.10/site-packages/mag_annotator/summarize_genomes.py", line 340, in make_module_coverage_frame
    module_coverage_dict[group] = make_module_coverage_df(frame, module_nets)
  File "/crex/proj/snic2020-5-486/nobackup/SMS-23-6668-micegut/resources/conda_envs/9f41e817c598c12d8afe52ac2a7750e1_/lib/python3.10/site-packages/mag_annotator/summarize_genomes.py", line 319, in make_module_coverage_df
    for gene_id, ko_list in annotation_df[ko_id_name].items():
  File "/crex/proj/snic2020-5-486/nobackup/SMS-23-6668-micegut/resources/conda_envs/9f41e817c598c12d8afe52ac2a7750e1_/lib/python3.10/site-packages/pandas/core/frame.py", line 3761, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/crex/proj/snic2020-5-486/nobackup/SMS-23-6668-micegut/resources/conda_envs/9f41e817c598c12d8afe52ac2a7750e1_/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3654, in get_loc
    raise KeyError(key) from err
KeyError: 'ko_id'

I think DRAM expects a `kegg_id` or `ko_id` column in the annotation file. The DRAM log says "No KEGG source provided so distillation will be of limited use.", so I guess the missing DRAM config environment variable was causing issues upstream of get_all_modules. I'm rerunning the DRAM annotation steps to see whether KEGG IDs are included in the output.
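A quick way to check whether an existing `annotations.tsv` is affected is to look at its header. This is only a sketch using the standard library: the function name `find_ko_column` is hypothetical, and the candidate column names are the two mentioned above, not a confirmed list of what DRAM accepts.

```python
import csv


def find_ko_column(annotations_tsv, candidates=("ko_id", "kegg_id")):
    """Return the first candidate column found in the TSV header, or None.

    Hypothetical helper; the candidate names come from this thread.
    """
    with open(annotations_tsv, newline="") as handle:
        # Only the header row is needed; an empty file yields an empty header.
        header = next(csv.reader(handle, delimiter="\t"), [])
    for name in candidates:
        if name in header:
            return name
    return None
```

If this returns `None` for `genomes/annotations/dram/annotations.tsv`, as it would for the header posted earlier in this issue, the annotation step ran without a KEGG source and `get_all_modules` can be expected to fail with the `KeyError` above.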

Passing the configuration file to the DRAM_annotate, DRAM_destill and get_all_modules rules fixes the issue for me.

See PR #658

yotsa commented

I ran into this today. How do I pass the config file directly to those rules?

I merged @johnne's pull request.
@yotsa if you simply install atlas from GitHub, the problem should be fixed.

I will test it before making a conda release.

There has been no activity for some time. I hope your issue has been solved in the meantime.
This issue will automatically close soon if no further activity occurs.

Thank you for your contributions.