[BUG] `KeyError: 'reference_database_name'` when running summarize

Question

Closed this issue 2 years ago · 9 comments

Describe the bug

I get the following error when running hamronize summarize:

Warning: <_io.TextIOWrapper name='WAL001-megahit.mapping.potential.ARG.deeparg.json' mode='r' encoding='UTF-8'> report is empty
Traceback (most recent call last):
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/bin/hamronize", line 8, in <module>
    sys.exit(main())
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/hAMRonization/hamronize.py", line 7, in main
    hAMRonization.Interfaces.generic_cli_interface()
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/hAMRonization/Interfaces.py", line 299, in generic_cli_interface
    hAMRonization.summarize.summarize_reports(
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/hAMRonization/summarize.py", line 752, in summarize_reports
    combined_reports = combined_reports.sort_values(
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/pandas/util/_decorators.py", line 317, in wrapper
    return func(*args, **kwargs)
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/pandas/core/frame.py", line 6886, in sort_values
    keys = [self._get_label_or_level_values(x, axis=axis) for x in by]
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/pandas/core/frame.py", line 6886, in <listcomp>
    keys = [self._get_label_or_level_values(x, axis=axis) for x in by]
  File "/home/jfellows/ccdata/users/JFellows/bin/miniconda3/envs/hamronization/lib/python3.10/site-packages/pandas/core/generic.py", line 1849, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'reference_database_name'

Input

hamronize \
    summarize \
    <huge_list_of_jsons> \
    -t interactive \
    -o hamronization_combined_report.html

Input file
I can send a zip of the entire input set privately if necessary (it includes unpublished data)

Error log
See above

hAMRonization Version
1.1.0

Desktop (please complete the following information):

OS: SUSE Linux Enterprise High Performance Computing 15 SP1
Version: hAMRonization 1.1.0

Answer 1 · 2022-09-26T10:27:43.000Z

Ah damn, sorry, I thought I'd solved that issue/covered it with tests. I thought the concatenation should be adding those fields but I'll try initialising the empty combined dataframe earlier.

Does the same error occur if you run hamronize summarize on just WAL001-megahit.mapping.potential.ARG.deeparg.json or only with the big list of jsons?

If the former, could you send me just that output file (to finlay.maguire@dal.ca), and if the latter, the big ole zip?

Answer 2 · 2022-09-26T10:51:52.000Z

Yes, it also occurs with only WAL001-megahit, but strangely it seems to happen in all cases, e.g. with VLC009-metaspades.mapping.potential.ARG.deeparg.json, which does have hits.

I'll send you the zip and you can test everything. Happy to also test any dev versions!

Answer 3 · 2022-09-26T12:42:04.000Z

This seems to work now but please test in your workflow. Instead of trying to add additional columns if needed post-concatenation, I now just initialise an empty dataframe with all the headers in summarize before concatenating.
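A minimal sketch of the idea described above, not the actual hAMRonization code: the column list and the `combine` helper are hypothetical, and the only assumption is that each report is parsed into a list of records before being concatenated with pandas. Without a header-carrying frame, a set of empty reports yields a frame with no columns, so sorting on `reference_database_name` raises `KeyError`; seeding the concatenation with an empty frame that declares the columns avoids that.

```python
import pandas as pd

# Hypothetical subset of report columns, for illustration only.
COLUMNS = ["input_file_name", "gene_symbol", "reference_database_name"]


def combine(reports, with_header_frame):
    """Concatenate parsed reports (lists of dicts) and sort the result."""
    frames = [pd.DataFrame(records) for records in reports]
    if with_header_frame:
        # Seed the concatenation with an empty frame that carries every column.
        frames.insert(0, pd.DataFrame(columns=COLUMNS))
    combined = pd.concat(frames, ignore_index=True)
    return combined.sort_values(by=["reference_database_name"])


empty_reports = [[], []]  # two reports with no hits

try:
    combine(empty_reports, with_header_frame=False)
    raised_key_error = False
except KeyError:
    # The combined frame has no columns, so the sort key is missing.
    raised_key_error = True

# With the header frame, the sort key exists even though there are no rows.
fixed = combine(empty_reports, with_header_frame=True)
```

Here `raised_key_error` ends up `True` for the unseeded case, while `fixed` is an empty frame that still has all of the declared columns.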

One question, to make sure I haven't failed to fix another issue: these input JSONs to summarize were cached and not regenerated with hAMRonization v1.1.0, right? hAMRonization should now generate valid empty JSONs (i.e., files containing just []) when parsing empty tool reports, but I see these files still have the ] malformation.
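To illustrate the difference between a valid empty report and the ] malformation, here is a rough sketch using only the standard library. The helper name `load_report_records` is hypothetical and not part of hAMRonization; it only assumes that a report is a JSON array of records.

```python
import json


def load_report_records(text):
    """Return the list of records in a JSON report, or None if the text is not a JSON array."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # A lone "]" (or any other malformed text) ends up here.
        return None
    return data if isinstance(data, list) else None


valid_empty = load_report_records("[]")  # parses to an empty list
malformed = load_report_records("]")     # not valid JSON, so None
```

A check like this could be run over cached JSONs before passing them to summarize, to spot files produced by an older version.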

Answer 4 · 2022-09-26T12:49:25.000Z

OK! I will test this :)
Ah yes, correct, sorry: the JSONs in the ZIP were still from 1.0.3. The pipeline took a few days to run, so I didn't want to run the whole thing again with 1.1.0 just to hit the same or a different summarize issue 😅. I can take a few and regenerate them with 1.1.0 to double-check now, though.

Answer 5 · 2022-09-26T13:08:55.000Z

@fmaguire I can confirm 761fe77 fixes the bug, and that re-running e.g. hamronize deeparg on an 'empty' output file produces the correct empty JSON of [].

Once this version is released on bioconda (I sped that up for 1.1.0 this morning btw 😬 ), I will update our nf-core nextflow module and re-run the full pipeline again and let you know how well it performs.

This issue can be closed now!

Answer 6 · 2022-09-26T13:19:25.000Z

oh and thanks for the quick turnaround :D

Answer 7 · 2022-09-26T13:39:26.000Z

Great! Thanks for your patience!

It should already be updating automatically on PyPI and Docker Hub, and (pending the attentiveness of the bioconda bot) should be updated on bioconda at some point today.

Answer 8 · 2022-09-26T13:40:10.000Z

(although it does seem the badges on the README aren't updating for some reason...)

Answer 9 · 2022-09-26T16:53:54.000Z

Updated in bioconda now: bioconda/bioconda-recipes#37140