biocore/evident

bool category not properly parsed

sjanssen2 opened this issue · 4 comments

Assume I have a metadata category like infection with values TRUE or FALSE. If I load these data as in your example, metadata = pd.read_table("data/metadata.tsv", sep="\t", index_col=0), they are of type object and contain proper boolean values, i.e. True and False. If I add dtype=str, the values are still of type object but are strings, namely 'TRUE' and 'FALSE'.
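
A minimal sketch to reproduce the two loading modes, assuming a toy in-memory table (the sample IDs and values are made up, not taken from the real metadata file). It only shows how to inspect what pandas infers. The reported dtype can differ with the pandas version and with missing values in the column, so check it on the real file.

```python
import io

import pandas as pd

# Toy stand-in for data/metadata.tsv (hypothetical content).
tsv = "sample\tinfection\ns1\tTRUE\ns2\tFALSE\ns3\tTRUE\n"

# Default loading: pandas parses TRUE/FALSE into Python-style booleans.
default = pd.read_table(io.StringIO(tsv), sep="\t", index_col=0)

# Loading with dtype=str: the raw text 'TRUE'/'FALSE' is kept as strings.
as_str = pd.read_table(io.StringIO(tsv), sep="\t", index_col=0, dtype=str)

for name, frame in [("default", default), ("dtype=str", as_str)]:
    column = frame["infection"]
    print(name, column.dtype, column.tolist())
```

Comparing the printed dtypes against the dtype check visible in the traceback (data_handler.py, line 114) should show which loading mode passes it.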

Only the dtype=str way works for me. Otherwise evident throws the error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3628             try:
-> 3629                 return self._engine.get_loc(casted_key)
   3630             except KeyError as err:

~/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

~/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'infection'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_1855806/1849832882.py in <module>
      1 for cat in ["birth_timestamp","cage","genotype","infection"]:
----> 2     print(adh.calculate_effect_size(column=cat))

~/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/evident/data_handler.py in calculate_effect_size(self, column, difference)
    112         :rtype: evident.results.EffectSizeResult
    113         """
--> 114         if self.metadata[column].dtype != np.dtype("object"):
    115             raise exc.NonCategoricalColumnError(self.metadata[column])
    116 

~/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3503             if self.columns.nlevels > 1:
   3504                 return self._getitem_multilevel(key)
-> 3505             indexer = self.columns.get_loc(key)
   3506             if is_integer(indexer):
   3507                 indexer = [indexer]

~/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3629                 return self._engine.get_loc(casted_key)
   3630             except KeyError as err:
-> 3631                 raise KeyError(key) from err
   3632             except TypeError:
   3633                 # If we have a listlike key, _check_indexing_error will raise

KeyError: 'infection'

You might want to return a more explicit error message in those cases.
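
As a rough sketch of what a more explicit message could look like (get_column is a hypothetical helper, not part of evident, and the wording of the message is only a suggestion):

```python
import pandas as pd


def get_column(metadata: pd.DataFrame, column: str) -> pd.Series:
    """Return a metadata column, or fail with a descriptive message.

    Hypothetical helper for illustration; not evident's actual API.
    """
    if column not in metadata.columns:
        available = ", ".join(map(str, metadata.columns))
        raise KeyError(
            f"Column '{column}' is not available. It may have been dropped "
            f"while filtering the metadata. Available columns: {available}"
        )
    return metadata[column]


# Usage: asking for a column that is absent reports what is still there.
metadata = pd.DataFrame({"cage": ["a", "b"], "genotype": ["wt", "ko"]})
try:
    get_column(metadata, "infection")
except KeyError as err:
    print(err)
```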

The same seems to be the case for dates.

The same issue might also affect the Bokeh server?!

(qiime2-2022.8) t490s x86_64 /media/jlu/vol/jlab/MicrobiomeAnalyses/Projects/Pandyra_LCMV>bokeh serve --show app
2023-01-09 17:21:02,339 Starting Bokeh server version 2.4.3 (running on Tornado 6.2)
2023-01-09 17:21:02,342 User authentication hooks NOT provided (default user enabled)
2023-01-09 17:21:02,349 Bokeh app running at: http://localhost:5006/app
2023-01-09 17:21:02,350 Starting Bokeh server with process id: 25929
/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/evident/data_handler.py:72: UserWarning: Some categories have been dropped because they had either only one level or too many. Use the max_levels_per_category argument to modify this threshold.
Dropped columns: ['birth_timestamp', 'host_age', 'infection', 'mouse_number']
  warn(
2023-01-09 17:21:04,054 Error running application handler <bokeh.application.handlers.directory.DirectoryHandler object at 0x7fdca1155dc0>: 'infection'
File 'base.py', line 3631, in get_loc:
raise KeyError(key) from err Traceback (most recent call last):
  File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3629, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'infection'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/bokeh/application/handlers/code_runner.py", line 231, in run
    exec(self._code, module.__dict__)
  File "/media/jlu/vol/jlab/MicrobiomeAnalyses/Projects/Pandyra_LCMV/app/main.py", line 48, in <module>
    effect_size_by_category(dh, binary_cols)
  File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/evident/effect_size.py", line 49, in effect_size_by_category
    results = Parallel(n_jobs=n_jobs, **parallel_args)(
  File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/joblib/parallel.py", line 1046, in __call__
    while self.dispatch_one_batch(iterator):
  File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/joblib/parallel.py", line 861, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/joblib/parallel.py", line 779, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 572, in __init__
    self.results = batch()
  File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/joblib/parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/joblib/parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/evident/data_handler.py", line 114, in calculate_effect_size
    if self.metadata[column].dtype != np.dtype("object"):
  File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/pandas/core/frame.py", line 3505, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3631, in get_loc
    raise KeyError(key) from err
KeyError: 'infection'
 
2023-01-09 17:21:04,487 WebSocket connection opened
2023-01-09 17:21:04,487 ServerConnection created
^C
Interrupted, shutting down

Thanks for bringing this up. I'm not sure how to handle dates, but for booleans I think we can just allow bool dtype columns.
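
A minimal sketch of the idea, assuming the check is a plain dtype test like the one shown in the traceback (is_categorical_like is a hypothetical name, and this is not necessarily how the fix branch implements it):

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_object_dtype


def is_categorical_like(series: pd.Series) -> bool:
    """Accept object columns (e.g. strings) and also boolean columns.

    Hypothetical helper for illustration; not evident's actual code.
    """
    return is_object_dtype(series) or is_bool_dtype(series)


# Boolean and object columns pass; numeric columns are still rejected.
print(is_categorical_like(pd.Series([True, False, True])))
print(is_categorical_like(pd.Series(["TRUE", "FALSE"], dtype=object)))
print(is_categorical_like(pd.Series([0.1, 0.2])))
```

Dates would still be rejected by this check, which matches the open question about how to handle them.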

@sjanssen2 Can you try out this change and see if it resolves your boolean issue?

https://github.com/gibsramen/evident/tree/fix-bool