bool category not properly parsed
sjanssen2 opened this issue · 4 comments
Assume I have a metadata category like infection with values TRUE or FALSE. If I load these data as in your example, metadata = pd.read_table("data/metadata.tsv", sep="\t", index_col=0), they are of type object and proper boolean values, i.e. True and False. If I add dtype=str, the values are still of type object but are strings, namely 'TRUE' and 'FALSE'.
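A minimal sketch of the two parsing behaviours described above, using a hypothetical two-row table held in memory rather than the actual metadata.tsv. The sample IDs and the use of pd.read_csv with sep="\t" are assumptions for illustration. With no missing values, pandas typically infers a bool dtype here; the object dtype reported above may depend on details of the real file.

```python
import io

import pandas as pd

# Hypothetical tab-separated metadata, NOT the reporter's real file.
tsv = "sample_id\tinfection\ns1\tTRUE\ns2\tFALSE\n"

# Default parsing: pandas recognises TRUE/FALSE and converts them to booleans.
parsed = pd.read_csv(io.StringIO(tsv), sep="\t", index_col=0)

# dtype=str keeps the literal text from the file.
as_text = pd.read_csv(io.StringIO(tsv), sep="\t", index_col=0, dtype=str)

print(parsed["infection"].tolist())   # boolean values
print(as_text["infection"].tolist())  # string values
```

Comparing the two lists makes it easy to see which representation a downstream dtype check will receive.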
Only the dtype=str way works for me. Otherwise evident throws the error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3628 try:
-> 3629 return self._engine.get_loc(casted_key)
3630 except KeyError as err:
~/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
~/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'infection'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
/tmp/ipykernel_1855806/1849832882.py in <module>
1 for cat in ["birth_timestamp","cage","genotype","infection"]:
----> 2 print(adh.calculate_effect_size(column=cat))
~/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/evident/data_handler.py in calculate_effect_size(self, column, difference)
112 :rtype: evident.results.EffectSizeResult
113 """
--> 114 if self.metadata[column].dtype != np.dtype("object"):
115 raise exc.NonCategoricalColumnError(self.metadata[column])
116
~/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/pandas/core/frame.py in __getitem__(self, key)
3503 if self.columns.nlevels > 1:
3504 return self._getitem_multilevel(key)
-> 3505 indexer = self.columns.get_loc(key)
3506 if is_integer(indexer):
3507 indexer = [indexer]
~/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3629 return self._engine.get_loc(casted_key)
3630 except KeyError as err:
-> 3631 raise KeyError(key) from err
3632 except TypeError:
3633 # If we have a listlike key, _check_indexing_error will raise
KeyError: 'infection'
You might want to return a more explicit error message in those cases.
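One way such an explicit message could look, as a sketch only: require_column and its dropped parameter are hypothetical names for illustration and are not part of evident's API. The idea is to check for the column before indexing, so the user sees why it is unavailable instead of a bare KeyError from pandas.

```python
import pandas as pd


def require_column(metadata: pd.DataFrame, column: str, dropped=()):
    """Return metadata[column], or raise a KeyError that explains why it is missing.

    Hypothetical helper: `dropped` is assumed to list columns removed at load time.
    """
    if column in metadata.columns:
        return metadata[column]
    if column in dropped:
        reason = "it was dropped while loading the metadata"
    else:
        reason = "it is not present in the metadata"
    raise KeyError(f"Cannot use column {column!r}: {reason}.")


# Usage with a toy table in which 'infection' was dropped earlier.
toy = pd.DataFrame({"cage": ["a", "b"]}, index=["s1", "s2"])
try:
    require_column(toy, "infection", dropped=["infection"])
except KeyError as err:
    print(err)
```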
The same seems to be the case for dates.
The same issue might also affect the Bokeh server?!
(qiime2-2022.8) t490s x86_64 /media/jlu/vol/jlab/MicrobiomeAnalyses/Projects/Pandyra_LCMV>bokeh serve --show app
2023-01-09 17:21:02,339 Starting Bokeh server version 2.4.3 (running on Tornado 6.2)
2023-01-09 17:21:02,342 User authentication hooks NOT provided (default user enabled)
2023-01-09 17:21:02,349 Bokeh app running at: http://localhost:5006/app
2023-01-09 17:21:02,350 Starting Bokeh server with process id: 25929
/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/evident/data_handler.py:72: UserWarning: Some categories have been dropped because they had either only one level or too many. Use the max_levels_per_category argument to modify this threshold.
Dropped columns: ['birth_timestamp', 'host_age', 'infection', 'mouse_number']
warn(
2023-01-09 17:21:04,054 Error running application handler <bokeh.application.handlers.directory.DirectoryHandler object at 0x7fdca1155dc0>: 'infection'
File 'base.py', line 3631, in get_loc:
raise KeyError(key) from err Traceback (most recent call last):
File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3629, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'infection'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/bokeh/application/handlers/code_runner.py", line 231, in run
exec(self._code, module.__dict__)
File "/media/jlu/vol/jlab/MicrobiomeAnalyses/Projects/Pandyra_LCMV/app/main.py", line 48, in <module>
effect_size_by_category(dh, binary_cols)
File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/evident/effect_size.py", line 49, in effect_size_by_category
results = Parallel(n_jobs=n_jobs, **parallel_args)(
File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/joblib/parallel.py", line 1046, in __call__
while self.dispatch_one_batch(iterator):
File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/joblib/parallel.py", line 861, in dispatch_one_batch
self._dispatch(tasks)
File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/joblib/parallel.py", line 779, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
result = ImmediateResult(func)
File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 572, in __init__
self.results = batch()
File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/joblib/parallel.py", line 262, in __call__
return [func(*args, **kwargs)
File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/joblib/parallel.py", line 262, in <listcomp>
return [func(*args, **kwargs)
File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/evident/data_handler.py", line 114, in calculate_effect_size
if self.metadata[column].dtype != np.dtype("object"):
File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/pandas/core/frame.py", line 3505, in __getitem__
indexer = self.columns.get_loc(key)
File "/home/sjanssen/miniconda3/envs/qiime2-2022.8/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3631, in get_loc
raise KeyError(key) from err
KeyError: 'infection'
2023-01-09 17:21:04,487 WebSocket connection opened
2023-01-09 17:21:04,487 ServerConnection created
^C
Interrupted, shutting down
Thanks for bringing this up. I'm not sure how to handle dates, but for booleans I think we can just allow bool dtype columns.
@sjanssen2 Can you try out this change and see if it resolves your boolean issue?
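For readers following along, allowing bool dtype columns could be sketched roughly as below. This is an illustration under assumptions, not the linked change itself: is_categorical_like is a hypothetical name, and dates are deliberately left unhandled since that question is still open above.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_object_dtype


def is_categorical_like(series: pd.Series) -> bool:
    """Hypothetical check: accept object columns and, additionally, bool columns."""
    return is_object_dtype(series) or is_bool_dtype(series)


# A bool column now passes, while a numeric column is still rejected.
print(is_categorical_like(pd.Series([True, False])))
print(is_categorical_like(pd.Series([1.5, 2.5])))
```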