neulab/ExplainaBoard

Saving `is_training_set_available` in `sys_info` during `get_overall_statistics()`


Although this issue surfaces in the web interface, I'm writing it here as it's mainly SDK-related.

Problem

In the web interface:

[0]   File "/Users/oscar/opt/anaconda3/envs/exb/lib/python3.9/site-packages/explainaboard/processors/processor.py", line 252, in perform_analyses
[0]     my_analysis.perform(
[0]   File "/Users/oscar/opt/anaconda3/envs/exb/lib/python3.9/site-packages/explainaboard/analysis/analyses.py", line 191, in perform
[0]     raise RuntimeError(f"bucket analysis: feature {self.feature} not found.")

In SDK:

The function _gen_cases_and_stats() in conditional_generation.py (called from get_overall_statistics() in processor.py) skips saving example-level features that have require_training_set=True. However, the names of these skipped features are still recorded in sys_info.analysis_levels[0].

As a result, perform() in BucketAnalysis (analyses.py) tries to look these features up in the actual cases, cannot find them (since they were skipped), and raises the error above.
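
To illustrate the mismatch, here is a minimal standalone sketch (the feature names and data are hypothetical, not the actual ExplainaBoard internals; only the error message is taken from the traceback above):

# Features declared in sys_info.analysis_levels[0].
declared_features = ["length", "train_freq_rank"]

# _gen_cases_and_stats() skipped "train_freq_rank" (require_training_set=True),
# so the generated cases only carry the remaining features.
case_features = {"length": 12}

for feature in declared_features:
    if feature not in case_features:
        # Mirrors the error raised by BucketAnalysis.perform().
        raise RuntimeError(f"bucket analysis: feature {feature} not found.")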

Quick fix

Pass skip_failed_analyses=True to Processor.process.
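
For example (a minimal sketch; processor, metadata, and system_output are placeholders for whatever you already build in your pipeline, and only the keyword argument is the change):

# With skip_failed_analyses=True, analyses that raise (here, the bucket
# analyses over the skipped training-set features) are ignored instead
# of aborting the whole run.
sys_info = processor.process(metadata, system_output, skip_failed_analyses=True)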

Long-term solution

Following up on #410, we should save a flag like is_training_set_available in sys_info. If set to false, we should skip the require_training_set=True features during bucket analysis.
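
A rough sketch of what that check could look like (hypothetical; the flag and attribute names come from this issue, but their exact locations in the real codebase may differ):

# Hypothetical helper for the proposed long-term fix.
def should_run_bucket_analysis(feature, sys_info) -> bool:
    """Skip features that need the training set when none is available."""
    if getattr(feature, "require_training_set", False):
        return getattr(sys_info, "is_training_set_available", True)
    return True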

@OscarWang114 Thanks for reporting the issue!

First, could skip_failed_analyses=True in Processor.process be a quick fix, or does it not satisfy your use case?

I also agree with having more specific control around feature groups (in this case, whether a feature requires the training set). Also, is the flag name in the issue, is_trainint_set, a typo for is_training_set_available?

@odashi Thanks! Yes, skip_failed_analyses=True is a valid quick fix; I updated the issue description. And thanks for catching the typo (also updated).