stanford-crfm/helm

Breaking change in nltk 3.8.2

yifanmai opened this issue · 1 comment

Upstream issue: nltk/nltk#3293

We can work around this by pinning nltk to version 3.8.1.
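
For example, with a requirements-style pin (a sketch; the exact file and version operator used in HELM's dependency spec are assumptions here):

nltk==3.8.1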

The breaking change causes errors in the tests, such as:

E       LookupError: 
E       **********************************************************************
E         Resource punkt_tab not found.
E         Please use the NLTK Downloader to obtain the resource:
E       
E         >>> import nltk
E         >>> nltk.download('punkt_tab')
E         
E         For more information see: https://www.nltk.org/data.html
E       
E         Attempted to load tokenizers/punkt_tab/english/
E       
E         Searched in:
E           - '/home/runner/nltk_data'
E           - '/opt/hostedtoolcache/Python/3.9.19/x64/nltk_data'
E           - '/opt/hostedtoolcache/Python/3.9.19/x64/share/nltk_data'
E           - '/opt/hostedtoolcache/Python/3.9.19/x64/lib/nltk_data'
E           - '/usr/share/nltk_data'
E           - '/usr/local/share/nltk_data'
E           - '/usr/lib/nltk_data'
E           - '/usr/local/lib/nltk_data'
E           - 'benchmark_output/perturbations/synonym'
E       **********************************************************************
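
Alternatively, the missing resource can be downloaded once per environment, as the error message itself suggests. A minimal sketch (where HELM would best hook this in is not addressed here):

import nltk

# nltk 3.8.2 switched sentence tokenization to the new punkt_tab
# resource, so it must be fetched explicitly before word_tokenize runs.
nltk.download("punkt_tab")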

Example stack trace:

src/helm/benchmark/metrics/test_bias_metrics.py:16: in check_test_cases
    bias_score = bias_func(test_case.texts)
src/helm/benchmark/metrics/bias_metrics.py:157: in evaluate_stereotypical_associations
    tokens = word_tokenize(text.lower())
/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/nltk/tokenize/__init__.py:129: in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/nltk/tokenize/__init__.py:106: in sent_tokenize
    tokenizer = PunktTokenizer(language)
/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/nltk/tokenize/punkt.py:1744: in __init__
    self.load_lang(lang)
/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/nltk/tokenize/punkt.py:1749: in load_lang
    lang_dir = find(f"tokenizers/punkt_tab/{lang}/")
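
A minimal reproduction (a sketch mirroring the word_tokenize call in bias_metrics.py above; any input text triggers it under nltk 3.8.2 when the punkt_tab data is absent):

from nltk.tokenize import word_tokenize

# Under 3.8.2, word_tokenize calls sent_tokenize, which constructs a
# PunktTokenizer that loads tokenizers/punkt_tab/<lang>/ and raises the
# LookupError above when the resource is missing, even if the old
# pickled "punkt" models are installed.
tokens = word_tokenize("Some text.".lower())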

Fixed by #3070.