aurelio-labs/semantic-router

Certificate verify failed when running unit tests on MacBook for NLTK data download

Closed this issue · 4 comments

Issue

When trying to run the unit tests on my development branches, I am unable to download the NLTK datasets, which blocks me from running the tests.

Quick-fix

It seems I can download the data manually using a custom fix. Instead of:

import nltk

nltk.download("stopwords")

I can do:

import nltk
import ssl

try:
    # ssl._create_unverified_context is a private helper, so guard
    # against it being absent on older Python builds
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    # Make unverified contexts the default for HTTPS requests so the
    # NLTK downloader skips certificate verification
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download("stopwords")
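If disabling certificate verification feels too heavy-handed, a less invasive variant is to keep verification on but validate against certifi's CA bundle instead. This is only a sketch and assumes the certifi package is installed; on macOS, a Python installed from python.org also ships an "Install Certificates.command" script that addresses the same problem:

import ssl

import certifi  # assumption: certifi is installed
import nltk

# Keep verification enabled, but validate against certifi's CA bundle
# instead of the (possibly missing) system certificate store
ssl._create_default_https_context = lambda: ssl.create_default_context(
    cafile=certifi.where()
)

nltk.download("stopwords")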

Solution

This fix therefore seems relevant for this project. The issue may originate in another library, but for now it appears to be triggered when initializing the BM25Encoder model. I would expect other models that use NLTK datasets to run into the same problem.

I can draft a PR if this is relevant to the project. If not, I can open a PR against whichever repo the issue originates from.
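Independent of the SSL question, a PR could also guard the download so it only runs when the corpus is actually missing. A minimal sketch (the helper name is hypothetical):

import nltk

def ensure_stopwords() -> None:
    # nltk.data.find raises LookupError when the resource is absent
    # (as seen in the logs below), so only download in that case
    try:
        nltk.data.find("corpora/stopwords")
    except LookupError:
        nltk.download("stopwords")

This would avoid the network round-trip entirely on machines that already have the corpus, SSL issues or not.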

PyTest logs

___________________________________________________________________________________ ERROR collecting tests/unit/test_hybrid_layer.py ___________________________________________________________________________________
venv_test/lib/python3.10/site-packages/nltk/corpus/util.py:84: in __load
    root = nltk.data.find(f"{self.subdir}/{zip_name}")
venv_test/lib/python3.10/site-packages/nltk/data.py:583: in find
    raise LookupError(resource_not_found)
E   LookupError: 
E   **********************************************************************
E     Resource stopwords not found.
E     Please use the NLTK Downloader to obtain the resource:
E   
E     >>> import nltk
E     >>> nltk.download('stopwords')
E     
E     For more information see: https://www.nltk.org/data.html
E   
E     Attempted to load corpora/stopwords.zip/stopwords/
E   
E     Searched in:
E       - '/Users/andreped/nltk_data'
E       - '/Users/andreped/workspace/semantic-router/venv_test/nltk_data'
E       - '/Users/andreped/workspace/semantic-router/venv_test/share/nltk_data'
E       - '/Users/andreped/workspace/semantic-router/venv_test/lib/nltk_data'
E       - '/usr/share/nltk_data'
E       - '/usr/local/share/nltk_data'
E       - '/usr/lib/nltk_data'
E       - '/usr/local/lib/nltk_data'
E   **********************************************************************

During handling of the above exception, another exception occurred:
tests/unit/test_hybrid_layer.py:77: in <module>
    sparse_encoder = BM25Encoder(use_default_params=False)
semantic_router/encoders/bm25.py:27: in __init__
    self.model = encoder()
venv_test/lib/python3.10/site-packages/pinecone_text/sparse/bm25_encoder.py:59: in __init__
    self._tokenizer = BM25Tokenizer(
venv_test/lib/python3.10/site-packages/pinecone_text/sparse/bm25_tokenizer.py:26: in __init__
    self._stop_words = set(stopwords.words(language))
venv_test/lib/python3.10/site-packages/nltk/corpus/util.py:121: in __getattr__
    self.__load()
venv_test/lib/python3.10/site-packages/nltk/corpus/util.py:86: in __load
    raise e
venv_test/lib/python3.10/site-packages/nltk/corpus/util.py:81: in __load
    root = nltk.data.find(f"{self.subdir}/{self.__name}")
venv_test/lib/python3.10/site-packages/nltk/data.py:583: in find
    raise LookupError(resource_not_found)
E   LookupError: 
E   **********************************************************************
E     Resource stopwords not found.
E     Please use the NLTK Downloader to obtain the resource:
E   
E     >>> import nltk
E     >>> nltk.download('stopwords')
E     
E     For more information see: https://www.nltk.org/data.html
E   
E     Attempted to load corpora/stopwords
E   
E     Searched in:
E       - '/Users/andreped/nltk_data'
E       - '/Users/andreped/workspace/semantic-router/venv_test/nltk_data'
E       - '/Users/andreped/workspace/semantic-router/venv_test/share/nltk_data'
E       - '/Users/andreped/workspace/semantic-router/venv_test/lib/nltk_data'
E       - '/usr/share/nltk_data'
E       - '/usr/local/share/nltk_data'
E       - '/usr/lib/nltk_data'
E       - '/usr/local/lib/nltk_data'
E   **********************************************************************
--------------------------------------------------------------------------------------------------- Captured stderr ----------------------------------------------------------------------------------------------------
[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1007)>
[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1007)>

Hi, I cannot replicate this. Are you sure that the network you're running this from doesn't intercept your traffic? You shouldn't need to run without SSL verification.

In [5]: nltk.download("stopwords")
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/bogdanbuduroiu/nltk_data...
Out[5]: True

It could be that the network does, but that is not something I can control. I have had similar SSL issues when installing some Python packages in the past, especially on a VPN. This time I used my phone's hotspot, as I was in transit.


EDIT: Regardless, why is SSL necessary for this use case? Aren't we just downloading a public dataset? My fix lets people run the tests and download NLTK datasets without a strict SSL requirement.
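In the meantime, a workaround sketch that keeps verification intact is to fetch the corpus once over a trusted connection into the first directory NLTK searches (see the log above), then run the tests offline:

import os

import nltk

# One-time download on a network where certificate verification succeeds;
# ~/nltk_data is the first path NLTK searches for corpora
nltk.download("stopwords", download_dir=os.path.expanduser("~/nltk_data"))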

As we're downloading a public dataset from a third party, we want to keep a strict SSL requirement to protect against security vulnerabilities (a man-in-the-middle attack comes to mind first). Closing for now.

we want to keep a strict SSL requirement to protect against security vulnerabilities

OK, that's understandable. Thanks for the clarification, @bruvduroiu! :]