Certificate verify failed when running unit tests on a MacBook during nltk data download
Closed this issue · 4 comments
Issue
When running the unit tests to verify my development branches, the required nltk datasets fail to download. This blocks me from running the tests at all.
Quick-fix
It seems like I can manually download these myself, using a custom fix. Instead of doing:
import nltk
nltk.download("stopwords")
I can do:
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download("stopwords")
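As a side note, the workaround above patches the default HTTPS context globally for the whole process. A sketch of a narrower variant (standard-library `ssl` only; the download call itself is left as a placeholder comment) would scope the patch to just the one-off download and restore the default afterwards:

```python
import ssl

# Save the default HTTPS context factory so it can be restored later.
original = ssl._create_default_https_context
try:
    # Apply the same quick-fix, but only for the duration of this block.
    ssl._create_default_https_context = ssl._create_unverified_context
    # ... perform the one-off download here, e.g. nltk.download("stopwords") ...
    patched = ssl._create_default_https_context is ssl._create_unverified_context
finally:
    # Restore certificate verification for everything that runs afterwards.
    ssl._create_default_https_context = original

print(patched)                                        # → True: patch was active inside the block
print(ssl._create_default_https_context is original)  # → True: default restored afterwards
```

This way, any later HTTPS traffic in the test run still gets certificate verification.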
Solution
Having this fix available seems relevant for this project. The issue may originate in another library, but for now it surfaces when initializing the BM25Encoder model. I would expect other models that use datasets from nltk to hit the same problem.
I can draft a PR if it is relevant to this project. If not, I can open a PR against whichever repo this issue originates from.
PyTest logs
___________________________________________________________________________________ ERROR collecting tests/unit/test_hybrid_layer.py ___________________________________________________________________________________
venv_test/lib/python3.10/site-packages/nltk/corpus/util.py:84: in __load
root = nltk.data.find(f"{self.subdir}/{zip_name}")
venv_test/lib/python3.10/site-packages/nltk/data.py:583: in find
raise LookupError(resource_not_found)
E LookupError:
E **********************************************************************
E Resource stopwords not found.
E Please use the NLTK Downloader to obtain the resource:
E
E >>> import nltk
E >>> nltk.download('stopwords')
E
E For more information see: https://www.nltk.org/data.html
E
E Attempted to load corpora/stopwords.zip/stopwords/
E
E Searched in:
E - '/Users/andreped/nltk_data'
E - '/Users/andreped/workspace/semantic-router/venv_test/nltk_data'
E - '/Users/andreped/workspace/semantic-router/venv_test/share/nltk_data'
E - '/Users/andreped/workspace/semantic-router/venv_test/lib/nltk_data'
E - '/usr/share/nltk_data'
E - '/usr/local/share/nltk_data'
E - '/usr/lib/nltk_data'
E - '/usr/local/lib/nltk_data'
E **********************************************************************
During handling of the above exception, another exception occurred:
tests/unit/test_hybrid_layer.py:77: in <module>
sparse_encoder = BM25Encoder(use_default_params=False)
semantic_router/encoders/bm25.py:27: in __init__
self.model = encoder()
venv_test/lib/python3.10/site-packages/pinecone_text/sparse/bm25_encoder.py:59: in __init__
self._tokenizer = BM25Tokenizer(
venv_test/lib/python3.10/site-packages/pinecone_text/sparse/bm25_tokenizer.py:26: in __init__
self._stop_words = set(stopwords.words(language))
venv_test/lib/python3.10/site-packages/nltk/corpus/util.py:121: in __getattr__
self.__load()
venv_test/lib/python3.10/site-packages/nltk/corpus/util.py:86: in __load
raise e
venv_test/lib/python3.10/site-packages/nltk/corpus/util.py:81: in __load
root = nltk.data.find(f"{self.subdir}/{self.__name}")
venv_test/lib/python3.10/site-packages/nltk/data.py:583: in find
raise LookupError(resource_not_found)
E LookupError:
E **********************************************************************
E Resource stopwords not found.
E Please use the NLTK Downloader to obtain the resource:
E
E >>> import nltk
E >>> nltk.download('stopwords')
E
E For more information see: https://www.nltk.org/data.html
E
E Attempted to load corpora/stopwords
E
E Searched in:
E - '/Users/andreped/nltk_data'
E - '/Users/andreped/workspace/semantic-router/venv_test/nltk_data'
E - '/Users/andreped/workspace/semantic-router/venv_test/share/nltk_data'
E - '/Users/andreped/workspace/semantic-router/venv_test/lib/nltk_data'
E - '/usr/share/nltk_data'
E - '/usr/local/share/nltk_data'
E - '/usr/lib/nltk_data'
E - '/usr/local/lib/nltk_data'
E **********************************************************************
--------------------------------------------------------------------------------------------------- Captured stderr ----------------------------------------------------------------------------------------------------
[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data] CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data] unable to get local issuer certificate (_ssl.c:1007)>
[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data] CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data] unable to get local issuer certificate (_ssl.c:1007)>
Hi, I cannot replicate this. Are you sure that the network you're running this from doesn't intercept your traffic? You shouldn't need to run without SSL verification.
In [5]: nltk.download("stopwords")
[nltk_data] Downloading package stopwords to
[nltk_data] /Users/bogdanbuduroiu/nltk_data...
Out[5]: True
It could be that the network does, but that is out of my control. I have had similar SSL issues when installing some Python packages in the past, especially on VPN. This time I used my phone's hotspot, as I was in transit.
EDIT: Regardless, why is SSL necessary for this use case? Aren't we just downloading a public dataset? My fix lets people run tests and download datasets with nltk without strict SSL requirements.
As we're downloading a public dataset from a third party, we want to keep a strict SSL requirement to protect against security vulnerabilities (man-in-the-middle attacks come to mind first). Closing for now
we want to keep a strict SSL requirement to protect against security vulnerabilities
OK, that's understandable. Thanks for the clarification, @bruvduroiu! :]