urllib.error.HTTPError: HTTP Error 404: Not Found
oregonpillow opened this issue · 10 comments
Traceback (most recent call last):
File "run.py", line 7, in <module>
print(get_subtitles(video, lang='chi_sim+eng', sim_threshold=70, conf_threshold=65))
File "/home/ubuntu/Github/videocr/env/lib/python3.8/site-packages/videocr/api.py", line 8, in get_subtitles
utils.download_lang_data(lang)
File "/home/ubuntu/Github/videocr/env/lib/python3.8/site-packages/videocr/utils.py", line 21, in download_lang_data
with urlopen(url) as res, open(filepath, 'w+b') as f:
File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.8/urllib/request.py", line 531, in open
response = meth(req, response)
File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
response = self.parent.error(
File "/usr/lib/python3.8/urllib/request.py", line 569, in error
return self._call_chain(*args)
File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
result = func(*args)
File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
Not sure why this is happening. I'm guessing it's a version problem. I'm trying to run the example code with my own video (full system path specified).
@apm1467 Any chance you could provide the exact python version, tesseract version you used successfully?
I am facing the exact same problem. I also tried fixing the URLs referenced in constants.py
TESSDATA_URL = 'https://github.com/tesseract-ocr/tessdata_fast/raw/master/{}.traineddata'
TESSDATA_SCRIPT_URL = 'https://github.com/tesseract-ocr/tessdata_best/raw/master/script/{}.traineddata'
since the paths changed, but that didn't solve the problem.
I had the same issue, and I don't know how to fix the automated download. However, if you simply go to https://github.com/tesseract-ocr/tessdata_best or https://github.com/tesseract-ocr/tessdata_fast, manually download the language files you need (when in doubt, just get all of them), and put them into the folder that is also referenced in constants.py, you will not need the automated download anymore. Not perfect, but good enough for me.
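For anyone who would rather script that manual download, here is a minimal sketch. It assumes the `main` branch URL pattern reported further down in this thread. `fetch_traineddata` is a hypothetical helper, not part of videocr, and the destination folder is left for you to fill in from your own constants.py.

```python
import shutil
from pathlib import Path
from urllib.request import urlopen

# Assumption: the "main" branch URL pattern reported elsewhere in this thread.
TESSDATA_URL = 'https://github.com/tesseract-ocr/tessdata_fast/raw/main/{}.traineddata'


def fetch_traineddata(lang, dest_dir, url_template=TESSDATA_URL):
    """Download <lang>.traineddata into dest_dir unless it is already there.

    Hypothetical helper for illustration; it is not videocr's own code.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / f'{lang}.traineddata'
    if target.exists():
        # Already downloaded (by hand or by an earlier run): nothing to do.
        return target
    url = url_template.format(lang)
    # Write to a temporary name first so that a failed transfer does not
    # leave a truncated model under the final file name.
    tmp = target.with_name(target.name + '.part')
    with urlopen(url) as res, open(tmp, 'wb') as f:
        shutil.copyfileobj(res, f)
    tmp.replace(target)
    return target
```

Calling something like `fetch_traineddata('dan', '/path/to/your/tessdata')` once per language should leave the files where the automated download would have put them, provided the folder matches the one in constants.py.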
I ran into the same issue and putting
TESSDATA_URL = 'https://github.com/tesseract-ocr/tessdata_fast/blob/main/{}.traineddata?raw=true'
TESSDATA_SCRIPT_URL = 'https://github.com/tesseract-ocr/tessdata_best/blob/main/{}.traineddata?raw=true'
in constants.py fixes the downloading issue!
@hadis-git are you sure? It still gives me an error.
Substituting {} with the language needed worked.
What's the error you get?
Traceback (most recent call last):
File "example.py", line 6, in <module>
videocr.save_subtitles_to_file('out.mkv', lang='dan')
File "C:\Users\CrisMattGiov\AppData\Roaming\Python\Python38\site-packages\videocr\api.py", line 20, in save_subtitles_to_file
f.write(get_subtitles(
File "C:\Users\CrisMattGiov\AppData\Roaming\Python\Python38\site-packages\videocr\api.py", line 8, in get_subtitles
utils.download_lang_data(lang)
File "C:\Users\CrisMattGiov\AppData\Roaming\Python\Python38\site-packages\videocr\utils.py", line 21, in download_lang_data
with urlopen(url) as res, open(filepath, 'w+b') as f:
File "C:\Program Files\Python38\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Program Files\Python38\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\Program Files\Python38\lib\urllib\request.py", line 640, in http_response
response = self.parent.error(
File "C:\Program Files\Python38\lib\urllib\request.py", line 569, in error
return self._call_chain(*args)
File "C:\Program Files\Python38\lib\urllib\request.py", line 502, in _call_chain
result = func(*args)
File "C:\Program Files\Python38\lib\urllib\request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
Well, it's telling you that the URL to the language models is wrong.
How about you print(url) out that URL right here (line 20 in 9b97c99)?
Then you'll see whether you edited the right constants.py.
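If you would rather not edit the installed package just to print something, a rough stand-alone sketch of the same check follows. It formats the template for each requested language and prints the result, so you can paste the URL into a browser. The template is the `main` branch one proposed in this thread; `urls_for` is a hypothetical helper, and splitting on `+` is an assumption based on how Tesseract language strings such as `chi_sim+eng` are combined.

```python
# Template as proposed in this thread for the renamed "main" branch.
TESSDATA_URL = 'https://github.com/tesseract-ocr/tessdata_fast/raw/main/{}.traineddata'


def urls_for(lang, template=TESSDATA_URL):
    """Return one download URL per language in an 'a+b' style string.

    Hypothetical helper: assumes each '+'-separated name is fetched separately.
    """
    return [template.format(name) for name in lang.split('+')]


# Print the URLs that would be requested for the language string in the first traceback.
for url in urls_for('chi_sim+eng'):
    print(url)
```

If a printed URL still contains a literal `{}` or the old `master` branch, the constants.py you edited is probably not the one Python is importing.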
This is because the branch name of tessdata_fast and tessdata_best changed from master to main, so the URLs in videocr/constants.py must be changed from:
TESSDATA_URL = 'https://github.com/tesseract-ocr/tessdata_fast/raw/master/{}.traineddata'
TESSDATA_SCRIPT_URL = 'https://github.com/tesseract-ocr/tessdata_best/raw/master/script/{}.traineddata'
to
TESSDATA_URL = 'https://github.com/tesseract-ocr/tessdata_fast/raw/main/{}.traineddata'
TESSDATA_SCRIPT_URL = 'https://github.com/tesseract-ocr/tessdata_best/raw/main/script/{}.traineddata'
We must wait for the owner of this repository to fix this issue. Otherwise, if you want to change it yourself, edit this file in your pip installation directory. On Linux, if you installed with pip, the directory is ~/.local/lib/python{version}/site-packages/videocr/ or /usr/local/lib/python{version}/dist-packages. Search online for the location on other operating systems.
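Instead of guessing the directory per OS, you can ask Python where it would load the package from. This is a small sketch using only the standard library; `package_dir` is a hypothetical helper name, and it assumes you run it with the same interpreter (and virtualenv) that runs your videocr script.

```python
import importlib.util
from pathlib import Path


def package_dir(name):
    """Return the directory an importable package would be loaded from, or None.

    Hypothetical helper: uses find_spec, which locates a top-level package
    without importing it.
    """
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        # Not installed for this interpreter (or a namespace package without a file).
        return None
    return Path(spec.origin).parent


# constants.py should sit directly inside the printed directory.
print(package_dir('videocr'))
```

If this prints `None`, the interpreter you ran it with does not see videocr at all, which usually means a different environment is in use.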
It should move from
TESSDATA_URL = 'https://github.com/tesseract-ocr/tessdata_fast/raw/master/{}.traineddata'
TESSDATA_SCRIPT_URL = 'https://github.com/tesseract-ocr/tessdata_best/raw/master/script/{}.traineddata'
to
TESSDATA_URL = 'https://github.com/tesseract-ocr/tessdata_fast/blob/main/{}.traineddata'
TESSDATA_SCRIPT_URL = 'https://github.com/tesseract-ocr/tessdata_best/blob/main/{}.traineddata'
now.
You can also download the traineddata file and put it at filepath yourself.