apm1467/videocr

urllib.error.HTTPError: HTTP Error 404: Not Found

oregonpillow opened this issue · 10 comments

Traceback (most recent call last):
  File "run.py", line 7, in <module>
    print(get_subtitles(video, lang='chi_sim+eng', sim_threshold=70, conf_threshold=65))
  File "/home/ubuntu/Github/videocr/env/lib/python3.8/site-packages/videocr/api.py", line 8, in get_subtitles
    utils.download_lang_data(lang)
  File "/home/ubuntu/Github/videocr/env/lib/python3.8/site-packages/videocr/utils.py", line 21, in download_lang_data
    with urlopen(url) as res, open(filepath, 'w+b') as f:
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Not sure why this is happening; I'm guessing it's a version problem. I'm trying to run the example code with my own video (full system path specified).

@apm1467 Any chance you could provide the exact Python and Tesseract versions you used successfully?

I am facing the exact same problem. I also tried fixing the URLs referenced in constants.py:
TESSDATA_URL = 'https://github.com/tesseract-ocr/tessdata_fast/raw/master/{}.traineddata'
TESSDATA_SCRIPT_URL = 'https://github.com/tesseract-ocr/tessdata_best/raw/master/script/{}.traineddata'
since the paths changed, but that didn't solve the problem.

I had the same issue, and I don't know how to fix the automated download. However, you can go to https://github.com/tesseract-ocr/tessdata_best or https://github.com/tesseract-ocr/tessdata_fast, manually download the language files you need (when in doubt, just get all of them), and put them into the folder referenced in constants.py. After that you will not need the automated download anymore. Not perfect, but good enough for me.
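For anyone who would rather script this workaround than click through GitHub, here is a minimal sketch. The helper names `traineddata_url` and `fetch_traineddata` and the `tessdata` folder are made up for illustration, and the URL pattern assumes the tessdata_fast repository serves files from its `main` branch; point `target_dir` at whatever folder your constants.py actually references.

```python
import shutil
from pathlib import Path
from urllib.request import urlopen

# Assumed URL pattern (tessdata_fast, "main" branch); adjust if the repo layout differs.
FAST_URL = 'https://github.com/tesseract-ocr/tessdata_fast/raw/main/{}.traineddata'


def traineddata_url(lang_name: str) -> str:
    """Build the download URL for one language model (hypothetical helper)."""
    return FAST_URL.format(lang_name)


def fetch_traineddata(lang_names, target_dir: Path) -> list:
    """Download every '<name>.traineddata' that is not already in target_dir."""
    target_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in lang_names:
        filepath = target_dir / f'{name}.traineddata'
        if not filepath.is_file():
            # Only hit the network for files that are actually missing.
            with urlopen(traineddata_url(name)) as res, open(filepath, 'wb') as f:
                shutil.copyfileobj(res, f)
        paths.append(filepath)
    return paths
```

Calling `fetch_traineddata(['chi_sim', 'eng'], Path('tessdata'))` would fetch both files on the first run and do nothing on later runs, since existing files are skipped.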

I ran into the same issue and putting

TESSDATA_URL = 'https://github.com/tesseract-ocr/tessdata_fast/blob/main/{}.traineddata?raw=true'

TESSDATA_SCRIPT_URL = 'https://github.com/tesseract-ocr/tessdata_best/blob/main/{}.traineddata?raw=true'

in constants.py fixes the downloading issue!

@hadis-git Are you sure? It still gives me an error.
Substituting {} with the language needed worked.

Yes, I am sure.
The lang parameter of get_subtitles() is split by '+', and each part is substituted into those constants. The models are then downloaded in download_lang_data(lang: str).

So you have to make sure that your lang parameter corresponds to one or more of the available models.
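To make the substitution concrete, here is a small sketch of how a lang string could expand into download URLs. It is not the library's code: `lang_urls` is a hypothetical helper, and the constant simply copies the `raw/main` form of TESSDATA_URL quoted in this thread. The point is that `{}` has to stay in the constant as a placeholder, since it is filled in once per language.

```python
# Copy of the URL constant as discussed in this thread (assumed "main" branch form).
TESSDATA_URL = 'https://github.com/tesseract-ocr/tessdata_fast/raw/main/{}.traineddata'


def lang_urls(lang: str) -> list:
    """Split 'a+b' into ['a', 'b'] and fill each name into the '{}' placeholder."""
    return [TESSDATA_URL.format(name) for name in lang.split('+')]


for url in lang_urls('chi_sim+eng'):
    print(url)
```

If a name in lang does not match a file in the repository, the resulting URL points at nothing and the download fails with a 404.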

What's the error you get?

Traceback (most recent call last):
  File "example.py", line 6, in <module>
    videocr.save_subtitles_to_file('out.mkv', lang='dan')
  File "C:\Users\CrisMattGiov\AppData\Roaming\Python\Python38\site-packages\videocr\api.py", line 20, in save_subtitles_to_file
    f.write(get_subtitles(
  File "C:\Users\CrisMattGiov\AppData\Roaming\Python\Python38\site-packages\videocr\api.py", line 8, in get_subtitles
    utils.download_lang_data(lang)
  File "C:\Users\CrisMattGiov\AppData\Roaming\Python\Python38\site-packages\videocr\utils.py", line 21, in download_lang_data
    with urlopen(url) as res, open(filepath, 'w+b') as f:
  File "C:\Program Files\Python38\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Program Files\Python38\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Program Files\Python38\lib\urllib\request.py", line 640, in http_response
    response = self.parent.error(
  File "C:\Program Files\Python38\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Program Files\Python38\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
  File "C:\Program Files\Python38\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Well, it's telling you that the URL to the language models is wrong.
How about you add a print(url) in download_lang_data in utils.py, right before the urlopen(url) call?

Then you'll see whether you edited the right constants.py.
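If you would rather not edit the installed package, you can test a candidate URL on its own. A minimal sketch, where `probe` is a hypothetical helper and not part of videocr: it returns the HTTP status instead of raising, so a 404 shows up next to the exact URL that produced it.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def probe(url: str, opener=urlopen):
    """Return (url, status_code); HTTP errors such as 404 are reported, not raised."""
    try:
        with opener(url) as res:
            return url, res.status
    except HTTPError as err:
        return url, err.code
```

For example, `print(probe('https://github.com/tesseract-ocr/tessdata_fast/raw/main/dan.traineddata'))` shows whether that URL resolves. The `opener` parameter exists only so the function can be exercised without network access.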

This is because the branch name of tessdata_fast and tessdata_best changed from master to main, so the URLs in videocr/constants.py must be changed from:

TESSDATA_URL = 'https://github.com/tesseract-ocr/tessdata_fast/raw/master/{}.traineddata'

TESSDATA_SCRIPT_URL = 'https://github.com/tesseract-ocr/tessdata_best/raw/master/script/{}.traineddata'

to

TESSDATA_URL = 'https://github.com/tesseract-ocr/tessdata_fast/raw/main/{}.traineddata'

TESSDATA_SCRIPT_URL = 'https://github.com/tesseract-ocr/tessdata_best/raw/main/script/{}.traineddata'

We must wait for the owner of this repository to fix this issue. If you want to change it yourself in the meantime, edit this file in your pip installation directory. On Linux, if you installed using pip, the directory is ~/.local/lib/python{version}/site-packages/videocr/ or /usr/local/lib/python{version}/dist-packages; search online for other OSes.
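Rather than guessing the directory, you can ask Python where the package lives. A minimal sketch, assuming videocr is importable from the interpreter you run it with; `package_file` is a hypothetical helper built on the standard `importlib.util.find_spec`.

```python
from importlib.util import find_spec
from pathlib import Path


def package_file(package: str, filename: str):
    """Return the path of `filename` inside an installed package, or None if not found."""
    spec = find_spec(package)
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at the package's __init__.py; its parent is the package folder.
    return Path(spec.origin).parent / filename


print(package_file('videocr', 'constants.py'))
```

This prints None when the package is not installed for that interpreter, which is also a quick way to notice that you are editing a copy in a different environment.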

It should change from

TESSDATA_URL = 'https://github.com/tesseract-ocr/tessdata_fast/raw/master/{}.traineddata'

TESSDATA_SCRIPT_URL = 'https://github.com/tesseract-ocr/tessdata_best/raw/master/script/{}.traineddata'

to

TESSDATA_URL = 'https://github.com/tesseract-ocr/tessdata_fast/blob/main/{}.traineddata'

TESSDATA_SCRIPT_URL = 'https://github.com/tesseract-ocr/tessdata_best/blob/main/{}.traineddata'

now.

You can also download the traineddata file yourself and put it at filepath.