AustinHasten/PlexHolidays

Script Freezing on Large Libraries

nickheidke opened this issue · 21 comments

I'm attempting to use the script on my "TV Shows" library, which has ~8,000 items. Despite restarting it a few times and trying different keywords, the script freezes up at around 90-95% each time.

In this screenshot, it's been locked at 90% for over 8 hours:
[screenshot of the progress bar stuck at 90%]

Another user had a similar issue, but I had resolved it. I'll look into this some more if I find the time.

jh888 commented

I'm having the same issue. I've never done much in Python since I'm a Perl guy, but I found that if I comment out the "logging.getLogger" lines that disable some logging, I can see what's happening. For many titles I'm getting a 404 error back from IMDB. It seems to keep retrying each failed GET request every second on each thread, and that's why it gets stuck. For example:

2018-11-15 17:09:37,522 CRITICAL [imdbpy] C:\Python37\lib\site-packages\imdb\_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tttt2308733?/keywords', 'proxy': '', 'exception type': 'IOError', 'original exception': <HTTPError 404: 'Not Found'>},); kwds: {}
Traceback (most recent call last):
  File "C:\Python37\lib\site-packages\imdb\parser\http\__init__.py", line 231, in retrieve_unicode
    response = uopener.open(url)
  File "C:\Python37\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Python37\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python37\lib\urllib\request.py", line 563, in error
    result = self._call_chain(*args)
  File "C:\Python37\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "C:\Python37\lib\urllib\request.py", line 755, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "C:\Python37\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Python37\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python37\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Python37\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "C:\Python37\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

This is the link it is failing to load:
http://www.imdb.com/title/tttt2308733?/keywords

It actually looks like the link should only have two t's in the URI, though. Like this:
http://www.imdb.com/title/tt2308733?/keywords

Thanks for all your work on this script, by the way.

@jh888 Thanks for looking into it; I've been busy with other pet projects. Could you insert the following line just after line 78 and give me a sampling of the output?

print('INFO', imdb_id)

Line 103 should be keeping it to 2 t's rather than 4, but apparently not.

jh888 commented

Thanks Austin.

It looks to me like all of the imdb_id's are good and only have two t's. I see there's slightly different logic for handling TV episodes versus movies, and it's interesting that I had no issues with my TV libraries, only with movie libraries. So I think the logic on line 90 might be all that runs for the movies, without going into get_episode_id.

From what I can tell, though, the imdbpy.get_movie_keywords function is being given the correct imdb_id's. Would that mean there's a bug in imdbpy? I have imdbpy version 6.6 and Python 3.7.1 on Windows.

Keyword (i.e. Holiday name): christmas
Christmas Movies:   0%|                                                                      | 0/56 [00:00<?, ?it/s]INFO tt0104431?
INFO tt3922810?
INFO tt0790604?
INFO tt0319343?
INFO tt0304669?
INFO tt6645614?
INFO tt0897387?
INFO tt2308733?
2018-11-15 18:45:11,850 CRITICAL [imdbpy] C:\Python37\lib\site-packages\imdb\_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'https://www.imdb.com/title/tttt0104431?/keywords', 'proxy': '', 'exception type': 'IOError', 'original exception': <HTTPError 404: 'Not Found'>},); kwds: {}

(this continues with errors for each of the imdb_id's)

@jh888 Actually I don't think the regex I used should be capturing the 'tt', so it shouldn't be in your output. IMDbPy is likely prepending the 'tt' also, leading to the 'tttt'. Not sure why this is happening. You could try adding [2:] to the end of line 90 and see if that helps.

jh888 commented

@AustinHasten That makes sense about the imdb_id's not needing the t's. I added the [2:] to the end of line 90 to remove the first two characters like so:
return re.search(r'tt(\d*)\?', plex_guid).group()[2:]

That resulted in the imdb_id's not having any t's and the URI's were correct. I ran it against several libraries of movies and it's completing successfully now. Thanks!

@jh888 Great! Not sure what's causing that to happen. The only text captured should be that within the parentheses on line 90, which as you can see doesn't include the 'tt'. Oh well.
@nickheidke Would you like to try the same steps?

I changed line 90 to:

return re.search(r'tt(\d*)?', plex_guid).group()[2:]

Then I ran py .\__init__.py, which I assumed would use the latest copy of the file. It's still hanging at about 95% for me (7273/7684 episodes).

The script uses 10 threads; I assume some of your files are perpetually throwing exceptions until all 10 threads are occupied by those problematic files. The solution would be to limit the maximum number of retries in the @retry decorators. I don't remember why I didn't include that originally, probably because I never found any files that didn't eventually resolve.
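If you want to try that, the change would look something like this. The decorator comes from the retry package; the exception class and the tries/delay numbers here are just a starting point, and the method body is approximate (keep whatever line 116 already does):

from imdb._exceptions import IMDbDataAccessError
from retry import retry

class PlexHolidays:
    ...  # rest of the class as in the current script

    # Cap the retries so a single dead IMDb page can't tie up a worker thread
    # forever; the retry package's default (tries=-1) retries indefinitely.
    @retry(IMDbDataAccessError, tries=3, delay=1)
    def get_imdb_keywords(self, imdb_id):
        data = self.imdbpy.get_movie_keywords(imdb_id)['data']
        return data.get('keywords', [])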

I tried limiting the number of retries. The script runs to about 90-95%, then reports an error like this:

Traceback (most recent call last):
  File "C:\Python37\lib\site-packages\imdb\parser\http\__init__.py", line 231, in retrieve_unicode
    response = uopener.open(url)
  File "C:\Python37\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Python37\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python37\lib\urllib\request.py", line 563, in error
    result = self._call_chain(*args)
  File "C:\Python37\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "C:\Python37\lib\urllib\request.py", line 755, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "C:\Python37\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Python37\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python37\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Python37\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "C:\Python37\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Is there a way I can have it just "eat" the error? I'm not terribly concerned about it missing a single episode or something.

I would try adding this after line 80, before the finally statement:

except urllib.error.HTTPError:
    return (False, None)
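
In context, the block around line 80 would end up looking roughly like this (keep your existing try and finally, I'm only adding the except clause):

    try:
        imdb_keywords = self.get_imdb_keywords(imdb_id)
    except urllib.error.HTTPError:
        # Give up on this item rather than letting the exception kill the worker thread
        return (False, None)
    finally:
        ...  # existing finally block, unchanged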

It's apparent that the script needs a bit of maintenance between API changes and now an unhandled exception. The problem is that I would usually play with this in my downtime at work, where I obviously don't have access to my home network for testing.

Tried adding the exception handler, but a new exception was encountered during the handler:

NameError: name 'urllib' is not defined

Do I need an additional import statement of some kind?

Try just HTTPError instead.

Got to 99% this time, but then hit:

NameError: name 'HTTPError' is not defined

Dag nabbit, I'm sorry for the back and forth. Keep it to just HTTPError, but add this import statement at the top of the file:

from urllib.error import HTTPError

No worries! I really don't mind being a bit of a guinea pig. New exception this time:

...
  File "C:\Python37\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Python37\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "C:\Python37\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File ".\__init__.py", line 120, in <module>
    ph = PlexHolidays()
  File ".\__init__.py", line 63, in __init__
    self.results = ThreadPool(10).map(self.find_matches, self.plex.media)
  File "C:\Python37\lib\multiprocessing\pool.py", line 290, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Python37\lib\multiprocessing\pool.py", line 683, in get
    raise self.value
  File "C:\Python37\lib\multiprocessing\pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "C:\Python37\lib\multiprocessing\pool.py", line 44, in mapstar
    return list(map(*args))
  File ".\__init__.py", line 80, in find_matches
    imdb_keywords = self.get_imdb_keywords(imdb_id)
  File "", line 2, in get_imdb_keywords
  File "C:\Python37\lib\site-packages\retry\api.py", line 74, in retry_decorator
    logger)
  File "C:\Python37\lib\site-packages\retry\api.py", line 33, in __retry_internal
    return f()
  File ".\__init__.py", line 116, in get_imdb_keywords
    data = self.imdbpy.get_movie_keywords(imdb_id)['data']
  File "C:\Python37\lib\site-packages\imdb\parser\http\__init__.py", line 468, in get_movie_keywords
    cont = self.retrieve(self.urls['movie_main'] % movieID + 'keywords')
  File "C:\Python37\lib\site-packages\imdb\parser\http\__init__.py", line 406, in retrieve
    ret = self.urlOpener.retrieve_unicode(url, size=size)
  File "C:\Python37\lib\site-packages\imdb\parser\http\__init__.py", line 265, in retrieve_unicode
    'original exception': e}
imdb._exceptions.IMDbDataAccessError: {'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2911814/keywords', 'proxy': '', 'exception type': 'IOError', 'original exception': <HTTPError 404: 'Not Found'>}

Actually, you know what, just get rid of the HTTPError, so it should just be:

except:

on line 81 or whatever. Keep the return (False, None)

That did it! Approving my pull request should resolve the issue.

Glad to finally get a successful run. It seems all of these issues could simply be due to the regex on line 90: it's capturing the 'tt' when it shouldn't be, and it's returning "2911814" as the IMDb ID for one of your items, which isn't a valid IMDb ID, though that ID could technically be wrong in Plex and the regex could be working properly.
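
For the record, here's what the expression on line 90 actually does in a Python shell. The GUID below is just an example in the legacy Plex IMDb agent format, so treat that string as an assumption; your real GUIDs may differ:

import re

plex_guid = 'com.plexapp.agents.imdb://tt2308733?lang=en'
match = re.search(r'tt(\d*)\?', plex_guid)
print(match.group())    # 'tt2308733?' - group() with no argument is the entire match
print(match.group(1))   # '2308733'    - group(1) is only what the parentheses captured

So the parentheses themselves are fine; it's .group() with no argument that returns the whole match rather than just the capture, which is why the [2:] slice (or switching to .group(1)) sorts it out.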

A few of my TV episodes are likely unsearchable within IMDb (a couple of random YouTube "series"), so I'm not sure if those are causing an issue. I don't need them included in the playlist, so skipping them via the error handler is fine by me. Let me know if you want to use my library for any tests. Happy to accommodate.

Well, they must be searchable within IMDb in some way because we're passing the check on line 89, which means Plex has already searched for and found an IMDb ID for the episode and stored it in the "GUID". My guess is that when Plex scanned your media, that IMDb ID was valid, but the page has since been deleted. I can find higher and lower IMDb IDs than the one in the error you provided earlier, so that checks out with my theory.

It might be helpful to add something like the following line in that except block, just before the return statement:

self.pbar.write(f'Retries exceeded for "{medium.title}" with GUID {plex_guid}')

Then we can see the titles of episodes that are failing and check their Plex GUIDs to see if they look alright.
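
Putting the pieces from this thread together, that section of find_matches should end up looking roughly like this (the surrounding code and the finally block stay whatever your copy already has; medium is the find_matches parameter, and plex_guid is whatever variable holds the item's GUID at that point):

    try:
        imdb_keywords = self.get_imdb_keywords(imdb_id)
    except:
        # Retries exhausted, most likely because the IMDb page no longer exists.
        # Log the title and GUID for inspection, then skip the item.
        self.pbar.write(f'Retries exceeded for "{medium.title}" with GUID {plex_guid}')
        return (False, None)
    finally:
        ...  # existing finally block, unchanged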