cdrini/openlibrary

Better handle IA caching

cdrini opened this issue · 2 comments

We often get an error when requesting too much data from IA; we should handle this more elegantly, because indexing slows WAY down when we can't cache IA metadata.

2019-05-14 00:09:10,175 [ERROR] Error while caching IA
Traceback (most recent call last):
  File "solr_builder_main.py", line 149, in solr_builder.solr_builder_main.LocalPostgresDataProvider.cache_ia_metadata
    for doc in self._get_lite_metadata(b, rows=batch_size)['docs']:
  File "solr_builder_main.py", line 139, in solr_builder.solr_builder_main.LocalPostgresDataProvider._get_lite_metadata
    return simplejson.loads(resp_str)['response']
  File "/usr/local/lib/python2.7/dist-packages/simplejson/__init__.py", line 518, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 373, in decode
    raise JSONDecodeError("Extra data", s, end, len(s))
JSONDecodeError: Extra data: line 3 column 1 - line 3 column 19998 (char 61 - 20058)

I have a fix for this on my branch (a rough sketch of the approach follows the list). It:

  • caches all ocaids regardless of length
  • queries IA in chunks to keep the URL size manageable
  • doesn't retry anything that's not in the cache (because it'll just fail again, like it did the first time)
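To illustrate the chunking and no-retry behaviour, here's a minimal sketch; cache_ia_metadata, fetch_lite_metadata, and the chunk size of 1000 are placeholder names and values for this write-up, not the actual code on my branch:

    import itertools
    import logging

    logger = logging.getLogger("solr_builder")

    def chunks(iterable, size):
        """Yield successive lists of at most `size` items."""
        it = iter(iterable)
        while True:
            chunk = list(itertools.islice(it, size))
            if not chunk:
                return
            yield chunk

    def cache_ia_metadata(ocaids, fetch_lite_metadata, cache, chunk_size=1000):
        """Fetch IA metadata in small chunks so each request URL stays short.

        Anything that errors (or is missing from the response) is simply left
        out of the cache and never retried.
        """
        for chunk in chunks(ocaids, chunk_size):
            try:
                response = fetch_lite_metadata(chunk)  # wraps the existing IA query
            except Exception:
                logger.exception("Error while caching IA")
                continue  # don't retry; it'll just fail again
            for doc in response.get("docs", []):
                cache[doc["identifier"]] = doc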

Having said that, I don't think we should be querying an API during a bulk load operation at all. I don't think we actually need this data, but if we do, we should get it from a bulk dump from IA and read it from that file as part of the indexing process.
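If we did go the bulk-dump route, the lookup side could be as simple as something like the sketch below; the gzipped JSON-lines format and the file name are assumptions, since no such dump file is specified here:

    import gzip
    import json

    def load_ia_dump(path):
        """Build an in-memory ocaid -> metadata map from a JSON-lines dump."""
        metadata_by_ocaid = {}
        with gzip.open(path, "rt") as f:
            for line in f:
                doc = json.loads(line)
                metadata_by_ocaid[doc["identifier"]] = doc
        return metadata_by_ocaid

    # During indexing, lookups are local and never touch the network:
    # ia_metadata = load_ia_dump("ia_metadata_dump.jsonl.gz")
    # lite_metadata = ia_metadata.get(ocaid)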

Requiring a low-latency network with 100% availability is too fragile.