cdrini/openlibrary

Better handle IA caching

cdrini opened this issue · 2 comments

We often get an error when requesting too much data from IA; we should handle this more elegantly, because indexing slows WAY down when we can't cache IA metadata.

2019-05-14 00:09:10,175 [ERROR] Error while caching IA
Traceback (most recent call last):
  File "solr_builder_main.py", line 149, in solr_builder.solr_builder_main.LocalPostgresDataProvider.cache_ia_metadata
    for doc in self._get_lite_metadata(b, rows=batch_size)['docs']:
  File "solr_builder_main.py", line 139, in solr_builder.solr_builder_main.LocalPostgresDataProvider._get_lite_metadata
    return simplejson.loads(resp_str)['response']
  File "/usr/local/lib/python2.7/dist-packages/simplejson/__init__.py", line 518, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 373, in decode
    raise JSONDecodeError("Extra data", s, end, len(s))
JSONDecodeError: Extra data: line 3 column 1 - line 3 column 19998 (char 61 - 20058)

I have a fix for this on my branch (a rough sketch of the approach follows the list). It:

  • caches all ocaids regardless of length
  • queries IA in chunks to keep the URL size manageable
  • doesn't retry anything that's not in the cache (because it'll just fail again, like it did the first time)
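To illustrate the chunking and no-retry behaviour, here's a minimal sketch; cache_ia_metadata, fetch_lite_metadata, and the chunk size of 1000 are placeholder names and values for this write-up, not the actual code on my branch:

    import itertools
    import logging

    logger = logging.getLogger("solr_builder")

    def chunks(iterable, size):
        """Yield successive lists of at most `size` items."""
        it = iter(iterable)
        while True:
            chunk = list(itertools.islice(it, size))
            if not chunk:
                return
            yield chunk

    def cache_ia_metadata(ocaids, fetch_lite_metadata, cache, chunk_size=1000):
        """Fetch IA metadata in small chunks so each request URL stays short.

        Anything that errors (or is missing from the response) is simply left
        out of the cache and never retried.
        """
        for chunk in chunks(ocaids, chunk_size):
            try:
                response = fetch_lite_metadata(chunk)  # wraps the existing IA query
            except Exception:
                logger.exception("Error while caching IA")
                continue  # don't retry; it'll just fail again
            for doc in response.get("docs", []):
                cache[doc["identifier"]] = doc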

Having said that, I don't think we should be querying an API during a bulk load operation at all. I don't think we actually need this data, but if we do, we should get it from a bulk dump from IA and read it from that file as part of the indexing process.
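If we did go the bulk-dump route, the lookup side could be as simple as something like the sketch below; the gzipped JSON-lines format and the file name are assumptions, since no such dump file is specified here:

    import gzip
    import json

    def load_ia_dump(path):
        """Build an in-memory ocaid -> metadata map from a JSON-lines dump."""
        metadata_by_ocaid = {}
        with gzip.open(path, "rt") as f:
            for line in f:
                doc = json.loads(line)
                metadata_by_ocaid[doc["identifier"]] = doc
        return metadata_by_ocaid

    # During indexing, lookups are local and never touch the network:
    # ia_metadata = load_ia_dump("ia_metadata_dump.jsonl.gz")
    # lite_metadata = ia_metadata.get(ocaid)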

Requiring a low-latency network with 100% availability is too fragile.