ualbertalib/jupiter

Error trying to harvest ERA via OAI

sfarnel opened this issue · 11 comments

Describe the bug
EBSCO (who harvests ERA for inclusion in EDS, our discovery service) has recently (beginning mid-March) run into an issue when trying to harvest ERA. This is what they have described:

"We've recently encountered an issue with the OAI data harvests for your Institutional Repository, "ERA: University of Alberta Institutional Repository (ir00008a)". Harvest has been aborted by an error "ERROR: Could not harvest from https://era.library.ualberta.ca/oai: Error while harvesting using setSpec '90b3539f-2198-4058-9786-5e3ccd9e671a:7966f60a-4353-4286-941b-dbd33cd74867'. Will not continue.

Some reasons this might be happening:

--If your server is offline for a technical issue?

--If your server is shut down on a regular schedule?

--If we should still use the following base URL for our harvest attempts https://era.library.ualberta.ca/oai

In addition, is it possible that our IP address is blocked by a firewall? In order to access and harvest your IR records via OAI-PMH, we need the following IP addresses added to your network exceptions list:

140.234.252.9

140.234.253.9

140.234.255.9"

If it's of any help, I have been able to use the basic browser approach to get records from this set without an issue; see https://era.library.ualberta.ca/oai/?verb=ListRecords&set=90b3539f-2198-4058-9786-5e3ccd9e671a:7966f60a-4353-4286-941b-dbd33cd74867&metadataPrefix=oai_dc

Not sure if this was a one-time issue or if something more substantial is happening.

I spoke with @nmacgreg because he was working on making firewall rules permanent. The IPs from the EBSCO message are not being blocked.

I have a hunch that this may be caused by the seemingly irregular 3:00 am Redis reboots, which would bring this service down. I have yet to verify this, however.

Another thought to add to the above: below is an excerpt from the HAProxy log for IP 140.234.252.9 on April 1. Notice the HTTP 500 error in the first entry. The second entry is a request with the same URI; it succeeded with an HTTP 200 but required 5191 ms. Does this indicate that the OAI client sent a request, waited 5000 ms, timed out, and then sent a second request, which caused the HTTP 500 error? I'm assuming the log timestamps are recorded at the end of the HTTP session (not the start), so the second request in the log actually starts before the first.

Testing response time:

  • all requests not following the above pattern (HTTP 500 then HTTP 200) have a response time of 1130 ms or lower. Why do the problem URIs require a longer response time? The sets are not larger. Is the time due to other database operations, and might it be helped by #2826 and PR #3428?
  • why do two requests to the same URI result in different HTTP status codes (the shorter one is HTTP 500 while the longer is HTTP 200)?

Apr  1 14:00:50 telford haproxy[1188]: 140.234.252.9:10400 [01/Apr/2024:14:00:50.082] jupiter-ssl-beta~ jupiter2-backend/era-app-prd-1 99/0/1/355/+455 500 +476 - - ---- 205/67/2/2/0 0/0 "GET /oai?verb=ListRecords&metadataPrefix=oai_dc&set=2c5be215-7943-434e-b96e-4390bc16b00a:5065d96a-6fcd-4870-aa03-6b94b9464064&from=2024-02-01T21:00:23Z&until=2024-04-01T20:00:24Z HTTP/1.1"
Apr  1 14:00:55 telford haproxy[1188]: 140.234.252.9:10400 [01/Apr/2024:14:00:50.537] jupiter-ssl-beta~ jupiter2-backend/era-app-prd-1 5124/0/1/66/+5191 200 +1077 - - ---- 204/67/1/1/0 0/0 "GET /oai?verb=ListRecords&metadataPrefix=oai_dc&set=2c5be215-7943-434e-b96e-4390bc16b00a:5065d96a-6fcd-4870-aa03-6b94b9464064&from=2024-02-01T21:00:23Z&until=2024-04-01T20:00:24Z HTTP/1.1"
Apr  1 14:00:55 telford haproxy[1188]: 140.234.252.9:10400 [01/Apr/2024:14:00:55.729] jupiter-ssl-beta~ jupiter2-backend/era-app-prd-2 126/0/1/68/+195 200 +1077 - - ---- 204/67/1/1/0 0/0 "GET /oai?verb=ListRecords&metadataPrefix=oai_dc&set=81420852-41c5-4624-8b60-4188d702ded6:099e7f31-a5c1-4005-bfa7-b6c7dd6349b7&from=2024-02-01T21:00:23Z&until=2024-04-01T20:00:24Z HTTP/1.1"
Apr  1 14:00:56 telford haproxy[1188]: 140.234.252.9:10400 [01/Apr/2024:14:00:55.924] jupiter-ssl-beta~ jupiter2-backend/era-app-prd-1 93/0/0/66/+159 200 +1077 - - ---- 204/67/1/1/0 0/0 "GET /oai?verb=ListRecords&metadataPrefix=oai_dc&set=2c5be215-7943-434e-b96e-4390bc16b00a:28738add-6881-447b-a655-ca9331e3cadc&from=2024-02-01T21:00:23Z&until=2024-04-01T20:00:24Z HTTP/1.1"

Description of the HAProxy HTTP log format: https://sematext.com/blog/haproxy-logs/#http-format
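
A quick way to check the timestamp assumption above is to reconstruct each request's start and (approximate) end from the bracketed accept date plus the total time, the last field of the timer block. A minimal sketch assuming the standard HAProxy HTTP log format, with haproxy.log as a placeholder for wherever the excerpt is saved:

require "time"

# Pull the accept timestamp, timer block, and HTTP status out of each log line.
LOG_RE = /\[(?<accept>[^\]]+)\] \S+ \S+ (?<timers>[\d\/+-]+) (?<status>\d{3})/

def summarize(line)
  m = LOG_RE.match(line) or return
  start  = Time.strptime(m[:accept], "%d/%b/%Y:%H:%M:%S.%L")
  total  = m[:timers].split("/").last.delete("+").to_i # Tt, in milliseconds
  finish = start + total / 1000.0
  puts format("status=%s start=%s end~%s (%d ms)",
              m[:status], start.strftime("%T.%L"), finish.strftime("%T.%L"), total)
end

File.foreach("haproxy.log") { |line| summarize(line) }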

RE: Connor's speculation about this being related to the reboot of Redis:

  • Yes, redis is rebooted weekly, as are all VMs
  • Yes, there is some randomness built into the day and time of the reboot

I have 2 questions:

  • Is redis used in responding to an OAI query? (I thought redis was a queuing subsystem... emoji for head-scratch!)
  • Recent testing shows:
    • 100% of queries to era-app-prd-1 get "200" responses,
    • ...but only ~10% of queries to era-app-prd-2 get "200", the rest get "500"s.
    • (You must mangle your /etc/hosts for this test; don't forget to undo this!)
    • One Prod server works, the other is intermittent. How is this related to redis?

Jeff was provocative on Slack & brought this to my attention.

Note that to perform this testing, you must mangle your /etc/hosts to make era.library.ualberta.ca "point to" the host address for either era-app-prd-1 or era-app-prd-2, and it's frightfully important that you undo this mangle after testing!

Use this command for testing: curl "https://era.library.ualberta.ca/oai/?verb=ListRecords&set=90b3539f-2198-4058-9786-5e3ccd9e671a:7966f60a-4353-4286-941b-dbd33cd74867&metadataPrefix=oai_dc"

The HTTPD logs on the Prod server era-app-prd-2 do record the 500 response, but don't give any useful details.

OTOH, the production.log is pretty chatty about the 500 response, and hands off to Rollbar. Maybe this will give you a place to start? Can you recreate this in Dev, for instance? Or has anybody tried in Staging? Is it just this one URL, or are other URLs failing? What is the source of the data used to inform the response to this query -- Redis, PostgreSQL, or maybe an external web resource -- and can you test that backend to find the source of the flakiness?
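
If it helps narrow down the backend question, a couple of quick checks from a Rails console on the suspect server can at least confirm basic reachability (assuming the app's usual PostgreSQL and Redis configuration; point the Redis client at wherever our Redis actually lives):

# bin/rails console on era-app-prd-2
ActiveRecord::Base.connection.execute("SELECT 1")   # PostgreSQL reachable?
require "redis"
Redis.new.ping                                      # Redis reachable? expects "PONG"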

I was unable to immediately find an explanation for this query working 100% of the time on one Prod server but intermittently failing on another. Our business processes are really carefully designed to prevent differences between servers. I'm open to the idea of either building a new Prod VM, or performing the kind of rebuild operation we'd use to recover from a bad update (e.g., delete the jupiter package, clean primary directories, use the playbook to reinstall) on era-app-prd-2. Find me on Slack if you want to pursue this; I'm stepping away.

A test from yesterday:

I tried an OAI request on each of the 3 prod Jupiter servers. The request works on only 2 of 3, with the third returning a 500 error.

curl "https://era.library.ualberta.ca/oai/?verb=ListRecords&set=90b3539f-2198-4058-9786-5e3ccd9e671a:7966f60a-4353-4286-941b-dbd33cd74867&metadataPrefix=oai_dc"

I'm changing my /etc/hosts file to isolate each prod server:
prod-1: works
prod-2: fails
prod-3: works

The following also succeeds on 2 of 3 prod servers.

https://era.library.ualberta.ca/oai?verb=ListRecords&metadataPrefix=oai_dc

Fast forward to today: I tried reproducing yesterday's test and received different results (all 3 prod servers worked for both test URIs). What can change day-to-day and server-to-server?

  • ERA resources: no, the error would appear across all servers
  • state/data that is time-dependent and server-dependent (e.g., the sitemap cache problem)
  • one test returned a resumption token (pagination), the other didn't (because all items fit in a single request)

Todo:

Fyi, I think curl --resolve might be able to replace the need to change /etc/hosts:

curl --resolve "era.library.ualberta.ca:443:${IP}" "https://era.library.ualberta.ca/oai/?verb=ListRecords&set=90b3539f-2198-4058-9786-5e3ccd9e671a:7966f60a-4353-4286-941b-dbd33cd74867&metadataPrefix=oai_dc"


Update 11:55 (two hours later):
* below fails on prod-1 & prod-2; succeeds on prod-3

curl --resolve "era.library.ualberta.ca:443:${IP}" "https://era.library.ualberta.ca/oai/?verb=ListRecords&set=90b3539f-2198-4058-9786-5e3ccd9e671a:7966f60a-4353-4286-941b-dbd33cd74867&metadataPrefix=oai_dc"

curl --resolve "era.library.ualberta.ca:443:${IP}" "https://era.library.ualberta.ca/oai/?verb=ListRecords&metadataPrefix=oai_dc"

I had a chance to do some digging into this, as I am fairly familiar with Oaisys. Redis is leveraged in Oaisys as a key-value store to handle resumption tokens (a defined OAI-PMH parameter). I believe Redis is on one server; with the discovery of the server discrepancy, the Redis reboots were likely a red herring.
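
For anyone less familiar with Oaisys, the resumption-token pattern is roughly the following (a sketch of the general idea only; Oaisys's actual key names, TTLs, and storage calls may differ):

require "redis"
require "json"
require "securerandom"

# Issue an opaque resumptionToken whose cursor state lives in Redis with a TTL,
# so a paginated ListRecords harvest can pick up where it left off.
def issue_resumption_token(redis, set_spec:, cursor:)
  token = SecureRandom.hex(16)
  redis.setex("oai:token:#{token}", 3600, { set: set_spec, cursor: cursor }.to_json)
  token
end

# Look the token back up on the follow-up request; nil means expired or unknown.
def resume_from_token(redis, token)
  state = redis.get("oai:token:#{token}")
  state && JSON.parse(state)
end

redis = Redis.new
token = issue_resumption_token(redis, set_spec: "example-community:example-collection", cursor: 100)
resume_from_token(redis, token)   # => {"set"=>"example-community:example-collection", "cursor"=>100}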

I spent some time in the Jupiter/Oaisys/ActsAsRdfable codebases and have narrowed some things down. I believe this is the piece of code throwing the error (https://github.com/ualbertalib/acts_as_rdfable/blob/e7ae5f76157139bae3804230a0d789affee1a9af/lib/acts_as_rdfable/acts_as_rdfable_core.rb#L26). The format is given by Oaisys through the metadataPrefix URL param (https://github.com/ualbertalib/oaisys/blob/master/app/controllers/oaisys/pmh_controller.rb#L102 --> https://github.com/ualbertalib/oaisys/blob/4d0f657480658343ba01cfe39f71ef0060ccdcab/app/views/oaisys/pmh/list_records.xml.builder#L33), and it appears to be present in these requests. I wonder if it is related to this change (ualbertalib/acts_as_rdfable@e6f8604).

I did notice that the first occurrence we have of this error in Rollbar was Feb 9th, 2024 (https://app.rollbar.com/a/ualbertalib/fix/item/jupiter/1833), which more or less lines up with the release containing the change linked above (jupiter version 2.7.0, released Jan 20, 2024) and adds to my suspicion; that release also included a major Rails upgrade, however. One thing that is likely unrelated but came to mind is an old ticket in Shortcut (previously Clubhouse, which was used for project management at the time); I have included a screenshot of it below:

[screenshot: oaisysticket]

Stepping away from this rabbit hole for now

I suspect Connor's message is on the right track. I've been unable to find another explanation, nor have I been able to reproduce the issue reliably.

Expanding a bit on what I think Connor is saying:

  • format is correctly passed into the request (otherwise validation earlier in the call stack would trigger a different error)
  • the singleton_class / singleton_method change is potentially the best area for further investigation: ualbertalib/acts_as_rdfable@e6f8604
raise InvalidArgumentError unless self.class.supported_metadata_serialization_formats.include?(format)

https://github.com/ualbertalib/acts_as_rdfable/blob/e7ae5f76157139bae3804230a0d789affee1a9af/lib/acts_as_rdfable/acts_as_rdfable_core.rb#L26

Thoughts:

  • what does the change from self.singleton_class.supported_metadata_serialization_formats to self.class.supported_metadata_serialization_formats impact? Is the inability to consistently reproduce over time related to object caching: https://shopify.engineering/caching-without-marshal-part-one? (A hypothetical sketch of the suspected failure mode follows this list.)
  • does the singleton pattern fail when transitioning from oai_dc to oai_etdms, or when transitioning from theses to items?
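
To make the above concrete, here is a minimal, hypothetical sketch (not the actual acts_as_rdfable code; the Item class and serialize method are stand-ins) of why mutable class-level format registration could make identical requests succeed in one Puma worker and raise in another:

class Item
  class << self
    attr_accessor :supported_metadata_serialization_formats
  end
  self.supported_metadata_serialization_formats = []

  def serialize(format)
    # Mirrors the guard quoted above: unknown formats raise, which Rails
    # surfaces as an HTTP 500.
    unless self.class.supported_metadata_serialization_formats.include?(format)
      raise ArgumentError, "unsupported format: #{format}"
    end
    "<#{format}>...</#{format}>"
  end
end

# A worker process where the registration side effect has already run serves 200s:
Item.supported_metadata_serialization_formats << :oai_dc
Item.new.serialize(:oai_dc)   # => "<oai_dc>...</oai_dc>"

# A sibling worker where it never ran still has an empty format list, so the
# same call raises there and the same URI returns a 500.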

I've updated the previous comments with my existing thoughts. My Library sprint ends this week so I'm returning the issue to the queue in the event another person has cycles to move this forward.

This seems to be an issue with how Puma, behind HAProxy, handles the requests that store/retrieve the supported_metadata_serialization_formats. I suspect that refactoring the affected code mentioned above using something like request_store (https://github.com/steveklabnik/request_store) would resolve this issue. Considering ERA's upcoming migration, the fact that only one harvester has encountered this (I presume some OAI harvesters have error handling that works around the issue), and the work/risk required to refactor acts_as_rdfable, I feel it is probably best to put this on hold at this point in time.
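
As a rough sketch of that direction (an assumption about how a refactor could look, not a worked-out patch against acts_as_rdfable; the module and register_* method names are made up), request_store keeps the registration in per-request storage instead of mutable class-level state:

require "request_store"

module PerRequestRdfFormats
  # Record a format for the duration of the current request only.
  def register_metadata_serialization_format(format)
    (RequestStore.store[:rdf_formats] ||= []) << format
  end

  # Read back whatever this request registered; other requests see nothing.
  def supported_metadata_serialization_formats
    RequestStore.store[:rdf_formats] || []
  end
end

RequestStore clears its store via a Rack middleware at the end of each request, which is what would make the state per-request rather than per-process.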

Thanks @ConnorSheremeta I think that holding off on this makes sense.