pulibrary/lib_jobs

Many POD Updates Missed from February

Opened this issue · 11 comments

Expected behavior

When records matching our POD publishing profile are updated in Alma, these updates should be sent to the POD aggregator.

Actual behavior

See https://pod.stanford.edu/organizations/princeton for the list of datafiles accepted by POD. The volume of records processed in February is far lower than the amount of data written out by the Daily Alma Publishing Job. The Publishing Jobs Log in Alma shows that many millions of updates passed through the POD process, and the output files on lib-sftp in /alma/pod include many files larger than 100 MB produced throughout February.

Steps to replicate

This issue requires investigation. We likely have to re-process updates from a day with mass updates during this period and observe the results. No exceptions related to this process appear to have been logged.

Impact of this bug

Our Resource Sharing partners in Borrow Direct are seeing out-of-date data from us. The missed updates likely represent very few new records, since the volume here was caused by mass record clean-up work by CaMS.

Honeybadger link and code snippet, if applicable

This may be another version of the out of memory issue #695.

Implementation notes, if any

The history of Princeton's submissions to POD (https://pod.stanford.edu/organizations/princeton) shows a period, 8/1/2023-8/15/2023, where on some days we successfully submitted files with many hundreds of thousands of record updates, so a somewhat comparable volume of data has passed through our process before. The February data is greater in volume than August's, so perhaps we reached a tipping point of some sort.

Acceptance Criteria

  • Look at the cron log on these servers (we found for the Submit Collection issue that Out of Memory errors were not being logged anywhere)
  • Identify the root cause of the error
  • Create tickets to address the error (if possible)
  • Create a ticket/plan for how we can get the POD data current after we assess this issue
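
As a starting point for the first criterion, here is a minimal sketch of scanning log lines for out-of-memory signs. The method name `find_oom_lines` and the pattern list are assumptions for illustration. The patterns are common Linux OOM-killer and Ruby messages, not strings confirmed to appear in our logs.

```ruby
# Hypothetical patterns: typical kernel OOM-killer and Ruby allocation
# failure messages. Adjust after looking at what the servers actually log.
OOM_PATTERNS = [
  /out of memory/i,
  /oom-killer/i,
  /NoMemoryError/,
  /failed to allocate memory/i
].freeze

# Returns [line_number, line] pairs for every line matching an OOM pattern.
# Accepts any enumerable of strings (an Array, or File.foreach(path)).
def find_oom_lines(lines)
  matches = []
  lines.each_with_index do |line, index|
    matches << [index + 1, line.chomp] if OOM_PATTERNS.any? { |pattern| pattern.match?(line) }
  end
  matches
end
```

It could be pointed at a real file with `find_oom_lines(File.foreach(path))`, where `path` is whichever cron or system log turns out to be relevant on these servers.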

Wrote to the POD support team to see if they have logged any errors related to PUL submissions in the last several months, since we are seeing almost daily discrepancies between what Alma publishes and what POD logs.

The POD support team says they see no errors related to PUL submissions, and that the record counts the POD UI displays are accurate based on the files we've actually submitted to POD.

I will advise POD that we want to do a full refresh of the data by republishing the set in Alma publishing. Once we do that, we'll republish the set to whichever POD stream they tell us to use. They may want the records in a new stream to keep the size of our record data set down in POD.

Per @rladdusaw, running the process locally worked fine with five Alma dump files from 2/23 (totaling 1075929 records), as long as the upload to POD was excluded. No upload was recorded at all on the POD website for this date. During the local run, peak memory usage was 200MB. The issue may therefore relate to the POST we make to send the processed data.
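
If the POST is the suspect, one option is to record the outcome of every upload explicitly so a failure cannot pass silently. This is a sketch only: `UploadResult` and `upload_with_logging` are hypothetical names, not existing lib_jobs code, and the block passed in is assumed to perform the real POST and return the HTTP status code.

```ruby
# Hypothetical result type; not part of lib_jobs.
UploadResult = Struct.new(:file, :ok, :detail)

# Runs the caller's upload block and records the outcome either way.
# `log` is anything that responds to << (an Array, an IO, or a Logger).
def upload_with_logging(file, log)
  status = yield(file).to_i
  ok = (200..299).cover?(status)
  outcome = ok ? "ok" : "failed"
  log << "POD upload #{outcome}: #{file} (HTTP #{status})\n"
  UploadResult.new(file, ok, "HTTP #{status}")
rescue StandardError => e
  # Network and I/O errors land here instead of vanishing.
  log << "POD upload raised: #{file} (#{e.class}: #{e.message})\n"
  UploadResult.new(file, false, "#{e.class}: #{e.message}")
end
```

A wrapper like this would not help if the operating system kills the process outright, since no Ruby code runs at that point. That case would still need to be found in the system logs.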

POD advised we should republish to a new stream. Right now the default is our current production stream:

pod_default_stream: <%= ENV["POD_DEFAULT_STREAM"] || "princeton-prod-0223" %>

We can redirect the republishing event to this stream: https://pod.stanford.edu/organizations/princeton/streams/princeton-prod-0424.
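
The config line above falls back to `princeton-prod-0223` only when `POD_DEFAULT_STREAM` is unset. A small sketch of that lookup follows. The method name `pod_stream` is made up for illustration, and whether the republishing run should be redirected through this environment variable or by another route is an assumption.

```ruby
# Mirrors the fallback in the config line: use the environment value when
# present, otherwise the current production stream.
DEFAULT_STREAM = "princeton-prod-0223"

# `env` defaults to the real environment; a plain Hash works for testing.
def pod_stream(env = ENV)
  env["POD_DEFAULT_STREAM"] || DEFAULT_STREAM
end
```

With `POD_DEFAULT_STREAM=princeton-prod-0424` set for the republishing run, `pod_stream` would return the new stream, and regular daily runs without the variable would keep using the default.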

@mzelesky please ping us when you think this data set is ready to be refreshed.

It probably makes sense to wait until we restore the records affected by the WorldShare process, since 7 million+ records will be modified by that restoration.

Ok, when the plan for this becomes clearer let's plan for this alongside our blacklight and submit collection updates.

DataSync updates have resumed per @mzelesky as of 4/23/2024, but we continue to see discrepancies between POD publishing in Alma and the records that the POD platform reports as processed. See these two screenshots for 4/19/2024-4/26/2024.

For POD
Screenshot 2024-04-26 at 9.19.41 AM.png

For Alma Publishing
Screenshot 2024-04-26 at 9.20.01 AM.png

None of the record counts processed in Alma are in line with what POD has processed for the same dates.
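
To track this without comparing screenshots by eye, the daily counts from both sides could be compared in a few lines. This is a rough sketch: `count_discrepancies` is a hypothetical helper, and it assumes the counts have already been collected by hand or by script into hashes keyed by date.

```ruby
# Compares per-date record counts from Alma publishing and from POD.
# Returns one row per date where the counts differ by more than `tolerance`.
# Dates missing on either side are treated as a count of 0.
def count_discrepancies(alma_counts, pod_counts, tolerance: 0)
  dates = (alma_counts.keys | pod_counts.keys).sort
  dates.each_with_object([]) do |date, rows|
    alma = alma_counts.fetch(date, 0)
    pod = pod_counts.fetch(date, 0)
    next if (alma - pod).abs <= tolerance

    rows << { date: date, alma: alma, pod: pod, missing: alma - pod }
  end
end
```

Each returned row names a date to investigate, and the `missing` value gives a rough size for the gap on that day.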