hubmapconsortium/search-api

Troubleshooting Elasticsearch issue with TEST indices

Closed this issue · 2 comments

All of a sudden, the full reindex on TEST failed to write documents to the hm_test_ indices; only a small number of documents were indexed initially. The search-api logging indicated that the indexing process was still in progress, but the counts in the OpenSearch console did not update, even after a few hours. As a result, portal-ui on TEST was basically unusable.
Screenshot 2024-06-27 at 11 24 44 PM

Screenshot 2024-06-27 at 11 26 02 PM

I tried the following but still had the same issue:

  • Reconfigured the hubmap-dev-test Elasticsearch cluster to trigger the "reset" or "reboot"
  • Erased all the old hm_test_ indices and recreated them
  • Connected to a separate set of indices hm_teast_627_
  • Triggered PUT /reindex-all (the logging showed it was running, but nothing changed on the actual data nodes)

None of them made a difference.
Screenshot 2024-06-27 at 11 11 48 PM

Also tried the following:

  • Pointed the TEST search-api to a new set of indices on the PROD cluster: same issue, and documents stopped being indexed at some point.
  • Fired up a local search-api instance pointed at the TEST indices; at one point all the doc counts got reset to 0.

I submitted a help ticket and chatted with AWS tech support, and we made configuration updates to bring the cluster from Yellow to Green. Their internal team verified that the cluster's data nodes were fine and that there were no unassigned shards.

I did further investigation and debugging to rule out any causes on my end, and I finally figured out the root cause. It was bad data in our database, which caused an infinite loop. That also explained why a small number of documents got indexed and after that no more documents were added to the Elasticsearch indices.

Dataset 421007293469db7b528ce6478c00348d had itself as a parent. This caused the indexing procedure to loop endlessly through this node, so it never got to the other entities.
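For illustration, here is a minimal sketch of how an ancestor traversal can be made safe against this kind of self-referencing data. It is not the actual search-api code: the function name `collect_ancestors`, the `parents_of` mapping, and the placeholder IDs are all hypothetical. The idea is simply to track visited nodes so that a node pointing back at itself (or at one of its descendants) is processed only once.

```python
from collections import deque


def collect_ancestors(entity_id, parents_of):
    """Return the set of all ancestors of entity_id.

    parents_of is a hypothetical dict mapping an entity ID to a list of its
    direct parent IDs. Each node is expanded at most once, so a cycle in the
    data cannot make the loop run forever.
    """
    seen = set()
    queue = deque(parents_of.get(entity_id, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already expanded; skipping it is what breaks the cycle
        seen.add(node)
        queue.extend(parents_of.get(node, []))
    return seen


# Placeholder IDs: "dataset-a" lists itself as a parent, mimicking the bad data.
parents_of = {
    "dataset-a": ["dataset-a", "sample-b"],
    "sample-b": ["donor-c"],
}
ancestors = collect_ancestors("dataset-a", parents_of)

# An entity that appears among its own ancestors signals a cycle worth reporting.
has_cycle = "dataset-a" in ancestors
```

With a guard like this the traversal terminates, and the `has_cycle` check gives a cheap way to log the offending entity instead of silently stalling the reindex.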

Screenshot 2024-06-28 at 8 03 53 PM

I deleted the Activity node (5987bb5d5b7783878448fc4cf3150634) and its input/output relationships, then recreated them using the correct direct ancestor, which is Sample ee5c22a10c313e58fbfbd11aa2892cf6.