Troubleshooting Elasticsearch issue with TEST indices
All of a sudden the full reindex on TEST failed to write documents to the hm_test_ indices; only a small number of documents were indexed initially. The search-api logging indicated that the indexing process was still in progress, but the document counts in the OpenSearch console did not change, even after a few hours. As a result, the portal-ui on TEST was basically unusable.
I tried the following but still had the same issue:
- Reconfigured the hubmap-dev-test Elasticsearch cluster to trigger a "reset" or "reboot"
- Erased all the old hm_test_ indices and recreated them
- Connected to a separate set of indices, hm_teast_627_
- Triggered PUT /reindex-all (the logging showed it was running, but nothing changed on the actual data nodes)
I also tried the following:
- Pointed the TEST search-api to a new set of indices on the PROD cluster. Same issue: the documents stopped being indexed at some point.
- Fired up a local search-api instance pointed at the TEST indices. At one point all the doc counts got reset to 0.
I submitted a help ticket and chatted with AWS tech support, and we made configuration updates to bring the cluster from Yellow to Green. The internal team verified that the cluster's data nodes are fine and that there are no unassigned shards.
I did further investigation and debugging to rule out any causes on my end, and I finally figured out the root cause. It was bad data in our database, which caused an infinite loop. That also explains why a small number of documents got indexed and no more documents were added to the Elasticsearch indices after that.
Dataset 421007293469db7b528ce6478c00348d has itself as its parent. This caused the indexing procedure to loop through this node endlessly, so it never got to the other entities.
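For anyone hitting something similar, here is a minimal sketch of how an ancestor traversal can guard against this kind of self-reference. It is not the actual search-api code: the `collect_ancestors` function, the `parents` mapping, and the IDs in it are all hypothetical, and the real provenance graph goes through Activity nodes, which this simplified version leaves out. The idea is just to remember which nodes have already been visited so a self-parent (or any longer cycle) gets skipped instead of looping forever.

```python
from collections import deque
from typing import Dict, List


def collect_ancestors(entity_id: str, parents: Dict[str, List[str]]) -> List[str]:
    """Return every ancestor of entity_id, visiting each node at most once.

    `parents` is a hypothetical in-memory mapping of entity ID -> parent IDs,
    standing in for whatever the real graph query would return.
    """
    visited = {entity_id}  # seed with the start node so a self-parent is skipped
    ancestors: List[str] = []
    queue = deque(parents.get(entity_id, []))

    while queue:
        current = queue.popleft()
        if current in visited:
            # Already seen (covers self-parent edges and longer cycles).
            continue
        visited.add(current)
        ancestors.append(current)
        queue.extend(parents.get(current, []))

    return ancestors


# Made-up IDs mirroring the bad record: a dataset that lists itself as a parent.
parents = {
    "dataset-a": ["dataset-a", "sample-b"],
    "sample-b": ["donor-c"],
}
print(collect_ancestors("dataset-a", parents))
```

Without the `visited` check, the self-edge on `dataset-a` would keep re-queuing the same node and the traversal would never reach the remaining entities, which matches the symptom described above.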
I deleted the Activity node (5987bb5d5b7783878448fc4cf3150634) and its input/output relationships, then recreated them using the correct direct ancestor, which is Sample ee5c22a10c313e58fbfbd11aa2892cf6.