
Remove duplicate documents from Elasticsearch

Primary language: Python. License: MIT.

deduplicate-elasticsearch

A Python script to detect duplicate documents in Elasticsearch. Once duplicates have been detected, it is straightforward to call a delete operation to remove them.

For a full description of how this script works, including an analysis of its memory requirements, see: https://alexmarquardt.com/2018/07/23/deduplicating-documents-in-elasticsearch/

This version builds on the original script by adding the logic to delete duplicate documents, leaving one document from each set of duplicates.
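The following is a minimal sketch of that idea, not the script itself. It assumes documents are dicts with `_id` and `_source` keys (the shape of Elasticsearch search hits), that duplicates are defined by a hypothetical set of fields (`name` and `email` here), and that the first ID seen in each group is the one to keep. The helper names and the index name `my_index` are illustrative. The delete actions are built in the dict form that the Python client's `helpers.bulk` is understood to accept; nothing here contacts a cluster.

```python
import hashlib
import json
from collections import defaultdict

# Hypothetical fields whose combined values define a duplicate.
KEYS_TO_MATCH = ["name", "email"]


def group_by_hash(docs, keys=KEYS_TO_MATCH):
    """Map a hash of the chosen field values to the list of matching document IDs."""
    groups = defaultdict(list)
    for doc in docs:
        values = [doc["_source"].get(key) for key in keys]
        # Serialise as JSON so that ["ab", "c"] and ["a", "bc"] hash differently.
        digest = hashlib.md5(json.dumps(values).encode("utf-8")).hexdigest()
        groups[digest].append(doc["_id"])
    return groups


def ids_to_delete(groups):
    """Keep the first ID in every group and return all the others."""
    doomed = []
    for ids in groups.values():
        doomed.extend(ids[1:])
    return doomed


def bulk_delete_actions(index, ids):
    """Build one delete action per ID for a bulk request (assumed action format)."""
    return [{"_op_type": "delete", "_index": index, "_id": doc_id} for doc_id in ids]


if __name__ == "__main__":
    sample = [
        {"_id": "1", "_source": {"name": "Ann", "email": "ann@example.com"}},
        {"_id": "2", "_source": {"name": "Ann", "email": "ann@example.com"}},
        {"_id": "3", "_source": {"name": "Bob", "email": "bob@example.com"}},
    ]
    duplicates = ids_to_delete(group_by_hash(sample))
    for action in bulk_delete_actions("my_index", duplicates):
        print(action)
```

Storing only a hash and a list of IDs per group, rather than whole documents, keeps the in-memory structure small. Against a live cluster, the sample list would be replaced by hits from a scroll or scan over the index, and the actions would be sent with a bulk request.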