
Scripts to deduplicate annotations, refine NER spans, and analyze the differences.


Deduplication and consolidation utilities

Simple scripts to deduplicate a JSONL file based on a unique id. They are written specifically to deduplicate annotations produced by different annotators, and to refine NER spans or analyze the differences between them.

Deduplication usage

Usage from Python:

from annotations_deduplications import Deduplicator

records = [{"user": ..., "metadata": ...}, ...]

deduplicator = Deduplicator(
    comparators=[("text", is_overlap)],  # list of tuples (attribute, comparator)
    # attributes to block on, nested attributes can be specified with a dot
    blocking_attributes=["user", "metadata.url"],
    clust_kwargs={"eps": 0.1, "min_samples": 2, "metric": "precomputed"},
)
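To illustrate what blocking on dotted attributes such as "metadata.url" could look like, here is a minimal sketch. The helpers `get_nested` and `block_records` are hypothetical names introduced for this example and are not part of the package; the actual implementation may differ.

```python
from collections import defaultdict
from functools import reduce


def get_nested(record, dotted_path):
    """Resolve a dotted path such as "metadata.url" against nested dicts."""
    return reduce(lambda obj, key: obj[key], dotted_path.split("."), record)


def block_records(records, blocking_attributes):
    """Group records that share the same values for all blocking attributes.

    Hypothetical helper: only records inside the same block would later be
    compared with each other.
    """
    blocks = defaultdict(list)
    for record in records:
        key = tuple(get_nested(record, attr) for attr in blocking_attributes)
        blocks[key].append(record)
    return dict(blocks)


records = [
    {"id": 1, "user": "ann", "metadata": {"url": "u1"}},
    {"id": 2, "user": "ann", "metadata": {"url": "u1"}},
    {"id": 3, "user": "bob", "metadata": {"url": "u1"}},
]
blocks = block_records(records, ["user", "metadata.url"])
```

Here records 1 and 2 land in the same block, while record 3 is kept apart because its "user" value differs.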

Optionally, you can specify a custom blocking_rule instead of blocking_attributes (blocking rules are built with the groupbyrule package):

blocking_rule = Any(Match("attr1", "attr2"),
                    Match("attr2", "attr3"),
                    Match("attr3", "attr4"))
deduplicator = Deduplicator(
    comparators=[("text", is_overlap)],  # list of tuples (attribute, comparator)
    # custom blocking rule, used instead of blocking_attributes
    blocking_rule=blocking_rule,
    clust_kwargs={"eps": 0.1, "min_samples": 2, "metric": "precomputed"},
)

for cluster_id, duplicates in deduplicator(records):
    print(cluster_id, duplicates)
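The examples above pass an is_overlap comparator without showing its definition. As a rough sketch only, a comparator of this kind could measure token overlap between two text values. The body, the `threshold` parameter, and the boolean return type below are all assumptions made for illustration; the comparator expected by Deduplicator may have a different signature or return a distance instead.

```python
def is_overlap(a: str, b: str, threshold: float = 0.5) -> bool:
    """Hypothetical comparator: True when the token sets of a and b overlap enough.

    Overlap is measured as Jaccard similarity of lowercased whitespace tokens.
    """
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a or not tokens_b:
        # An empty text is never treated as a duplicate of anything.
        return False
    jaccard = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
    return jaccard >= threshold
```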

Scripts usage


python3 -m annotations_deduplications.scripts.find_duplicates_cli \
                       --input_path ../../data/alarab-unlemmatized \
                       --output ./duplicates_file.jsonl \
                       --skipped_records ./skipped_records.jsonl \
                       --unique_id id

If records were skipped because they lack the unique id, they are saved in the skipped_records.jsonl file. Optionally, you can specify blocking_attributes; the default is ["user", "metadata.url"].
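The handling of skipped records can be pictured roughly as follows. This is a sketch under assumptions, not the script's actual code: `split_by_unique_id` is a hypothetical helper, and treating a missing or null id as "absent" is a guess about the intended behavior.

```python
import json


def split_by_unique_id(lines, unique_id="id"):
    """Split JSONL lines into records that carry the unique id and records that do not."""
    kept, skipped = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        if record.get(unique_id) is not None:
            kept.append(record)
        else:
            skipped.append(record)
    return kept, skipped


sample = ['{"id": "a1", "text": "x"}', "", '{"text": "no id"}']
kept, skipped = split_by_unique_id(sample)
```

The records in `skipped` correspond to what would be written to skipped_records.jsonl.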


A consolidation is a record that contains the aggregated annotations from a set of duplicates.
Consolidation is implemented with the pytextspan package; see pytextspan for more details.
To consolidate the duplicates, run the following script with the previously generated duplicates_file.jsonl:

python3 -m annotations_deduplications.scripts.make_consolidations_cli \
                       --input_path ../../data/alarab-unlemmatized \
                       --duplicates_file ./duplicates_file.jsonl \
                       --output ./consolidations.jsonl \
                       --unique_id id
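To give an intuition for what aggregating annotations from duplicates can mean, the sketch below keeps a span only when enough duplicates agree on it. It does not use pytextspan and is not the script's actual logic; the record layout ("text" plus "spans" with "start", "end", "label"), the `consolidate` function, and the `min_votes` parameter are all assumptions made for this example.

```python
from collections import Counter


def consolidate(duplicates, min_votes=2):
    """Keep every (start, end, label) span that at least min_votes duplicates agree on."""
    votes = Counter()
    for record in duplicates:
        # Use a set so one record cannot vote twice for the same span.
        votes.update({(s["start"], s["end"], s["label"]) for s in record["spans"]})
    agreed = sorted(span for span, count in votes.items() if count >= min_votes)
    return {
        "text": duplicates[0]["text"],
        "spans": [
            {"start": start, "end": end, "label": label}
            for start, end, label in agreed
        ],
    }


duplicates = [
    {
        "text": "Ada lives in Paris",
        "spans": [
            {"start": 0, "end": 3, "label": "PER"},
            {"start": 13, "end": 18, "label": "LOC"},
        ],
    },
    {
        "text": "Ada lives in Paris",
        "spans": [{"start": 0, "end": 3, "label": "PER"}],
    },
]
consolidated = consolidate(duplicates)
```

With the default of two votes, only the "PER" span survives because both annotators marked it; lowering `min_votes` to 1 would keep the union of all spans instead.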