AI-team-UoA/pyJedAI

Precision over 100% reported if ground truth contains pairs of identical ids

mrckzgl opened this issue · 4 comments

We have a dirty ER workflow, where the EntityMatching graph is generated with similarity_threshold=0.0 (to get all compared edges) and then we optimize the clustering for the optimal similarity_threshold using optuna. We encountered this:
[Figure_1: reported evaluation metrics plotted over similarity_threshold]

At the top end, as the threshold approaches 1.0 and the clustering consequently produces very few matches, the reported precision goes beyond 100%. I would have to dig deeper into what exactly causes this, but maybe you have an idea; possibly it is only a bug in edge cases where the number of matches is low.

best

Some more data. From

eval_obj.calculate_scores(true_positives=true_positives)

I printed eval_obj.__dict__:

{'total_matching_pairs': 76.0, 'data': <pyjedai.datamodel.Data object at 0x7e11d1839db0>, 'true_positives': 102, 'true_negatives': 185456764.0, 'false_positives': -26.0, 'false_negatives': 553360, 'all_gt_ids': {0, 1, 2, [...], 19316}, 'num_of_true_duplicates': 553462, 'precision': 1.3421052631578947, 'recall': 0.00018429449537637633, 'f1': 0.00036853838399531744}

So total_matching_pairs is smaller than true_positives, which also explains the negative false_positives.
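A minimal reproduction of the arithmetic, assuming precision is computed as true_positives / total_matching_pairs and false positives as their difference (names taken from the printed dict; the formulas are an inference from the values, not pyJedAI's actual source):

```python
# Values from the printed eval_obj.__dict__ above.
total_matching_pairs = 76.0
true_positives = 102  # inflated by "id1|id1" self-pairs in the ground truth

# Assumed formulas, consistent with the reported numbers:
precision = true_positives / total_matching_pairs
false_positives = total_matching_pairs - true_positives

print(precision)        # 1.3421052631578947, i.e. > 100%
print(false_positives)  # -26.0, matching the dict above
```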

Ah, I got it. We have matching pairs with the same id in our ground truth, i.e. something like "id1|id1" as a row in the CSV file. Thinking about it, this is not strictly incorrect: an entity is obviously identical to itself, but I also see that the ground truth is not as clean as it should be. I will clean up the ground truth, but an additional approach might be to check the ids for identity here:

if id1 in entity_index and \

and in that case not increment true_positives, to make the evaluation more robust. Of course, one would also need to verify, for the clean-clean ER case and the other steps' evaluations, that the calculations remain correct and consistent.
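The suggested guard could be sketched roughly as below. This is a hypothetical standalone counter, not pyJedAI's actual implementation; the names ground_truth_pairs and predicted_pairs are illustrative:

```python
def count_true_positives(ground_truth_pairs, predicted_pairs):
    """Count ground-truth pairs found by the matcher, skipping self-pairs."""
    predicted = set(predicted_pairs)
    true_positives = 0
    for id1, id2 in ground_truth_pairs:
        if id1 == id2:
            continue  # "id1|id1" rows: an entity trivially equals itself
        if (id1, id2) in predicted or (id2, id1) in predicted:
            true_positives += 1
    return true_positives

# Self-pair (1, 1) no longer inflates the count:
print(count_true_positives([(1, 1), (1, 2), (3, 4)], [(1, 2)]))  # 1
```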

We hadn't considered this scenario before. I fully agree that it should be addressed, given the prevalence of errors in data. We will address this by adding a validation check.

Thanks for the detailed trace and feedback!

We added a drop_duplicates call when parsing the GT file. Here:

self.ground_truth.drop_duplicates(inplace=True)

I think this will work better.
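For illustration, here is what that call does on a small ground-truth table (column names are illustrative): it removes repeated rows. Note that self-pairs such as "id1|id1" are not duplicates of another row, so they would still need the identity check discussed above:

```python
import pandas as pd

# Toy ground truth with one exactly repeated row, (0, 5).
gt = pd.DataFrame({"id1": [0, 0, 1, 2], "id2": [5, 5, 6, 7]})
gt.drop_duplicates(inplace=True)
print(len(gt))  # 3: the repeated (0, 5) row is gone
```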

Cheers,
Konstantinos