Precision over 100% reported if ground truth contains pairs of identical ids
mrckzgl opened this issue · 4 comments
We have a dirty ER workflow where the EntityMatching graph is generated with similarity_threshold=0.0
(so that all compared edges are kept), and we then tune the clustering's similarity_threshold
using optuna. We encountered the following:
At the top end, as the threshold approaches 1.0 and the clustering therefore produces very few matches, the reported precision exceeds 100%. I would have to dig deeper into what exactly causes this, but maybe you have an idea; possibly it is just a bug in the edge case where the number of matches is low.
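For context, the tuning loop is roughly the following (a minimal sketch: the toy f1_at_threshold function stands in for the actual pyJedAI clustering-plus-evaluation step, and a plain grid sweep stands in for the optuna study, to keep the example self-contained):

```python
def f1_at_threshold(threshold):
    # Toy stand-in for: prune the similarity graph at `threshold`,
    # cluster it, and score the result against the ground truth.
    # This dummy objective simply peaks at 0.6.
    return 1.0 - abs(threshold - 0.6)

# Sweep candidate thresholds over the full [0, 1] range of edge weights;
# optuna would instead suggest thresholds via trial.suggest_float.
candidates = [i / 10 for i in range(11)]
best = max(candidates, key=f1_at_threshold)
print(best)  # 0.6
```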
best
Some more data. From
pyJedAI/src/pyjedai/clustering.py
Line 366 in 2e41af4
I printed
eval_obj.__dict__
:
{'total_matching_pairs': 76.0, 'data': <pyjedai.datamodel.Data object at 0x7e11d1839db0>, 'true_positives': 102, 'true_negatives': 185456764.0, 'false_positives': -26.0, 'false_negatives': 553360, 'all_gt_ids': {0, 1, 2, [...], 19316}, 'num_of_true_duplicates': 553462, 'precision': 1.3421052631578947, 'recall': 0.00018429449537637633, 'f1': 0.00036853838399531744}
So total_matching_pairs is smaller than true_positives.
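The reported numbers are consistent with each other under the assumption that false_positives is derived as total_matching_pairs - true_positives (an assumption about pyJedAI's internals, but it reproduces the dict above exactly):

```python
# Reproduce the reported figures from eval_obj.__dict__.
total_matching_pairs = 76.0
true_positives = 102  # inflated by self-pairs such as "id1|id1" in the GT

# Assumed derivation: every matched pair not counted as a TP is an FP.
false_positives = total_matching_pairs - true_positives  # -26.0

precision = true_positives / (true_positives + false_positives)
print(precision)  # 1.3421052631578947
```

With true_positives larger than the number of emitted pairs, false_positives goes negative and precision climbs above 1.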
Ah, I got it. We have matching pairs of the same id in our ground truth, i.e. rows like "id1|id1" in the CSV file. Thinking about it, this is not incorrect: an entity is obviously identical to itself, but I also see that the GT is not as clean as it should be. I will clean up the GT, but an additional approach might be to check for identity of the ids here:
pyJedAI/src/pyjedai/clustering.py
Line 362 in 2e41af4
and, in that case, not increment true_positives, to make the evaluation more robust. Of course, one would also need to check the clean-clean ER case and the other steps' evaluations to ensure the calculations remain correct and consistent.

We hadn't considered this scenario before. I fully agree that it should be addressed, given the prevalence of errors in real-world data. We will address this by adding a validation check.
Thanks for the detailed trace and feedback!
We added a drop_duplicates call when parsing the GT file, here:
pyJedAI/src/pyjedai/datamodel.py
Line 159 in c19399a
I think this will work better.
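For readers following along, the cleanup amounts to something like this (a sketch, not the actual pyJedAI parsing code; the column names "D1"/"D2" and the sample data are illustrative). Note that pandas' drop_duplicates removes repeated rows, while self-pairs need a separate filter:

```python
import pandas as pd

# Toy GT with a repeated row and self-pairs.
gt = pd.DataFrame({
    "D1": ["id1", "id1", "id2", "id3"],
    "D2": ["id1", "id1", "id5", "id4"],
})

gt = gt.drop_duplicates()        # removes the second "id1|id1" row
gt = gt[gt["D1"] != gt["D2"]]    # additionally drops remaining self-pairs

print(len(gt))  # 2
```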
Cheers,
Konstantinos