amphi-ai/amphi-etl

Unique key detector tool

Opened this issue · 1 comments

Hello,

This is a feature I haven't seen in any data prepation/etl. The core feature is to detect the unique key in a dataframe. More than often, you have to deal with a dataset without knowing what's make a row unique. This can lead to misinterpret the data, cartesian product at join and other funny stuff.

How do I imagine that ?

Entry; one dataframe, ability to select fields or check all, ability to specify a max number of field for combination (empty or 0=no max).
Algo : it tests the count distinct every combination of field versus the count of rows

Result : one row by field combination that works. If no result : "no field combination is unique. check for duplicate or need for aggregation upstream".

ex :

<style> </style>
order_id line_id amount customer site
1 1 100 A U_250
1 2 12 A U_250
1 3 45 A U_250
2 1 75 A U_250
2 2 12 A U_250
3 1 15 B U_250
4 1 45 B U_251

The user will select every field but excluding Amount (he knows that Amount would have no sense in key)

The algo will test the following key
-each separate field
-each combination of two fields
-each combination of three fields
-each combination of four fields

to match the number of row (7)
And gives something like that

<style> </style>
choice number of fields field combination
very good 2 order_id,line_id
average 3 order_id,line_id, customer
average 3 order_id,line_id, site
bad 4 order_id,line_id, site, customer
….

Best regards,

Simon

Thanks @simonaubertbd, that's actually a really good idea! The algo seems straighforward! I don't mark it as a priority for now but would be a great nice to have :)