MrPowers/mack

append_without_duplicates allows duplicates that come from the input dataframe

Closed · 1 comment

Duplicates are still written when the duplicated rows are inside the input dataframe itself and their key values are not yet in the table. For example:

Let's say we have the following table:

+----+---------+---------+
|  id|firstname| lastname|
+----+---------+---------+
|   1|   Benito|  Jackson|
|   4|    Maria|     Pitt|
|   6|  Rosalia|     Pitt|
+----+---------+---------+

And we want to insert this new dataframe:

+----+---------+---------+
|  id|firstname| lastname|
+----+---------+---------+
|   3|     Jose| Travolta|
|   8|     Jose| Travolta|
+----+---------+---------+

Calling the function with the following parameters does not prevent duplicates in the table:

mack.append_without_duplicates(deltaTable, append_df, ["firstname","lastname"])

The resulting table will be:

+----+---------+---------+
|  id|firstname| lastname|
+----+---------+---------+
|   1|   Benito|  Jackson|
|   4|    Maria|     Pitt|
|   6|  Rosalia|     Pitt|
|   3|     Jose| Travolta|
|   8|     Jose| Travolta|
+----+---------+---------+

To avoid this, we should also deduplicate the input dataframe on the same columns before appending the new data.
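To make the difference concrete, here is a minimal pure-Python sketch that models both behaviours on the example rows above. It is not mack's implementation: the helper names `append_checked_against_table_only` and `append_with_input_dedup` are hypothetical, rows are plain dicts instead of Spark rows, and the first occurrence of each key is kept. In PySpark the equivalent step would presumably be `append_df.dropDuplicates(["firstname", "lastname"])` before the append, where Spark does not guarantee which of the duplicated rows is retained.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def key_of(row: Row, keys: List[str]) -> Tuple[object, ...]:
    """Build the tuple of key-column values used to detect duplicates."""
    return tuple(row[k] for k in keys)


def append_checked_against_table_only(
    table: List[Row], append_rows: List[Row], keys: List[str]
) -> List[Row]:
    """Hypothetical model of the reported behaviour:
    new rows are compared with the table, but not with each other."""
    existing = {key_of(r, keys) for r in table}
    return table + [r for r in append_rows if key_of(r, keys) not in existing]


def append_with_input_dedup(
    table: List[Row], append_rows: List[Row], keys: List[str]
) -> List[Row]:
    """Hypothetical model of the proposed behaviour:
    a key already seen in the table OR earlier in the input is skipped."""
    seen = {key_of(r, keys) for r in table}
    result = list(table)
    for r in append_rows:
        k = key_of(r, keys)
        if k not in seen:
            seen.add(k)  # remember the key so later input rows are dropped
            result.append(r)
    return result


table = [
    {"id": 1, "firstname": "Benito", "lastname": "Jackson"},
    {"id": 4, "firstname": "Maria", "lastname": "Pitt"},
    {"id": 6, "firstname": "Rosalia", "lastname": "Pitt"},
]
append_rows = [
    {"id": 3, "firstname": "Jose", "lastname": "Travolta"},
    {"id": 8, "firstname": "Jose", "lastname": "Travolta"},
]
keys = ["firstname", "lastname"]

reported = append_checked_against_table_only(table, append_rows, keys)
proposed = append_with_input_dedup(table, append_rows, keys)

print([r["id"] for r in reported])  # [1, 4, 6, 3, 8]
print([r["id"] for r in proposed])  # [1, 4, 6, 3]
```

With the input deduplicated on `firstname` and `lastname`, only one of the two "Jose Travolta" rows reaches the table.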

Note: I proposed a solution for this issue in the Scala library: https://github.com/MrPowers/jodie/pull/48/files

Thanks for reporting, @brayanjuls, and for fixing, @robertkossendey.