append_without_duplicates allows duplicates that occur within the input DataFrame
brayanjuls commented
Duplicates still get written when they occur within the input DataFrame and the duplicated values are not already in the table. For example, let's say we have the following table:
+----+---------+---------+
|  id|firstname| lastname|
+----+---------+---------+
|   1|   Benito|  Jackson|
|   4|    Maria|     Pitt|
|   6|  Rosalia|     Pitt|
+----+---------+---------+
And we want to insert this new dataframe:
+----+---------+---------+
|  id|firstname| lastname|
+----+---------+---------+
|   3|     Jose| Travolta|
|   8|     Jose| Travolta|
+----+---------+---------+
Calling the function with the following parameters will not avoid duplication in the table:
mack.append_without_duplicates(deltaTable, append_df, ["firstname","lastname"])
The resulting table will be:
+----+---------+---------+
|  id|firstname| lastname|
+----+---------+---------+
|   1|   Benito|  Jackson|
|   4|    Maria|     Pitt|
|   6|  Rosalia|     Pitt|
|   3|     Jose| Travolta|
|   8|     Jose| Travolta|
+----+---------+---------+
To avoid this, we should also deduplicate the input DataFrame on the same columns before appending the new data.
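To make the difference concrete, here is a minimal pure-Python model of the two behaviors, using the rows from the example above. This is only a sketch and not mack's implementation: rows are plain dicts rather than Spark rows, and the helper names (key_of, append_naive, append_deduplicated) are hypothetical. It also assumes the first row of each duplicate group is the one to keep.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def key_of(row: Row, cols: Sequence[str]) -> Tuple[object, ...]:
    """Build the comparison key from the columns that define a duplicate."""
    return tuple(row[c] for c in cols)


def append_naive(table: List[Row], new_rows: List[Row], cols: Sequence[str]) -> List[Row]:
    """Model of the reported behavior: new rows are checked against the table only."""
    existing = {key_of(r, cols) for r in table}
    return table + [r for r in new_rows if key_of(r, cols) not in existing]


def append_deduplicated(table: List[Row], new_rows: List[Row], cols: Sequence[str]) -> List[Row]:
    """Model of the proposed fix: keys already seen in the input are tracked too."""
    seen = {key_of(r, cols) for r in table}
    result = list(table)
    for row in new_rows:
        k = key_of(row, cols)
        if k not in seen:  # skips duplicates of the table AND of earlier input rows
            seen.add(k)
            result.append(row)
    return result


# Data taken from the example in this issue.
table = [
    {"id": 1, "firstname": "Benito", "lastname": "Jackson"},
    {"id": 4, "firstname": "Maria", "lastname": "Pitt"},
    {"id": 6, "firstname": "Rosalia", "lastname": "Pitt"},
]
append_rows = [
    {"id": 3, "firstname": "Jose", "lastname": "Travolta"},
    {"id": 8, "firstname": "Jose", "lastname": "Travolta"},
]
cols = ["firstname", "lastname"]

naive = append_naive(table, append_rows, cols)
fixed = append_deduplicated(table, append_rows, cols)

print([r["id"] for r in naive])  # [1, 4, 6, 3, 8]
print([r["id"] for r in fixed])  # [1, 4, 6, 3]
```

With a real Spark DataFrame, the in-frame deduplication step would correspond to something like append_df.dropDuplicates(["firstname", "lastname"]) before the append. Note that Spark does not guarantee which row of a duplicate group survives, so the id that ends up in the table (3 or 8) may differ from this model.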
Note: I proposed a solution for this issue in the Scala library: https://github.com/MrPowers/jodie/pull/48/files
MrPowers commented
Thanks for reporting @brayanjuls and for fixing @robertkossendey