Possible deduplication solution that doesn't require a primary key
MrPowers opened this issue · 2 comments
MrPowers commented
We'll have to translate this to Python:
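import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Rank rows within each primary-key group; every row with rank > 1 is a duplicate.
val duplicates = df
  .select(<pk cols>)
  .withColumn("__file_path", col("_metadata.file_path"))
  .withColumn("__row_index", col("_metadata.row_index"))
  .withColumn(
    "rank",
    row_number().over(
      Window
        .partitionBy(<pk cols>)
        .orderBy(<pk cols>)))
  .filter("rank > 1")
  .drop("rank")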
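To make the ranking step concrete, here is a small pure-Python sketch of the same idea, with no Spark involved. It assumes each row is a dict that already carries `__file_path` and `__row_index`; `find_duplicates` is a hypothetical name used only for illustration. One difference to keep in mind: the window above orders by the partition keys themselves, so which row gets rank 1 is arbitrary, whereas this sketch simply keeps the first row it sees in input order.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

Row = Dict[str, Hashable]


def find_duplicates(
    rows: Sequence[Row], pk_cols: Sequence[str]
) -> List[Tuple[Hashable, Hashable]]:
    """Return (file_path, row_index) for every row after the first in its key group.

    Hypothetical helper: it mirrors the "rank > 1" filter, not the Spark API.
    """
    seen = set()
    duplicates = []
    for row in rows:
        # The tuple of primary-key values plays the role of partitionBy(<pk cols>).
        key = tuple(row[c] for c in pk_cols)
        if key in seen:
            # Second or later occurrence: this is a row the merge should delete.
            duplicates.append((row["__file_path"], row["__row_index"]))
        else:
            seen.add(key)
    return duplicates


rows = [
    {"id": 1, "__file_path": "a.parquet", "__row_index": 0},
    {"id": 1, "__file_path": "a.parquet", "__row_index": 1},
    {"id": 2, "__file_path": "a.parquet", "__row_index": 2},
    {"id": 1, "__file_path": "b.parquet", "__row_index": 0},
]
to_delete = find_duplicates(rows, ["id"])
```

Identifying each duplicate by file path and row index is what removes the need for a primary key that distinguishes otherwise identical rows.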
And then:
// Delete the duplicate rows by their physical location.
// Note: `df` must be a DeltaTable here, not a DataFrame.
df.alias("old")
  .merge(
    duplicates.alias("new"),
    "old.<pk1> = new.<pk1> AND ... AND old.<pkn> = new.<pkn>" +
      " AND old._metadata.file_path = new.__file_path" +
      " AND old._metadata.row_index = new.__row_index")
  .whenMatched()
  .delete()
  .execute()
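For the Python translation, the merge condition could be assembled from a list of column names rather than written out by hand. The sketch below is one way to do that; `build_merge_condition` is a hypothetical helper, not part of Delta Lake or Spark, and it only produces the condition string from the snippet above.

```python
from typing import Sequence


def build_merge_condition(pk_cols: Sequence[str]) -> str:
    """Build the merge condition for the duplicate-delete step (hypothetical helper).

    Matches on every primary-key column plus the file path and row index
    captured in the __file_path / __row_index columns of the "new" side.
    """
    if not pk_cols:
        raise ValueError("at least one primary-key column is required")
    clauses = [f"old.{c} = new.{c}" for c in pk_cols]
    clauses.append("old._metadata.file_path = new.__file_path")
    clauses.append("old._metadata.row_index = new.__row_index")
    return " AND ".join(clauses)


condition = build_merge_condition(["id", "ts"])
```

In the Python Delta Lake API the resulting string would presumably be passed as the second argument to `merge`, followed by `whenMatchedDelete()` and `execute()`. Whether `_metadata` fields resolve inside a merge condition has not been verified here.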
robertkossendey commented
Where is the row_index property documented?
robertkossendey commented
Ahh, found it! ;) https://issues.apache.org/jira/browse/SPARK-37980