MrPowers/mack

Possible deduplication solution that doesn't require a primary key

MrPowers opened this issue · 2 comments

We'll have to translate this to Python:

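import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Tag every row with its physical location, then rank rows within each primary-key group.
val duplicates = df
  .select(<pk cols>)
  .withColumn("__file_path", col("_metadata.file_path"))
  .withColumn("__row_index", col("_metadata.row_index"))
  .withColumn(
    "rank",
    row_number().over(
      Window
        .partitionBy(<pk cols>)
        .orderBy(<pk cols>)))
  // Rank 1 is the row to keep; every later rank is a duplicate.
  .filter("rank > 1")
  .drop("rank")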

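And then delete exactly those rows with a merge that matches on the primary key plus the file path and row index: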

// Here df must be a DeltaTable (merge is not defined on a DataFrame).
df.alias("old")
  .merge(
    duplicates.alias("new"),
    "old.<pk1> = new.<pk1> AND ... AND old.<pkn> = new.<pkn>" +
      " AND old._metadata.file_path = new.__file_path" +
      " AND old._metadata.row_index = new.__row_index")
  .whenMatched()
  .delete()
  .execute()
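
Before porting this, it may help to see the idea without Spark. The following is only a plain-Python sketch of the logic in the two snippets above, not the eventual mack implementation: rows are modeled as dicts, `file_path` and `row_index` stand in for the `_metadata` fields, and the function names `find_duplicate_locators` and `delete_by_locator` are hypothetical. The point it illustrates is that the (file path, row index) pair acts as a row identifier, so no unique key is needed to tell one copy of a duplicate from another.

```python
def find_duplicate_locators(rows, pk_cols):
    """Return (file_path, row_index) for every row after the first in each key group.

    Mirrors the "rank > 1" filter: the first row seen for a key is kept,
    every later row with the same key is reported as a duplicate.
    """
    seen = set()
    locators = []
    for row in rows:
        key = tuple(row[c] for c in pk_cols)
        if key in seen:
            locators.append((row["file_path"], row["row_index"]))
        else:
            seen.add(key)
    return locators


def delete_by_locator(rows, locators):
    """Mirror the merge + delete: drop only rows whose physical location matches."""
    doomed = set(locators)
    return [r for r in rows if (r["file_path"], r["row_index"]) not in doomed]


# Hypothetical data: three identical copies of id=1 spread over two files.
rows = [
    {"id": 1, "name": "a", "file_path": "f1", "row_index": 0},
    {"id": 1, "name": "a", "file_path": "f1", "row_index": 1},
    {"id": 2, "name": "b", "file_path": "f2", "row_index": 0},
    {"id": 1, "name": "a", "file_path": "f2", "row_index": 1},
]

dupes = find_duplicate_locators(rows, ["id"])
remaining = delete_by_locator(rows, dupes)
print(dupes)
print([(r["id"], r["file_path"], r["row_index"]) for r in remaining])
```

Matching on the key columns alone would delete every copy of `id = 1`, including the one we want to keep; adding the locator to the match condition is what lets exactly one copy survive.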

Where is the row_index property documented?