RedisLabs/spark-redis

Append mode doesn't replace entire row if key collision happens

What did I do

import redis  # redis-py client, used below to inspect the stored hash

df = (
  spark
  .sql("SELECT 'test' AS key, 123 AS col_a, 223 AS col_b")
)
(
  df
  .write
  .format("org.apache.spark.sql.redis")
  .option("host", redis_host)
  .option("port", redis_port)
  .option("ssl", "true")
  .option("table", "test_append_behavour")
  .option("key.column", "key")
  .mode("overwrite")
  .save()
)
r = redis.Redis(host=redis_host, port=redis_port, db=0, ssl=True)
print(r.hgetall("test_append_behavour:test"))

# {b'col_b': b'223', b'col_a': b'123'}
df2 = (
  spark
  .sql("SELECT 'test' AS key, 324 AS col_a, 423 AS col_c")
)
(
  df2
  .write
  .format("org.apache.spark.sql.redis")
  .option("host", redis_host)
  .option("port", redis_port)
  .option("ssl", "true")
  .option("table", "test_append_behavour")
  .option("key.column", "key")
  .mode("append")
  .save()
)
r = redis.Redis(host=redis_host, port=redis_port, db=0, ssl=True)
print(r.hgetall("test_append_behavour:test"))

# {b'col_b': b'223', b'col_a': b'324', b'col_c': b'423'}

What did I see

test_append_behavour:test now has 3 fields

{b'col_b': b'223', b'col_a': b'324', b'col_c': b'423'}

What did I expect

test_append_behavour:test should only have 2 fields from df2

{b'col_a': b'324', b'col_c': b'423'}

"Please note, when key collision happens and SaveMode.Append is set, the former row is replaced with a new one."

According to the docs quoted above, the row from df should be replaced by the row from df2 in append mode, because they share the same key.

However, col_b from df is still there after the append, which means the entire row is not replaced. Only the fields whose names collide (here col_a) are overwritten; fields that exist only in the old row are left in place.

fe2s commented

Hi @tsekityam,
SaveMode.Append uses the HMSET command internally, so it may not completely overwrite the row if the schema of the new DataFrame is different. You are right, the documentation is not accurate.
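
To make the difference concrete, here is a minimal sketch of the two behaviours using a plain Python dict as a stand-in for a Redis hash. No Redis server or spark-redis code is involved; the helper names `hmset_merge` and `replace_row` are hypothetical and exist only for this illustration. It assumes, as described above, that HMSET sets the given fields and leaves every other field of the hash untouched.

```python
# In-memory model of a Redis hash store: {key: {field: value}}.
# This only illustrates the semantics; it is not how spark-redis is implemented.

KEY = "test_append_behavour:test"


def hmset_merge(store, key, fields):
    """Mimic HMSET: set the given fields, leave all other fields untouched."""
    store.setdefault(key, {}).update(fields)


def replace_row(store, key, fields):
    """Mimic DEL followed by HMSET: drop the old hash, then write the new one."""
    store.pop(key, None)
    hmset_merge(store, key, fields)


row1 = {"col_a": "123", "col_b": "223"}  # row written from df
row2 = {"col_a": "324", "col_c": "423"}  # row written from df2

# What append mode does today: the second write is merged into the first.
merged = {}
hmset_merge(merged, KEY, row1)
hmset_merge(merged, KEY, row2)
print(merged[KEY])
# {'col_a': '324', 'col_b': '223', 'col_c': '423'}

# What the documentation wording suggests: the old row is replaced.
replaced = {}
hmset_merge(replaced, KEY, row1)
replace_row(replaced, KEY, row2)
print(replaced[KEY])
# {'col_a': '324', 'col_c': '423'}
```

The first result matches the hash observed in the report, and the second matches the expected one. Against a real Redis instance, deleting the key (DEL) before writing the hash would presumably give the replace behaviour, but that is a workaround on the client side, not something append mode does.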