Overwrite at partitionBy folder level
output dir: x/y
partition by colTicker
New source data for colTicker can arrive every day, sometimes with new values and sometimes with existing ones.
example:
day1 had 2 files, under the folder structure below:
colTicker=AMZN
colTicker=MSFT (let's say 150 rows in the file)
day2 had 1 file:
colTicker=GOOG
day3 had 3 files:
colTicker=MSFT (let's say 120 rows in the file)
colTicker=IBM
colTicker=LYFT
I want to replace the data for a colTicker that already exists (so no duplicates or two days' versions are stored), but keep the data for all other distinct colTicker partitions.
So my expected output data after each day is:
day1 output data under outputdir:
colTicker=AMZN
colTicker=MSFT (150)
day2 output data under outputdir:
colTicker=AMZN (existing data should be untouched, was loaded in day1 only, never removed)
colTicker=MSFT (150)
colTicker=GOOG
day3 output data under outputdir:
colTicker=AMZN (existing data should be untouched, was loaded in day1 only, never removed)
colTicker=MSFT (120, previous 150 should be overwritten)
colTicker=GOOG (existing data should be untouched, was loaded in day2 only, never removed)
colTicker=IBM
colTicker=LYFT
The issue is:
If I try savemode=append, then after day3 MSFT shows 270 rows (150 + 120).
If I try savemode=overwrite, then after day3 the AMZN/GOOG data is gone.
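To make the three behaviours concrete, here is a toy model in plain Python. It is not Spark or Metorikku code: the output dir is just a dict of partition to row count, the function names are made up for illustration, and every row count other than MSFT's 150 and 120 is a placeholder.

```python
# Toy model only: a "write" takes the current output (partition -> row count)
# plus one day's incoming data and returns the new output.

def write_append(existing, incoming):
    """savemode=append: incoming rows are added to what is already there."""
    result = dict(existing)
    for ticker, rows in incoming.items():
        result[ticker] = result.get(ticker, 0) + rows
    return result


def write_static_overwrite(existing, incoming):
    """savemode=overwrite (default static mode): the whole output is replaced."""
    return dict(incoming)


def write_dynamic_overwrite(existing, incoming):
    """overwrite + partitionOverwriteMode=dynamic: only partitions present
    in the incoming data are replaced, all others are left alone."""
    result = dict(existing)
    result.update(incoming)
    return result


# MSFT counts come from the example above; the 10s are placeholder values.
days = [
    {"AMZN": 10, "MSFT": 150},             # day1
    {"GOOG": 10},                          # day2
    {"MSFT": 120, "IBM": 10, "LYFT": 10},  # day3
]


def run(write):
    """Apply one write strategy to all three days in order."""
    output = {}
    for incoming in days:
        output = write(output, incoming)
    return output


append_result = run(write_append)
static_result = run(write_static_overwrite)
dynamic_result = run(write_dynamic_overwrite)

print("append: ", append_result)
print("static: ", static_result)
print("dynamic:", dynamic_result)
```

Only the last strategy matches the expected day3 output: MSFT at 120 rows, with AMZN and GOOG still present.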
Does Metorikku support .config("spark.sql.sources.partitionOverwriteMode", "dynamic")? See https://stackoverflow.com/a/56570869/8874837
Definitely. Simply pass it as part of spark-submit with --conf spark.sql.sources.partitionOverwriteMode=dynamic, or, if you're running the standalone, use -Dspark.sql.sources.partitionOverwriteMode=dynamic
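For reference, this is roughly what that setting does in a plain PySpark job, outside of Metorikku. Treat it as a sketch under assumptions: `spark` is an existing SparkSession, `df` is one day's data with a colTicker column, `write_day` is a hypothetical helper name, and Parquet is a guess since the thread does not say which file format is used. Dynamic partition overwrite needs Spark 2.3 or later.

```python
# Sketch only: `spark` (SparkSession) and `df` (DataFrame) are supplied by the caller.
CONF_KEY = "spark.sql.sources.partitionOverwriteMode"


def write_day(spark, df, output_dir="x/y"):
    """Overwrite only the colTicker partitions that appear in `df`."""
    # Same effect as passing --conf spark.sql.sources.partitionOverwriteMode=dynamic
    spark.conf.set(CONF_KEY, "dynamic")
    (
        df.write
        .mode("overwrite")        # overwrite, but limited to touched partitions
        .partitionBy("colTicker")
        .parquet(output_dir)      # Parquet is an assumption, not from the thread
    )
```

With the setting on, savemode=overwrite replaces colTicker=MSFT on day3 and leaves colTicker=AMZN and colTicker=GOOG as they were.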