YotpoLtd/metorikku

overwrite at partitionby folder level

Closed this issue · 2 comments

output dir: x/y
partition by colTicker

new source data for colTicker can appear every day, sometimes with new values, sometimes with existing values.

example:
day1 had 2 files under the folder structure below:
colTicker=AMZN
colTicker=MSFT (let's say 150 rows in the file)

day2 had 1 file:

colTicker=GOOG

day3 had 3 files:
colTicker=MSFT (let's say 120 rows in the file)
colTicker=IBM
colTicker=LYFT

i want to replace the data for a colTicker that already exists (so not storing 2 days' versions/duplicates) but keep the data for all other colTicker partitions

so my expected output data after each day:

day1 output data under outputdir:
colTicker=AMZN
colTicker=MSFT (150)

day2 output data under outputdir:
colTicker=AMZN (existing data should be untouched, was loaded in day1 only, never removed)
colTicker=MSFT (150)
colTicker=GOOG

day3 output data under outputdir:
colTicker=AMZN (existing data should be untouched, was loaded in day1 only, never removed)
colTicker=MSFT (120, previous 150 should be overwritten)
colTicker=GOOG (existing data should be untouched, was loaded in day2 only, never removed)
colTicker=IBM
colTicker=LYFT

issue is:
if i try savemode=append then after day3 MSFT shows 270 rows
if i try savemode=overwrite then after day3 AMZN/GOOG data is gone
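
The difference between the two save modes and the desired behaviour can be sketched with a toy model in plain Python. This is not Spark or Metorikku code: the output directory is modelled as a dict from colTicker value to row count, all function names are hypothetical, and only the MSFT counts (150 and 120) come from the example above; the other row counts are placeholders.

```python
# Toy model: the output directory is a dict mapping partition value -> row count.
# Only the MSFT counts (150, 120) come from the issue; other counts are placeholders.
DAYS = [
    {"AMZN": 10, "MSFT": 150},             # day1
    {"GOOG": 10},                          # day2
    {"MSFT": 120, "IBM": 10, "LYFT": 10},  # day3
]


def append(existing, batch):
    """savemode=append: new rows are added next to whatever is already there."""
    merged = dict(existing)
    for ticker, rows in batch.items():
        merged[ticker] = merged.get(ticker, 0) + rows
    return merged


def overwrite(existing, batch):
    """savemode=overwrite: the whole output directory is replaced by the batch."""
    return dict(batch)


def replace_touched(existing, batch):
    """Desired behaviour: replace only the partitions present in the batch."""
    return {**existing, **batch}


def replay(write):
    """Apply one write strategy to day1, day2 and day3 in order."""
    output = {}
    for batch in DAYS:
        output = write(output, batch)
    return output


if __name__ == "__main__":
    for write in (append, overwrite, replace_touched):
        print(write.__name__, replay(write))
```

After day3, `append` leaves MSFT at 270 rows, `overwrite` keeps only the day3 tickers, and `replace_touched` keeps all five tickers with MSFT at 120.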

does metorikku support .config("spark.sql.sources.partitionOverwriteMode", "dynamic")? i.e. https://stackoverflow.com/a/56570869/8874837

Definitely. Simply pass it to spark-submit as --conf spark.sql.sources.partitionOverwriteMode=dynamic, or, if you're running the standalone, use -Dspark.sql.sources.partitionOverwriteMode=dynamic
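
For reference, this is roughly what the setting does in plain Spark, outside Metorikku. It is a minimal PySpark sketch, not Metorikku's API: the function name and its parameters are hypothetical, `spark` is assumed to be an existing SparkSession, `df` a DataFrame that contains the colTicker column, and Parquet as the output format is an assumption. The setting only changes what an overwrite replaces, so the save mode stays overwrite.

```python
def overwrite_touched_partitions(spark, df, output_dir, partition_col="colTicker"):
    """Overwrite only the partitions that appear in df (hypothetical helper).

    Assumes Spark 2.3 or later, where partitionOverwriteMode is available.
    """
    # With "dynamic", an overwrite replaces only the partitions present in df;
    # the default ("static") would first clear everything under output_dir.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    # Save mode must be overwrite; append would keep the old MSFT rows.
    df.write.mode("overwrite").partitionBy(partition_col).parquet(output_dir)
```

Called on day3 with output_dir "x/y", this would rewrite the MSFT, IBM and LYFT folders and leave AMZN and GOOG as they are.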