YotpoLtd/metorikku

How to use a one-time load with batches?

Rap70r opened this issue · 1 comment

Hello,

Is it possible to run a spark-submit job that loads records in batches with trigger intervals and then terminates the app when all records are consumed?
I have the following config:

metrics:
  - s3://path/to/metric.yaml
inputs:
  cdc_events:
    kafka:
      servers:
        - kafka:9092
      topic: test.dbo.some_table
      schemaRegistryUrl: schema-registry:8081
      options:
        startingOffsets: earliest
        maxOffsetsPerTrigger: 15000000
output:
  hudi:
    dir: s3://some/bucket/path
    parallelism: 3000
    operation: upsert
    storageType: COPY_ON_WRITE
    maxVersions: 1
    options:
      hoodie.datasource.hive_sync.enable: false
streaming:
  triggerMode: ProcessingTime
  triggerDuration: 180 seconds
  outputMode: append
  checkpointLocation: s3://some/bucket/path/checkpoint
  batchMode: true

This runs continuously even after all records are consumed from the Kafka topic. Is it possible to configure it to exit when all records are consumed, just like when using triggerMode: Once?
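For reference, this is a sketch of the streaming section with the Once trigger, everything else in the config unchanged. Note that, as far as I know, Spark's Once trigger processes all available data in a single batch, so maxOffsetsPerTrigger has no effect there:

streaming:
  triggerMode: Once
  outputMode: append
  checkpointLocation: s3://some/bucket/path/checkpoint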

Thank you

I'm not aware of how to do this.
In streaming mode you can't define the ending offsets at all, so something needs to tell Spark when to stop consuming; that's out of Metorikku's control, since Spark is the one actually waiting for more records.
You can look here to see if there's some config I'm missing:
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html

You can use a batch query and give it the ending offsets, if you know them.
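For illustration, here is a minimal Spark/Scala sketch of that batch approach, reusing the topic and servers from the config above. The endingOffsets option (documented in the Kafka integration guide linked above for batch queries) is what makes the job terminate; the partition/offset numbers are hypothetical, and a plain parquet sink stands in for the Hudi output just to keep the sketch self-contained:

import org.apache.spark.sql.SparkSession

object KafkaBatchLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-batch-load").getOrCreate()

    // Batch read: unlike readStream, spark.read with the kafka source accepts
    // endingOffsets, so the job ends once that offset range is consumed.
    val df = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("subscribe", "test.dbo.some_table")
      .option("startingOffsets", "earliest")
      // Per topic/partition JSON; -1 means "latest" for an ending offset.
      // The partition and offset values here are hypothetical.
      .option("endingOffsets", """{"test.dbo.some_table":{"0":15000000}}""")
      .load()

    // Placeholder sink: the real output in this issue is Hudi.
    df.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
      .write
      .parquet("s3://some/bucket/path")

    spark.stop()
  }
}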