japila-books/spark-structured-streaming-internals

Spark Structured Streaming read of HDFS files fails if data is read immediately

vishal98 opened this issue · 2 comments

We are reading the data back as a static DataFrame after writing it to HDFS using the Spark Structured Streaming API:

query = (df.writeStream
    .trigger(processingTime='5 seconds')
    # the second argument of the foreachBatch function is the batch id,
    # not a partition id
    .foreachBatch(lambda batch_df, batch_id:
        batch_df.write
            .option("path", target_table_dir)
            .format("parquet")
            .mode("append")
            .saveAsTable(target_table))
    .start())
When we immediately try to read the same data back from the Hive table, we get a "partition not found" exception. If we read the data after a delay, we see the correct data. It seems that even though execution has stopped and the Hive metastore has been updated, Spark is still writing data out to HDFS.
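For reproduction, the read side is essentially this (a minimal sketch; it assumes the read happens in the same Spark session right after start() returns):

# start() returns immediately, so this read races the first micro-batch;
# it can fail with a partition/file-not-found style error.
static_df = spark.read.table(target_table)
static_df.count()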

Can we have a way to update the Hive metastore only once the data has been completely written to HDFS?
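One possible workaround, as a sketch rather than a confirmed fix (write_batch is an illustrative helper name, and spark is assumed to be the active SparkSession; foreachBatch functions run on the driver, so closing over it is fine): write the Parquet files to the target path first, and only touch the metastore after the write call has returned, so the table registration never precedes the data.

def write_batch(batch_df, batch_id):
    # The write is synchronous: it returns only after the batch's tasks
    # have committed their output files to target_table_dir.
    batch_df.write.mode("append").parquet(target_table_dir)
    # Only then touch the metastore: register the table on first use,
    # pointing at the already-committed files, then refresh it so
    # readers pick up the new data.
    spark.sql(
        "CREATE TABLE IF NOT EXISTS {} USING parquet LOCATION '{}'"
        .format(target_table, target_table_dir))
    spark.catalog.refreshTable(target_table)

query = (df.writeStream
    .trigger(processingTime='5 seconds')
    .foreachBatch(write_batch)
    .start())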
I have asked the same question on Stack Overflow.

I have asked the same question on Stack Overflow.

Can you give the link? Stack Overflow seems a better place to continue this conversation.

I believe it's spark structured stream read to hdfs files fails if data is read immediately, so I'm closing this issue as SO is a much better place to discuss it.