The 'with_input_file_name_col' option doesn't work with File offsets
yruslan opened this issue · 2 comments
Hi @yruslan
The chosen approach, as recommended by and with the assistance of the client's mainframe specialist (Pascale), was to adapt the copybook.
With the info in the README and input from prior issues (issue 153 and issue 72), I managed to successfully extract the desired data and omit the header and footer.
```scala
import org.apache.spark.sql.functions._
// import org.apache.spark.sql.SparkSession

// Adapted the copybook as recommended by the client's mainframe specialist (Pascale):
// a REDEFINES is used on the 2nd level (non-root level).
//
// The approach is based on the following input:
//   https://github.com/AbsaOSS/cobrix#automatic-segment-redefines-filtering
//   #153
//   #72

// Extract the file name (last path component) from the full input path.
spark.udf.register("get_file_name", (path: String) => path.split("/").last)

val cobolDataframe = spark
  .read
  .format("za.co.absa.cobrix.spark.cobol.source")
  .option("schema_retention_policy", "collapse_root")
  .option("segment_field", "REC_GSH_STUB_IDENT")
  .option("segment_id_level0", "G")
  .option("segment_id_level1", "2")
  .option("segment_id_level2", "C")
  .option("redefine_segment_id_map:0", "REC-GSH-STUB => G")
  .option("redefine_segment_id_map:1", "REC-GSH => C")
  .option("redefine_segment_id_map:2", "REC-GSH-STUB => 2")
  .option("pedantic", "true")
  .option("copybook", "file:///home/jovyan/data/BRAND/COPYBOOK_redefine_on_level_2.txt")
  .load("file:///home/jovyan/data/BRAND/initial_transformed")
  // Add the source file name as the DPSource column.
  .withColumn("DPSource", callUDF("get_file_name", input_file_name()))
```
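For readers unfamiliar with the UDF used above, here is a minimal sketch in plain Scala (no Spark needed) of what `get_file_name` computes for each value returned by `input_file_name()`. The object name `FileNameSketch` and the file name `part-0001.dat` are invented for illustration and do not come from the issue.

```scala
// Hypothetical helper object; mirrors the lambda registered as "get_file_name".
object FileNameSketch {
  // Keep only the text after the last "/" of a path or URI.
  def getFileName(path: String): String = path.split("/").last

  def main(args: Array[String]): Unit = {
    // "part-0001.dat" is an invented file name used only for this example.
    val fullPath = "file:///home/jovyan/data/BRAND/initial_transformed/part-0001.dat"
    println(getFileName(fullPath))
  }
}
```

Because the split is on `/` only, this assumes URI-style paths such as the `file:///` locations used in this thread.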
A simplified version of the copybook is:

           01  REC-GSH-GLOBAL.
          *
               03  REC-GSH-STUB.
                   05  REC-GSH-STUB-IDENT    PIC X(1).
                   05  REC-GSH-STUB-REST     PIC X(599).
          *
               03  REC-GSH REDEFINES REC-GSH-STUB.
                   05  REC-GSH-REAL-IDENT    PIC X(1).
                   05  REC-GSH-REAL-REST     PIC X(599).
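To illustrate how the one-character identifier selects a layout, the following sketch mimics the segment-to-redefine mapping from the options above in plain Scala. It is not how Cobrix is implemented; `SegmentSketch`, `redefineFor`, and the sample records are hypothetical, and records are treated as already-decoded 600-character strings.

```scala
// Hypothetical illustration of the segment id => redefine mapping.
object SegmentSketch {
  // Record length implied by the simplified copybook: 1 + 599 characters.
  val RecordLength = 600

  // Same mapping as the redefine_segment_id_map options used in this thread.
  val segmentIdToRedefine: Map[String, String] = Map(
    "G" -> "REC-GSH-STUB",
    "2" -> "REC-GSH-STUB",
    "C" -> "REC-GSH"
  )

  // The first character plays the role of REC-GSH-STUB-IDENT.
  // Returns None when the identifier is not in the mapping.
  def redefineFor(record: String): Option[String] = {
    require(record.length == RecordLength, s"expected $RecordLength characters")
    segmentIdToRedefine.get(record.substring(0, 1))
  }

  def main(args: Array[String]): Unit = {
    // Invented sample records: an identifier followed by 599 blanks.
    val samples = Seq("G", "C", "2", "X").map(id => id + " " * 599)
    samples.foreach(r => println(s"${r.head} -> ${redefineFor(r)}"))
  }
}
```

Both groups in the simplified copybook occupy the same 600 characters, so only the identifier decides which field names apply to a given record.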
Thank you very much for the assistance. I recommend closing the issue.
The entire adapted copybook will be shared with you.
Thanks in advance,
Bart Debersaques
Glad you've found a workaround. Nevertheless, `.option("with_input_file_name_col", "DPSource")`
could still be used with `.option("file_start_offset", 100)`
or `.option("file_end_offset", 100)`
after this fix is released.
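As a rough sketch of that combination, the options could be collected in a plain `Map` and passed to Spark's `DataFrameReader.options`. The option names and the value `100` come from this thread; `OffsetOptionsSketch` is a hypothetical name, the "bytes" wording in the comments reflects my reading of the Cobrix README, and the combination is only expected to work on a Cobrix version that includes the fix.

```scala
// Hypothetical container for the reader options discussed in this thread.
object OffsetOptionsSketch {
  // Either offset can be used on its own; both are shown for illustration.
  val readerOptions: Map[String, String] = Map(
    "with_input_file_name_col" -> "DPSource", // column that receives the input file name
    "file_start_offset" -> "100",             // assumed: bytes skipped at the start of each file
    "file_end_offset" -> "100"                // assumed: bytes skipped at the end of each file
  )

  def main(args: Array[String]): Unit =
    // Print the equivalent chained .option(...) calls.
    readerOptions.foreach { case (k, v) => println(s""".option("$k", "$v")""") }
}
```

In a Spark shell the map could be applied with `.options(OffsetOptionsSketch.readerOptions)` on the reader, in place of the `input_file_name()` UDF workaround.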