AbsaOSS/cobrix

The 'with_input_file_name_col' option doesn't work with File offsets

yruslan opened this issue · 2 comments

Describe the bug

The 'with_input_file_name_col' option doesn't work with File offsets.

See #221.

Hi @yruslan

The chosen approach, as recommended and assisted by the client's mainframe specialist (Pascale), was to adapt the copybook.

With the information in the README and input from prior issues #153 and #72, I managed to successfully extract the desired data and omit the header and footer.

```scala
import org.apache.spark.sql.functions._

// import org.apache.spark.sql.SparkSession

// Adapted the copybook as recommended by the client's mainframe specialist (Pascale):
// used a REDEFINES on the 2nd level (non-root level).
//
// The approach is based on the following input:
// https://github.com/AbsaOSS/cobrix#automatic-segment-redefines-filtering
// #153
// #72

// UDF that keeps only the last path component, i.e. the file name
spark.udf.register("get_file_name", (path: String) => path.split("/").last)

val cobolDataframe = spark
  .read
  .format("za.co.absa.cobrix.spark.cobol.source")
  .option("schema_retention_policy", "collapse_root")
  .option("segment_field", "REC_GSH_STUB_IDENT")
  .option("segment_id_level0", "G")
  .option("segment_id_level1", "2")
  .option("segment_id_level2", "C")
  .option("redefine_segment_id_map:0", "REC-GSH-STUB => G")
  .option("redefine_segment_id_map:1", "REC-GSH => C")
  .option("redefine_segment_id_map:2", "REC-GSH-STUB => 2")
  .option("pedantic", "true")
  .option("copybook", "file:///home/jovyan/data/BRAND/COPYBOOK_redefine_on_level_2.txt")
  .load("file:///home/jovyan/data/BRAND/initial_transformed")
  // Workaround: derive the source file name from Spark's input_file_name()
  .withColumn("DPSource", callUDF("get_file_name", input_file_name()))
```
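For reference, the file name extraction used by the `get_file_name` UDF can be tried outside Spark. This is a minimal sketch of the same `split("/").last` logic; the object name `FileNameSketch` and the sample file name `part1.dat` are made up for illustration.

```scala
// Plain Scala version of the get_file_name UDF logic from the snippet above.
// `FileNameSketch` and the sample path below are hypothetical.
object FileNameSketch {
  // Keep only the part of the path after the last "/".
  def getFileName(path: String): String = path.split("/").last

  def main(args: Array[String]): Unit = {
    val sample = "file:///home/jovyan/data/BRAND/initial_transformed/part1.dat"
    println(getFileName(sample)) // prints "part1.dat"
  }
}
```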
A simplified version of the copybook is:

           01  REC-GSH-GLOBAL.
          *
               03  REC-GSH-STUB.
                   05  REC-GSH-STUB-IDENT   PIC X(1).
                   05  REC-GSH-STUB-REST    PIC X(599).
          *
               03  REC-GSH REDEFINES REC-GSH-STUB.
                   05  REC-GSH-REAL-IDENT   PIC X(1).
                   05  REC-GSH-REAL-REST    PIC X(599).
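To make the relationship between the segment id and the two redefines easier to see, here is a rough sketch in plain Scala. It only mirrors the `redefine_segment_id_map` options from the snippet above and is not how Cobrix implements the lookup; the object and method names are hypothetical, and the record is assumed to be already decoded to a string.

```scala
// Illustration only: not Cobrix's internal code.
// `SegmentRedefineSketch` and `redefineFor` are hypothetical names.
object SegmentRedefineSketch {
  // Mirrors the redefine_segment_id_map:0..2 options used in the read.
  val redefineBySegmentId: Map[String, String] = Map(
    "G" -> "REC-GSH-STUB",
    "C" -> "REC-GSH",
    "2" -> "REC-GSH-STUB"
  )

  // The first character of a decoded record corresponds to
  // REC-GSH-STUB-IDENT (PIC X(1)); look up which redefine it selects.
  def redefineFor(record: String): Option[String] =
    record.headOption.flatMap(c => redefineBySegmentId.get(c.toString))
}
```

With this mapping, a record starting with `C` resolves to `REC-GSH`, while records starting with `G` or `2` resolve to `REC-GSH-STUB`; any other leading character has no mapping.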

Thank you very much for the assistance. I recommend closing the issue.

@kriswijnants

The entire adapted copybook will be shared with you.

Thanks in advance,

Bart Debersaques

Glad you've found a workaround. Nevertheless, once this fix is released, `.option("with_input_file_name_col", "DPSource")` can still be used together with `.option("file_start_offset", 100)` or `.option("file_end_offset", 100)`.
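As a sketch of what that combination might look like once the fix is available: the option names and the value `100` come from the comment above, while the copybook and data paths are placeholders. It assumes a spark-shell or notebook session where `spark` is in scope and Cobrix is on the classpath.

```scala
// Sketch only: assumes `spark` is in scope and the fix for this issue is
// released. Both paths are placeholders, not real locations.
val withFileName = spark
  .read
  .format("za.co.absa.cobrix.spark.cobol.source")
  .option("copybook", "file:///path/to/copybook.cpy")   // placeholder path
  .option("file_start_offset", 100)                     // bytes to skip at the start of each file
  .option("file_end_offset", 100)                       // bytes to skip at the end of each file
  .option("with_input_file_name_col", "DPSource")       // column that receives the input file name
  .load("file:///path/to/data")                         // placeholder path
```

With this in place, the `get_file_name` UDF and the `input_file_name()` call from the workaround would no longer be needed to populate `DPSource`.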