AbsaOSS/cobrix

record_format VB file fails with length of BDW block is too big

Opened this issue · 7 comments

When converting a Variable Block format EBCDIC file, I got the error "The length of BDW block is too big". I tried the following options but am still getting the same error.

dataframe = (
    spark.read.format("cobol")
    .option("copybook", util_params["copybook_path"])
    .option("encoding", "ebcdic")
    .option("schema_retention_policy", "collapse_root")
    .option("record_format", "VB")
    .option("is_bdw_big_endian", "true")
    .option("is_rdw_big_endian", "true")
    .option("bdw_adjustment", -4)
    .option("rdw_adjustment", -4)
    .option("generate_record_id", True)
    .load(file_path)
)

Error:

WARN BlockManager: Putting block rdd_1_0 failed due to exception java.lang.IllegalStateException: The length of BDW block is too big. Got 1223880942. Header: 200,242,240,242, offset: 0..
WARN BlockManager: Block rdd_1_0 could not be removed as it was not found on disk or in memory
ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalStateException: The length of BDW block is too big. Got 1223880942. Header: 200,242,240,242, offset: 0.

Please suggest a way to fix this issue. Could you also share an example where you have tested the VB scenario with an EBCDIC file and copybook, for reference?

Hi @yruslan

Thanks for the reply. We have already gone through the code above but did not find a solution to our problem. As mentioned, whenever we try to use the VB option with the adjustments, we get the "The length of BDW block is too big" error.

dataframe = (
    spark.read.format("cobol")
    .option("copybook", util_params["copybook_path"])
    .option("encoding", "ebcdic")
    .option("schema_retention_policy", "collapse_root")
    .option("record_format", "VB")
    .option("is_bdw_big_endian", "true")
    .option("is_rdw_big_endian", "true")
    .option("bdw_adjustment", -4)
    .option("rdw_adjustment", -4)
    .option("generate_record_id", True)
    .load(file_path)
)

WARN BlockManager: Putting block rdd_1_0 failed due to exception java.lang.IllegalStateException: The length of BDW block is too big. Got 1223880942. Header: 200,242,240,242, offset: 0..
WARN BlockManager: Block rdd_1_0 could not be removed as it was not found on disk or in memory
ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalStateException: The length of BDW block is too big. Got 1223880942. Header: 200,242,240,242, offset: 0.

Can you please provide a solution for the above issue?

The generic approach is to simulate the record header parser manually in a hex editor in order to understand the headers of your file. Things like:

  • What is the first BDW header and the first RDW header?
  • What is the offset of the second RDW of the first block?
  • What is the offset and value of the next BDW header?

Based on this you can determine if you need to apply any adjustments to BDW and/or RDW; a sketch of such a manual check is shown below.
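
For example, a minimal sketch of such a check in plain Python (not part of Cobrix; the file name is a hypothetical placeholder, and big-endian headers whose lengths include their own 4 bytes are assumed):

import struct

# Read the start of the file and decode the first block's headers.
with open("my_vb_file.dat", "rb") as f:
    data = f.read(64 * 1024)

bdw_len = struct.unpack(">H", data[0:2])[0]    # BDW: 2-byte big-endian length + 2 flag/zero bytes
rdw1_len = struct.unpack(">H", data[4:6])[0]   # first RDW, right after the 4-byte BDW
print("first BDW bytes:", list(data[0:4]), "-> block length", bdw_len)
print("first RDW bytes:", list(data[4:8]), "-> record length", rdw1_len)

# Offset of the second RDW of the first block (holds if the RDW length includes its own 4 bytes).
second_rdw_offset = 4 + rdw1_len
print("second RDW bytes:", list(data[second_rdw_offset:second_rdw_offset + 4]))

# Offset of the next BDW header (holds if the BDW length includes its own 4 bytes).
next_bdw_offset = bdw_len
print("next BDW bytes:", list(data[next_bdw_offset:next_bdw_offset + 4]))

If the values at those offsets do not look like plausible lengths, the bdw_adjustment and rdw_adjustment options can compensate for conventions where the headers' own bytes are counted differently.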

The error message indicates that the record extractor encountered a wrong BDW block. This can happen when there is no BDW header at the specified offset.

Now I've noticed that the error happens at offset 0. Are you sure your file has BDW+RDW headers?

What are the first 8 bytes of your file?
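
A quick way to sanity-check that is to decode the header bytes reported in the error message as EBCDIC text; this is just a sketch, with cp037 used as a representative EBCDIC code page:

# Bytes reported in the error message: Header: 200,242,240,242
hdr = bytes([200, 242, 240, 242])   # 0xC8 0xF2 0xF0 0xF2
print(hdr.decode("cp037"))          # prints 'H202' -- readable EBCDIC text rather than a binary length

A real BDW normally starts with a small binary block length (at most 32760 for the non-extended format) followed by two zero bytes, not printable EBCDIC characters.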

I am having a similar error. My file is located at https://raw.githubusercontent.com/jaysara/spark-cobol-jay/main/data/ebcdic_bdwrdw.dat

This is how this file is parsed outside of Spark:
Record -1:
BDW = 430
RDW = 426
HEADER Record Type = 422
Record -2 :
BDW = 968
RDW = 964
BASE-SEGMENT Record Type = 960
Record -3 :
BDW = 768
RDW = 764
BASE-SEGMENT Record Type = 760
Record -4 :
BDW = 1034
RDW = 1030
BASE-SEGMENT Record Type = 1026
.......
(the last record is the TRAILER.)
Record -12 :
BDW = 430
RDW = 426
TRAILER Record Size = 420

This file has a total of 12 records (including the header and trailer). I am using the following:

        Dataset<Row> df1 = spark.read()
                .format("za.co.absa.cobrix.spark.cobol.source")
                .option("copybook_contents", copybook)
                .option("encoding", "ebcdic")
                .option("record_format", "VB") // Variable length records
                .option("is_rdw_big_endian", "true")
                .option("is_bdw_big_endian", "true")
                .option("schema_retention_policy", "collapse_root")
                .option("bdw_adjustment", -4)
                .option("rdw_adjustment", -4)

Here are my copybook contents:

copybook =
                        01  RECORD.
                                   05  BASE-SEGMENT                   PIC X(123)

The above file has a BDW for only one record. That may not be the typical case; more commonly, we will have one BDW for multiple records, e.g.:

Record -2 :
BDW = 1732
RDW = 964
BASE-SEGMENT Record size = 960
Record -3 :
RDW = 764
BASE-SEGMENT Record size = 760
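
For reference, and assuming the IBM convention in which each BDW and RDW length includes its own 4 bytes, those numbers are self-consistent:

# 4-byte BDW plus the two RDW-inclusive record lengths from the example above
assert 4 + 964 + 764 == 1732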

What else should I specify? Here is the error that I get:

Caused by: java.lang.IllegalStateException: The length of BDW block is too big. Got 1895101420. Header: 240,244,243,240, offset: 0.
	at za.co.absa.cobrix.cobol.reader.recordheader.RecordHeaderDecoderCommon.reportTooLargeBlockLength(RecordHeaderDecoderCommon.scala:53)
	at za.co.absa.cobrix.cobol.reader.recordheader.RecordHeaderDecoderBdw.validateBlockLength(RecordHeaderDecoderBdw.scala:86)
	at za.co.absa.cobrix.cobol.reader.recordheader.RecordHeaderDecoderBdw.getRecordLength(RecordHeaderDecoderBdw.scala:48)

Hi,

The example file starts with 0xC3 0xB0 0xC3 0xB4 (which is the same as reported by the error message: Header: 240,244,243,240).

Please clarify how you parsed the file to get BDW=430, RDW=426. Which bytes of the file did you use?

I apologize, I made an error in uploading the file. The EBCDIC file with BDW and RDW headers is at https://raw.githubusercontent.com/jaysara/spark-cobol-jay/main/data/bdw-rdw-sample-ebcdic.dat
The ASCII equivalent of this file is at https://raw.githubusercontent.com/jaysara/spark-cobol-jay/main/data/bdw-rdw-sample.txt

Here are the read options that I use:

Dataset<Row> df1 =  spark.read()
                .format("za.co.absa.cobrix.spark.cobol.source")
                .option("copybook_contents", copybook)
                .option("encoding", "ebcdic")
                .option("record_format", "VB") // Variable length records
                .option("is_rdw_big_endian", "false")
                .option("is_bdw_big_endian", "false")
                .option("schema_retention_policy", "collapse_root")
                .option("bdw_adjustment", -4)
                .option("rdw_adjustment", -4)
                .load("/Users/jsaraiy/Sandbox/spark-cobol-jay/data/ebcdic-bdw-rdw.dat");

I get the following error:

Caused by: java.lang.IllegalStateException: The length of BDW block is too big. Got 1961947628. Header: 240,241,240,244, offset: 0.
	at za.co.absa.cobrix.cobol.reader.recordheader.RecordHeaderDecoderCommon.reportTooLargeBlockLength(RecordHeaderDecoderCommon.scala:53)
	at za.co.absa.cobrix.cobol.reader.recordheader.RecordHeaderDecoderBdw.validateBlockLength(RecordHeaderDecoderBdw.scala:86)

If I change record_format from VB to V:
.option("record_format", "V")
the program runs without error; however, it does not parse out the segments correctly. It all comes out as one row, like below:
+--------------------+
| SEGMENT|
+--------------------+
|0100HEADER 3 NOT ...|
+--------------------+

Hi,
The corrected files also have neither BDW nor RDW headers. BDW and RDW headers are binary fields, while your file contains only text fields.

More on BDW headers: https://www.ibm.com/docs/en/zos/2.1.0?topic=records-block-descriptor-word-bdw
More on RDW headers: https://www.ibm.com/docs/en/zos/2.1.0?topic=records-record-descriptor-word-rdw

If the file has variable length records, these are the options available:

  • record_format = V if the file has RDW headers
  • record_format = VB if the file contains both BDW and RDW headers
  • record_format = D if the file is ASCII and records are separated by line-ending characters
  • record_format = V together with record_length_field if the record length can be derived from a field in the copybook via an arithmetic expression (see the sketch after this list)
  • A custom record extractor can be used if the logic of determining the record length for each record is custom and nothing from the above works (https://github.com/AbsaOSS/cobrix?tab=readme-ov-file#custom-record-extractors)
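
As an illustration of the record_length_field variant from the list above, here is a minimal PySpark sketch (the field name RECORD-LEN and the paths are hypothetical placeholders):

df = (
    spark.read.format("cobol")
    .option("copybook", "/path/to/copybook.cpy")
    .option("encoding", "ebcdic")
    .option("record_format", "V")
    # RECORD-LEN is a hypothetical copybook field; an arithmetic expression
    # such as "RECORD-LEN + 4" can also be supplied, as described above.
    .option("record_length_field", "RECORD-LEN")
    .load("/path/to/data.dat")
)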

In my experience, the team that handles copying data from the mainframe can quite often adjust the conversion options to include RDW headers. This is the most reliable way of getting the data across as accurately as possible.