Null vs Blank Issue
ilaria-ia opened this issue · 35 comments
I'm using the Cobrix library to read an EBCDIC binary file, obtained as a table extraction from DB2. The output of the read operation is written as a CSV file on HDFS.
To parse the binary file I'm using a copybook file stored on HDFS.
Here is the spark reading command:
spark.read.
  format(sourceFileFormat).
  option("copybook", copybookStagingHDFSPath).
  option("schema_retention_policy", "collapse_root").
  option("string_trimming_policy", "none").
  load(sourceFileStagingHDFSPath + file)
The resulting CSV output does not distinguish between BLANK ("") and NULL cells: both are written as empty strings.
Is there a way to treat BLANK ("") values and NULL values differently, so that the output matches the data in the DB2 source database?
Hi,
Cobrix does distinguish between nulls and empty values.
It's up to the CSV writer to write different values for nulls and empty values.
You can achieve this by specifying a value that should be written to a CSV field when the input value is null.
df.write
.option("nullValue", "NULL")
.csv("/some/path")
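For completeness, a fuller sketch (assuming Spark 2.4+, where the CSV writer also supports an emptyValue option), so that nulls and genuine empty strings come out as different tokens:
df.write
  .option("nullValue", "NULL")   // written for null cells
  .option("emptyValue", "\"\"")  // written for empty-string cells
  .csv("/some/path")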
Hi,
Thanks for the reply. I tested the write option you recommended, but the problem did not go away. Running a show on the dataframe created by the read operation and comparing it with the source table on DB2, I verified that cells that are null at the source are read as blanks into the dataframe. So the error is in reading, not writing. In fact, since no null cells are created in the dataframe, the option("nullValue", "NULL") setting does not generate null values in the CSV written to HDFS.
For the sake of completeness, I report an example to help understand the difference between db2 and post-reading dataframe.
Examble_DB2vsDataframePostRead.xlsx
This is read operation:
val sourceFileDF: Try[DataFrame] = Try {
  spark.read.
    format(sourceFileFormat).
    option("copybook", copybookStagingHDFSPath).
    option("schema_retention_policy", "collapse_root").
    option("string_trimming_policy", "left").
    option("inferSchema", false).
    load(sourceFileStagingHDFSPath + file)
}
Is it possible that I need to add special options to fix the problem?
I see. What are the data types of the problematic columns? What do their PICs look like in the copybook?
I have a suspicion that it might be the limitation of the mainframe format itself. It might be that empty strings are encoded exactly the same as nulls on the export from DB2. If this is the case, there is nothing we can do.
Data types and PICs of the columns are:
campo1 PIC X(20). CHAR(20) FOR SBCS DATA WITH DEFAULT NULL
FILLER PIC X(1).
campo2 PIC X(4). CHAR(4) FOR SBCS DATA WITH DEFAULT NULL
FILLER PIC X(1).
campo3 PIC X(102). VARCHAR(100) FOR SBCS DATA WITH DEFAULT NULL
FILLER PIC X(1).
campo4 PIC X(8). CHAR(8) FOR SBCS DATA WITH DEFAULT NULL
FILLER PIC X(1).
campo5 PIC X(2). CHAR(2) FOR SBCS DATA WITH DEFAULT NULL
FILLER PIC X(1).
campo6 PIC X(3). CHAR(3) FOR SBCS DATA WITH DEFAULT NULL
FILLER PIC X(1).
campo7 PIC X(1). CHAR(1) FOR SBCS DATA WITH DEFAULT NULL
FILLER PIC X(1).
campo8 PIC X(10). DATE WITH DEFAULT NULL
FILLER PIC X(1).
campo9 PIC X(20). CHAR(20) FOR SBCS DATA WITH DEFAULT NULL
FILLER PIC X(1).
campo10 PIC X(102). VARCHAR(100) FOR SBCS DATA WITH DEFAULT NULL
FILLER PIC X(1).
That's what I suspected. Fields that are encoded as 'PIC X' don't support the notion of NULLs. So when you export a table from DB2 to a mainframe file, both empty strings and nulls are encoded as "40404040" (40 is the space character in EBCDIC). If the encoded values are different, it might be possible to differentiate them.
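To illustrate (a minimal sketch; assumes the JVM's IBM037 EBCDIC charset is available): decoding four 0x40 bytes yields four spaces, so an exported NULL and an exported empty CHAR(4) can be byte-for-byte identical:
// Four EBCDIC 0x40 bytes decode to four spaces.
val bytes = Array.fill(4)(0x40.toByte)
println("'" + new String(bytes, java.nio.charset.Charset.forName("IBM037")) + "'")  // prints '    '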
Could I ask you to run with .option("debug", "true") to see how empty strings and nulls are encoded?
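For reference, a minimal sketch of such a run (paths are placeholders); with this option Cobrix adds a _debug column holding the raw HEX of each field, as shown further down in this thread:
val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook")  // placeholder path
  .option("debug", "true")                  // adds FIELDn_debug columns with raw HEX
  .load("/path/to/data")                    // placeholder path
df.show(false)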
Is this option valid for spark-cobol version 0.2.5? I have Spark 2.2.0.
No, this option has been introduced quite recently. But you can try spark-cobol 2.2.0 with Spark 2.2.0.
Unfortunately, with version 2.2.0 I always have this error which I cannot resolve:
Exception in thread "main" java.lang.NoClassDefFoundError: org/antlr/v4/runtime/CharStreams
This might be easy to solve. :)
Get this jar: antlr4-runtime-4.7.2.jar and add it to spark-submit / spark-shell with --jars
I did this, but the error is:
INFO source.DefaultSource: Cobrix 'spark-cobol' build 2.2.0 (2020-12-30T10:25:41)
ANTLR Tool version 4.7.2 used for code generation does not match the current runtime version 4.5.3
ANTLR Runtime version 4.7.2 used for parser compilation does not match the current runtime version 4.5.3
Exception in thread "main" java.lang.ExceptionInInitializerError
It's strange. What's your command line?
I checked. It seems older Spark versions used a different ANTLR and provide it at runtime.
You can solve this by compiling the latest Cobrix from sources, replacing the Spark version.
But that won't solve the nulls issue. Alternatively, you can use a HEX editor to look into the data file and check whether nulls are encoded differently from empty strings.
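If a HEX editor is not at hand, a small sketch like this (hypothetical local path; substitute the record size your copybook implies) dumps the first record so the null and blank encodings can be compared byte by byte:
import java.nio.file.{Files, Paths}

val recordLength = 594  // hypothetical; use the size your copybook implies
val bytes = Files.readAllBytes(Paths.get("/local/path/TABLE.BIN"))  // hypothetical path
bytes.take(recordLength)
  .grouped(16)
  .foreach(row => println(row.map(b => f"${b & 0xFF}%02X").mkString(" ")))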
If I replace FILLER with a copybook in this format:
FIELD       PIC X(20).
FIELD-NULL  REDEFINES FIELD PIC X(20).
do I have to set a particular option?
No additional options are necessary here. But what are you trying to achieve?
I created the copybook file starting from the DDL and LOAD files, which indicate the data type, the number of occupied positions, and the NULLIF position for each field. I translated NULLIF as FILLER in the copybook, and I would like to try using REDEFINES instead of FILLER, to understand whether the problem is the copybook file being used.
Hi,
I tried to use the attached copybook file. The Cobrix library gives me the following error:
scala.MatchError: 5 (of class java.lang.Integer)
at za.co.absa.cobrix.cobol.parser.encoding.EBCDIC.codec(EBCDIC.scala:34)
at za.co.absa.cobrix.cobol.parser.ast.Primitive.getBinarySizeBits(Primitive.scala:90)
at za.co.absa.cobrix.cobol.parser.CopybookParser$.za$co$absa$cobrix$cobol$parser$CopybookParser$$calculatePrimitiveSize$1(CopybookParser.scala:203)
at za.co.absa.cobrix.cobol.parser.CopybookParser$$anonfun$calculateSchemaSizes$1.apply$mcVI$sp(CopybookParser.scala:233)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at za.co.absa.cobrix.cobol.parser.CopybookParser$.calculateSchemaSizes(CopybookParser.scala:214)
at za.co.absa.cobrix.cobol.parser.CopybookParser$.za$co$absa$cobrix$cobol$parser$CopybookParser$$calculateGroupSize$1(CopybookParser.scala:195)
at za.co.absa.cobrix.cobol.parser.CopybookParser$$anonfun$calculateSchemaSizes$1.apply$mcVI$sp(CopybookParser.scala:232)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at za.co.absa.cobrix.cobol.parser.CopybookParser$.calculateSchemaSizes(CopybookParser.scala:214)
at za.co.absa.cobrix.cobol.parser.CopybookParser$.za$co$absa$cobrix$cobol$parser$CopybookParser$$calculateGroupSize$1(CopybookParser.scala:195)
at za.co.absa.cobrix.cobol.parser.CopybookParser$$anonfun$calculateSchemaSizes$1.apply$mcVI$sp(CopybookParser.scala:232)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at za.co.absa.cobrix.cobol.parser.CopybookParser$.calculateSchemaSizes(CopybookParser.scala:214)
at za.co.absa.cobrix.cobol.parser.CopybookParser$.calculateBinaryProperties(CopybookParser.scala:181)
at za.co.absa.cobrix.cobol.parser.CopybookParser$.parseTree(CopybookParser.scala:153)
at za.co.absa.cobrix.spark.cobol.reader.fixedlen.FixedLenNestedReader.loadCopyBook(FixedLenNestedReader.scala:74)
at za.co.absa.cobrix.spark.cobol.reader.fixedlen.FixedLenNestedReader.&lt;init&gt;(FixedLenNestedReader.scala:41)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createFixedLengthReader(DefaultSource.scala:82)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.buildEitherReader(DefaultSource.scala:69)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:50)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:45)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:307)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
... 48 elided
Should I enter additional options for this copybook?
Hi, sorry for the late reply. What version of Cobrix are you using, and what exact options are you passing to Spark?
This is a very strange error.
These are dependencies:
<dependency>
<groupId>za.co.absa.cobrix</groupId>
<artifactId>spark-cobol</artifactId>
<version>0.2.5</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.scodec/scodec-core -->
<dependency>
<groupId>org.scodec</groupId>
<artifactId>scodec-core_2.11</artifactId>
<version>1.10.3</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.scodec/scodec-bits -->
<dependency>
<groupId>org.scodec</groupId>
<artifactId>scodec-bits_2.11</artifactId>
<version>1.1.4</version>
</dependency>
<!-- https://mvnrepository.com/artifact/za.co.absa.cobrix/cobol-parser -->
<dependency>
<groupId>za.co.absa.cobrix</groupId>
<artifactId>cobol-parser</artifactId>
<version>0.2.5</version>
</dependency>
Spark read command is:
spark.read.
  format("za.co.absa.cobrix.spark.cobol.source").
  option("copybook", "file:///.../file.copy").
  option("schema_retention_policy", "collapse_root").
  load("/path/file_20210119140000")
Before I take a look, why do you use 0.2.5? The latest version that does not use the ANTLR runtime is 0.5.6. Please try 0.5.6 first.
Also, you don't need to include the scodec and cobol-parser dependencies. They are transitive and will be added automatically.
Thanks a lot for the tip. I changed the version and the error changed. The error is now:
There are some files in hdfs://path/fileBin that are NOT DIVISIBLE by the RECORD SIZE calculated from the copybook (594 bytes per record).
I used this command:
spark.read.
  format("za.co.absa.cobrix.spark.cobol.source").
  option("copybook", copybookStagingHDFSPath).
  option("schema_retention_policy", "collapse_root").
  load(sourceFileStagingHDFSPath + file)
The copybook file I use is the official one for the DB2 table, so it is strange that it does not match the binary file.
Do you have any suggestions?
You can try
.option("debug_ignore_file_size", "true")
to suppress the check so you can fetch the first several records (e.g. df.show()). But it won't work when reading the full file.
It's possible that tools that transfer mainframe files to a PC add additional headers to records (an RDW, for instance). So without looking at the actual file it is hard to tell why the data doesn't match the copybook. Maybe the above workaround will give you a clue. It's strange that you didn't have the same error message when loading via 0.2.5.
In addition, you can compare the layout of the fields with the copybook or a similar layout produced on the mainframe to ensure they match.
Also, you can validate that the record size calculated by Cobrix (594 bytes) is correct by opening the file in a HEX editor and checking the data there.
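A quick way to double-check the divisibility without a HEX editor (a sketch; the path is hypothetical):
// If the remainder is non-zero, the copybook's record size does not match the file.
val fileSize = new java.io.File("/local/path/TABLE.BIN").length()
println(s"file size = $fileSize, remainder for 594-byte records = ${fileSize % 594}")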
From the hex editor the size is 626. In fact, with a different copybook (using fillers instead of redefines) I read the binary file correctly, and the size calculated by Cobrix is 626.
I thought I could solve my initial problem (blank vs null) by using another copybook format, i.e. with REDEFINES, but this only made the binary file unreadable.
We might include a shaded version of the ANTLR runtime in a future version of Cobrix so people can use the latest spark-cobol with older Spark versions.
But back to the original issue. Can you tell me if nulls are encoded differently than blanks? Maybe you can send a record (or a HEX dump screenshot) and I'll take a look. I need a HEX dump of a record and a description of which fields should be null and which should be blank.
Hi, this is a binary file with two records that contain blanks and nulls:
TABLE.zip
This file contains the data types of the columns in DB2:
ddl_table.txt
This is the copybook file with FILLER:
copybook_filler_table.COPY.txt
and this is the copybook file with REDEFINES:
table.txt
Thanks for the support.
Great. I've tried to load it using the latest Cobrix:
+------+------------------+------------------+------+------+------+------+------+------+-------+-------+--------------------------------------------------------------------------------------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|FIELD1|FIELD2 |FIELD3 |FIELD4|FIELD5|FIELD6|FIELD7|FIELD8|FIELD9|FIELD10|FIELD11|FIELD12 |FIELD13|FIELD14|FIELD15|FIELD16|FIELD17|FIELD18|FIELD19|FIELD20|FIELD21|FIELD22|FIELD23|FIELD24|FIELD25|FIELD26|FIELD27|
+------+------------------+------------------+------+------+------+------+------+------+-------+-------+--------------------------------------------------------------------------------------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|51 |200811251121143139|201111181005012870|null | |? |TEST |null | |? |? |XXX YYY ? ? ?25.11.2008 ? K ZZZ a ABCDEFG a & S| | |? |null |null |null |null |null |null |null |null |null |null |null |null |
|51 |201111181005012870|999912312359599999|null | |? |TEST |null | |? | |XXX YYY ? ? ?25.11.2008 ? K ZZZ a & * ABCDEFG a & * S| | |? |null |null |null |null |null |null |null |null |null |null |null |null |
+------+------------------+------------------+------+------+------+------+------+------+-------+-------+--------------------------------------------------------------------------------------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
Is this the expected output or are there unexpected blanks?
I used the following options:
val df = spark
.read
.format("cobol")
.option("copybook", "copybook_filler_table.COPY.txt")
.option("schema_retention_policy", "collapse_root")
.option("record_length", "395")
.load("TABLE.HELP")
Here is the same with debugging:
+------+------------+------------------+--------------------+------------------+--------------------+------+------------+------+------------+------+----------------------------------------+------+------------+------+------------+------+------------+-------+-------------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+----------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+
|FIELD1|FIELD1_debug|FIELD2 |FIELD2_debug |FIELD3 |FIELD3_debug |FIELD4|FIELD4_debug|FIELD5|FIELD5_debug|FIELD6|FIELD6_debug |FIELD7|FIELD7_debug|FIELD8|FIELD8_debug|FIELD9|FIELD9_debug|FIELD10|FIELD10_debug|FIELD11|FIELD11_debug |FIELD12 |FIELD12_debug |FIELD13|FIELD13_debug |FIELD14|FIELD14_debug |FIELD15|FIELD15_debug|FIELD16|FIELD16_debug|FIELD17|FIELD17_debug|FIELD18|FIELD18_debug|FIELD19|FIELD19_debug|FIELD20|FIELD20_debug|FIELD21|FIELD21_debug|FIELD22|FIELD22_debug|FIELD23|FIELD23_debug|FIELD24|FIELD24_debug|FIELD25|FIELD25_debug|FIELD26|FIELD26_debug|FIELD27|FIELD27_debug|
+------+------------+------------------+--------------------+------------------+--------------------+------+------------+------+------------+------+----------------------------------------+------+------------+------+------------+------+------------+-------+-------------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+----------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+
|51 |000000051C |200811251121143139|0200811251121143139C|201111181005012870|0201111181005012870C|null |00000000F5 | |40400000 |? |0000000000000000000000000000000000006F00|TEST |E3C5E2E3 |null |000CF0 | |40404000 |? |0000006F |? |00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000006F|XXX YYY ? ? ?25.11.2008 ? K ZZZ a ABCDEFG a & S|03E7E7E70003E8E8E840404040404040400000006F0000006F00000C00006FF2F54BF1F14BF2F0F0F800000000000000000000006F000004D20003E9E9E90200811251121143690C0007C1C2C3C4C5C6C70201111181005013072C00E2404040404040404040| |404040404040404040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000| |0000000000000000|? |6F |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |
|51 |000000051C |201111181005012870|0201111181005012870C|999912312359599999|0999912312359599999C|null |00000000F5 | |40400000 |? |0000000000000000000000000000000000006F00|TEST |E3C5E2E3 |null |000CF0 | |40404000 |? |0000006F | |000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000|XXX YYY ? ? ?25.11.2008 ? K ZZZ a & * ABCDEFG a & * S|03E7E7E70003E8E8E840404040404040400000006F0000006F00000C00006FF2F54BF1F14BF2F0F0F800000000000000000000006F000004D20003E9E9E90201111181005013095C0007C1C2C3C4C5C6C70201111181005013095C00E2404040404040404040| |404040404040404040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000| |0000000000000000|? |6F |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |
+------+------------+------------------+--------------------+------------------+--------------------+------+------------+------+------------+------+----------------------------------------+------+------------+------+------------+------+------------+-------+-------------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+----------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+
Here is Cobrix 2.2.0 that embeds ANTLR and can be used with Spark 2.2.0 :)
cobrix_2_2_0_spark_2_2_0.jar.zip
Hi,
thanks for the jar.
The result is not correct. Here is the data on DB2 corresponding to the previously sent dump file.
How did you calculate the value of record_length?
I calculated the record size by dividing the file size by 2, since I knew the file has 2 records (hence record_length = 395).
Looking at your SQL statement, it seems your copybook doesn't match the data. For instance, 00 00 00 00 00 F5 is not a valid COMP-3 number. It is in DISPLAY format (or no 'COMP' usage).
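For context, a sketch of the difference (my own illustration, not Cobrix code): COMP-3 packs two decimal digits per byte with the sign in the last nibble (C = positive, D = negative, F = unsigned), while DISPLAY stores one EBCDIC digit per byte (F1 F2 F3 for 123):
def unpackComp3(bytes: Array[Byte]): BigInt = {
  val nibbles = bytes.flatMap(b => Seq((b >> 4) & 0x0F, b & 0x0F))
  val sign = if (nibbles.last == 0x0D) -1 else 1  // last nibble holds the sign
  sign * nibbles.dropRight(1).foldLeft(BigInt(0))((acc, d) => acc * 10 + d)
}
println(unpackComp3(Array(0x12.toByte, 0x3C.toByte)))  // prints 123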
After I changed the copybook to:
* 10 FILLER PIC X(1).
* 10 FIELD4 PIC S9(9) COMP-3.
10 FIELD4 PIC S9(6) DISPLAY.
(removed the FILLER and changed the type of FIELD4)
You need to go through your fields one by one and make sure the size and the type of each field is defined properly. Otherwise, there is not much Cobrix can do.
If you have difficulties with particular fields, let me know and I can help. But it's up to you to make the copybook match the data.
Hi,
The previously submitted dump was inconsistent with the submitted copybook. For this reason the resulting data was incorrect.
I was able to debug every single column of the dump file with the option debug=true, and I found that fields containing blank and null have different encodings (null --> 0, blank --> 40). But unfortunately, in the dataframe after reading I only see blanks.
The fields that contain blank and null are all of the PIC X() type in the copybook, with FILLER. You said in a previous post that there may be a way to tell them apart if they have different encodings.
Great, that's the difference we can use to distinguish between blanks and nulls.
We'll implement an option, say:
.option("improved_null_detection","true")
I can send you a snapshot version to test before releasing a new version.
Thank you very much. If you can send it, it can be useful to test my data.
Cool, will send as soon as it is ready
Here is the snapshot:
cobrix_2_2_0_spark_2_2_1-SNAPSHOT.zip
Use
.option("pedantic", "true")
.option("improved_null_detection", "true")
Use 'pedantic' to make Cobrix complain if there are redundant options.
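Putting it together, a sketch of the full read with the new option (reusing the path variables from earlier in this thread):
val df = spark.read
  .format("cobol")
  .option("copybook", copybookStagingHDFSPath)
  .option("schema_retention_policy", "collapse_root")
  .option("pedantic", "true")                 // fail on unrecognized options
  .option("improved_null_detection", "true")  // treat all-0x00 PIC X values as null
  .load(sourceFileStagingHDFSPath + file)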
Have you had a chance to check the new option?
Sorry, it seems the fix wasn't included in 2.2.1. It will be included in 2.2.3.