Null vs Blank Issue
ilaria-ia opened this issue · 35 comments
I'm using the Cobrix library to read an EBCDIC binary file, obtained as a table extraction from DB2. The output of the read operation is written as a CSV file on HDFS.
To parse the binary file I'm using a copybook file stored on HDFS.
Here is the spark reading command:
spark.read.
  format(sourceFileFormat).
  option("copybook", copybookStagingHDFSPath).
  option("schema_retention_policy", "collapse_root").
  option("string_trimming_policy", "none").
  load(sourceFileStagingHDFSPath + file)
The resulting CSV output does not distinguish between BLANK ("") and NULL cells: both are written as empty strings.
Is there a way to treat BLANK ("") values and NULL values differently, so that the output matches the data in the DB2 source database?
Hi,
Cobrix does distinguish between nulls and empty values.
It's up to the CSV writer to write different values for nulls and empty values.
You can achieve this by specifying a value that should be written to a CSV field when the input value is null.
df.write
.option("nullValue", "NULL")
.csv("/some/path")
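For completeness, a fuller sketch (assuming Spark 2.4+, where the CSV writer also supports an emptyValue option), so that nulls and genuine empty strings come out as different tokens:
df.write
  .option("nullValue", "NULL")   // written for null cells
  .option("emptyValue", "\"\"")  // written for empty-string cells
  .csv("/some/path")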
Hi,
Thanks for the reply. I tested the write option you recommended, but the problem did not go away. Running a show on the dataframe created by the read operation and comparing it with the source table on DB2, I verified that cells that are null at the source are read as blanks into the dataframe. So the error is in reading, not writing. In fact, since no null cells are created in the dataframe, the option("nullValue", "NULL") setting does not generate null values in the CSV written to HDFS.
For the sake of completeness, I report an example to help understand the difference between db2 and post-reading dataframe.
Examble_DB2vsDataframePostRead.xlsx
This is read operation:
val sourceFileDF: Try[DataFrame] = Try {
  spark.read.
    format(sourceFileFormat).
    option("copybook", copybookStagingHDFSPath).
    option("schema_retention_policy", "collapse_root").
    option("string_trimming_policy", "left").
    option("inferSchema", false).
    load(sourceFileStagingHDFSPath + file)
}
Is it possible that I need to add special options to fix the problem?
I see. What are the data types of the problematic columns? What do their PICs look like in the copybook?
I have a suspicion that it might be the limitation of the mainframe format itself. It might be that empty strings are encoded exactly the same as nulls on the export from DB2. If this is the case, there is nothing we can do.
Data types and PICs of the columns are:
campo1 PIC X(20). CHAR(20) FOR SBCS DATA WITH DEFAULT NULL
FILLER PIC X(1).
campo2 PIC X(4). CHAR(4) FOR SBCS DATA WITH DEFAULT NULL
FILLER PIC X(1).
campo3 PIC X(102). VARCHAR(100) FOR SBCS DATA WITH DEFAULT NULL
FILLER PIC X(1).
campo4 PIC X(8). CHAR(8) FOR SBCS DATA WITH DEFAULT NULL
FILLER PIC X(1).
campo5 PIC X(2). CHAR(2) FOR SBCS DATA WITH DEFAULT NULL
FILLER PIC X(1).
campo6 PIC X(3). CHAR(3) FOR SBCS DATA WITH DEFAULT NULL
FILLER PIC X(1).
campo7 PIC X(1). CHAR(1) FOR SBCS DATA WITH DEFAULT NULL
FILLER PIC X(1).
campo8 PIC X(10). DATE WITH DEFAULT NULL
FILLER PIC X(1).
campo9 PIC X(20). CHAR(20) FOR SBCS DATA WITH DEFAULT NULL
FILLER PIC X(1).
campo10 PIC X(102). VARCHAR(100) FOR SBCS DATA WITH DEFAULT NULL
FILLER PIC X(1).
That's what I suspected. Fields that are encoded as 'PIC X' don't support the notion of NULLs. So when you export a table from DB2 to a mainframe file, both empty strings and nulls are encoded as "40404040" (40 is the space character in EBCDIC). If the encoded values are different, it might be possible to differentiate them.
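To illustrate (a minimal sketch; assumes the JVM's IBM037 EBCDIC charset is available): decoding four 0x40 bytes yields four spaces, so an exported NULL and an exported empty CHAR(4) can be byte-for-byte identical:
// Four EBCDIC 0x40 bytes decode to four spaces.
val bytes = Array.fill(4)(0x40.toByte)
println("'" + new String(bytes, java.nio.charset.Charset.forName("IBM037")) + "'")  // prints '    '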
Could I ask you to run with .option("debug", "true") to see how empty strings and nulls are encoded?
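For reference, a minimal sketch of such a run (paths are placeholders); with this option Cobrix adds a _debug column holding the raw HEX of each field, as shown further down in this thread:
val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook")  // placeholder path
  .option("debug", "true")                  // adds FIELDn_debug columns with raw HEX
  .load("/path/to/data")                    // placeholder path
df.show(false)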
Is this option valid for spark-cobol version 0.2.5? I have Spark 2.2.0.
No, this option has been introduced quite recently. But you can try spark-cobol 2.2.0 with Spark 2.2.0.
Unfortunately, with version 2.2.0 I always have this error which I cannot resolve:
Exception in thread "main" java.lang.NoClassDefFoundError: org/antlr/v4/runtime/CharStreams
This might be easy to solve. :)
Get this jar: antlr4-runtime-4.7.2.jar and add it to spark-submit / spark-shell with --jars
I did this, but the error is:
INFO source.DefaultSource: Cobrix 'spark-cobol' build 2.2.0 (2020-12-30T10:25:41)
ANTLR Tool version 4.7.2 used for code generation does not match the current runtime version 4.5.3
ANTLR Runtime version 4.7.2 used for parser compilation does not match the current runtime version 4.5.3
Exception in thread "main" java.lang.ExceptionInInitializerError
It's strange. What's your command line?
I checked. It seems older Spark versions used a different ANTLR and provide it at runtime.
You can solve this by compiling the latest Cobrix from sources, replacing the Spark version.
But that won't solve the nulls issue. Alternatively, you can use a HEX editor to look into the data file and check whether nulls are encoded differently from empty strings.
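If a HEX editor is not at hand, a small sketch like this (hypothetical local path; substitute the record size your copybook implies) dumps the first record so the null and blank encodings can be compared byte by byte:
import java.nio.file.{Files, Paths}

val recordLength = 594  // hypothetical; use the size your copybook implies
val bytes = Files.readAllBytes(Paths.get("/local/path/TABLE.BIN"))  // hypothetical path
bytes.take(recordLength)
  .grouped(16)
  .foreach(row => println(row.map(b => f"${b & 0xFF}%02X").mkString(" ")))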
If I replace FILLER with a copybook in this format:
FIELD       PIC X(20).
FIELD-NULL  REDEFINES FIELD PIC X(20).
do I have to set a particular option?
No additional options are necessary here. But what are you trying to achieve?
I created the copybook file starting from the DDL and LOAD files, which indicate the data type, the number of occupied positions, and the NULLIF position for each field. I translated NULLIF as FILLER in the copybook, and I would like to try using REDEFINES instead of FILLER, to understand whether the problem is the copybook file being used.
Hi,
I tried to use the attached copybook file. The Cobrix library gives me the following error:
scala.MatchError: 5 (of class java.lang.Integer)
at za.co.absa.cobrix.cobol.parser.encoding.EBCDIC.codec(EBCDIC.scala:34)
at za.co.absa.cobrix.cobol.parser.ast.Primitive.getBinarySizeBits(Primitive.scala:90)
at za.co.absa.cobrix.cobol.parser.CopybookParser$.za$co$absa$cobrix$cobol$parser$CopybookParser$$calculatePrimitiveSize$1(CopybookParser.scala:203)
at za.co.absa.cobrix.cobol.parser.CopybookParser$$anonfun$calculateSchemaSizes$1.apply$mcVI$sp(CopybookParser.scala:233)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at za.co.absa.cobrix.cobol.parser.CopybookParser$.calculateSchemaSizes(CopybookParser.scala:214)
at za.co.absa.cobrix.cobol.parser.CopybookParser$.za$co$absa$cobrix$cobol$parser$CopybookParser$$calculateGroupSize$1(CopybookParser.scala:195)
at za.co.absa.cobrix.cobol.parser.CopybookParser$$anonfun$calculateSchemaSizes$1.apply$mcVI$sp(CopybookParser.scala:232)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at za.co.absa.cobrix.cobol.parser.CopybookParser$.calculateSchemaSizes(CopybookParser.scala:214)
at za.co.absa.cobrix.cobol.parser.CopybookParser$.za$co$absa$cobrix$cobol$parser$CopybookParser$$calculateGroupSize$1(CopybookParser.scala:195)
at za.co.absa.cobrix.cobol.parser.CopybookParser$$anonfun$calculateSchemaSizes$1.apply$mcVI$sp(CopybookParser.scala:232)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at za.co.absa.cobrix.cobol.parser.CopybookParser$.calculateSchemaSizes(CopybookParser.scala:214)
at za.co.absa.cobrix.cobol.parser.CopybookParser$.calculateBinaryProperties(CopybookParser.scala:181)
at za.co.absa.cobrix.cobol.parser.CopybookParser$.parseTree(CopybookParser.scala:153)
at za.co.absa.cobrix.spark.cobol.reader.fixedlen.FixedLenNestedReader.loadCopyBook(FixedLenNestedReader.scala:74)
at za.co.absa.cobrix.spark.cobol.reader.fixedlen.FixedLenNestedReader.&lt;init&gt;(FixedLenNestedReader.scala:41)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createFixedLengthReader(DefaultSource.scala:82)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.buildEitherReader(DefaultSource.scala:69)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:50)
at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:45)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:307)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
... 48 elided
Should I enter additional options for this copybook?
Hi, sorry for the late reply. What version of Cobrix are you using, and what exact options are you passing to Spark?
This is a very strange error.
These are dependencies:
<dependency>
<groupId>za.co.absa.cobrix</groupId>
<artifactId>spark-cobol</artifactId>
<version>0.2.5</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.scodec/scodec-core -->
<dependency>
<groupId>org.scodec</groupId>
<artifactId>scodec-core_2.11</artifactId>
<version>1.10.3</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.scodec/scodec-bits -->
<dependency>
<groupId>org.scodec</groupId>
<artifactId>scodec-bits_2.11</artifactId>
<version>1.1.4</version>
</dependency>
<!-- https://mvnrepository.com/artifact/za.co.absa.cobrix/cobol-parser -->
<dependency>
<groupId>za.co.absa.cobrix</groupId>
<artifactId>cobol-parser</artifactId>
<version>0.2.5</version>
</dependency>
Spark read command is:
spark.read.
  format("za.co.absa.cobrix.spark.cobol.source").
  option("copybook", "file:///.../file.copy").
  option("schema_retention_policy", "collapse_root").
  load("/path/file_20210119140000")
Before I take a look, why do you use 0.2.5? The latest version that does not use the ANTLR runtime is 0.5.6. Please try 0.5.6 first.
Also, you don't need to include the scodec and cobol-parser dependencies. They are transitive and will be added automatically.
Thanks a lot for the tip. I changed the version and the error changed. The error is now:
There are some files in hdfs://path/fileBin that are NOT DIVISIBLE by the RECORD SIZE calculated from the copybook (594 bytes per record).
I used this command:
spark.read.
  format("za.co.absa.cobrix.spark.cobol.source").
  option("copybook", copybookStagingHDFSPath).
  option("schema_retention_policy", "collapse_root").
  load(sourceFileStagingHDFSPath + file)
The copybook file I use is the official one for the DB2 table, so it is strange that it does not match the binary file.
Do you have any suggestions?
You can try
.option("debug_ignore_file_size", "true")
to suppress the check so you can fetch the first several records (e.g. df.show()). But it won't work when reading the full file.
It's possible that tools that transfer mainframe files to a PC add additional headers to records (an RDW, for instance). So without looking at the actual file it is hard to tell why the data doesn't match the copybook. Maybe the above workaround will give you a clue. It's strange that you didn't have the same error message when loading via 0.2.5.
In addition, you can compare the layout of the fields with the copybook or a similar layout produced on the mainframe to ensure they match.
Also, you can validate that the record size calculated by Cobrix (594 bytes) is correct by opening the file in a HEX editor and checking the data there.
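A quick way to double-check the divisibility without a HEX editor (a sketch; the path is hypothetical):
// If the remainder is non-zero, the copybook's record size does not match the file.
val fileSize = new java.io.File("/local/path/TABLE.BIN").length()
println(s"file size = $fileSize, remainder for 594-byte records = ${fileSize % 594}")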
From the hex editor the size is 626. In fact, with a different copybook (using fillers instead of redefines) I read the binary file correctly, and the size calculated by Cobrix is 626.
I thought I could solve my initial problem (blank vs null) by using another copybook format, i.e. with REDEFINES, but this only made the binary file unreadable.
We might include a shaded version of the ANTLR runtime in a future version of Cobrix so people can use the latest spark-cobol with older Spark versions.
But back to the original issue. Can you tell me if nulls are encoded differently than blanks? Maybe you can send a record (or a HEX dump screenshot) and I'll take a look. I need a HEX dump of a record and a description of which fields should be null and which should be blank.
Hi, this is a binary file with two records that contain blanks and nulls:
TABLE.zip
This file contains the data types of the columns in DB2:
ddl_table.txt
This is the copybook file with FILLER:
copybook_filler_table.COPY.txt
and this is the copybook file with REDEFINES:
table.txt
Thanks for the support.
Great. I've tried to load it using the latest Cobrix:
+------+------------------+------------------+------+------+------+------+------+------+-------+-------+--------------------------------------------------------------------------------------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|FIELD1|FIELD2 |FIELD3 |FIELD4|FIELD5|FIELD6|FIELD7|FIELD8|FIELD9|FIELD10|FIELD11|FIELD12 |FIELD13|FIELD14|FIELD15|FIELD16|FIELD17|FIELD18|FIELD19|FIELD20|FIELD21|FIELD22|FIELD23|FIELD24|FIELD25|FIELD26|FIELD27|
+------+------------------+------------------+------+------+------+------+------+------+-------+-------+--------------------------------------------------------------------------------------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|51 |200811251121143139|201111181005012870|null | |? |TEST |null | |? |? |XXX YYY ? ? ?25.11.2008 ? K ZZZ a ABCDEFG a & S| | |? |null |null |null |null |null |null |null |null |null |null |null |null |
|51 |201111181005012870|999912312359599999|null | |? |TEST |null | |? | |XXX YYY ? ? ?25.11.2008 ? K ZZZ a & * ABCDEFG a & * S| | |? |null |null |null |null |null |null |null |null |null |null |null |null |
+------+------------------+------------------+------+------+------+------+------+------+-------+-------+--------------------------------------------------------------------------------------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
Is this the expected output or are there unexpected blanks?
I used the following options:
val df = spark
.read
.format("cobol")
.option("copybook", "copybook_filler_table.COPY.txt")
.option("schema_retention_policy", "collapse_root")
.option("record_length", "395")
.load("TABLE.HELP")
Here is the same with debugging:
+------+------------+------------------+--------------------+------------------+--------------------+------+------------+------+------------+------+----------------------------------------+------+------------+------+------------+------+------------+-------+-------------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+----------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+
|FIELD1|FIELD1_debug|FIELD2 |FIELD2_debug |FIELD3 |FIELD3_debug |FIELD4|FIELD4_debug|FIELD5|FIELD5_debug|FIELD6|FIELD6_debug |FIELD7|FIELD7_debug|FIELD8|FIELD8_debug|FIELD9|FIELD9_debug|FIELD10|FIELD10_debug|FIELD11|FIELD11_debug |FIELD12 |FIELD12_debug |FIELD13|FIELD13_debug |FIELD14|FIELD14_debug |FIELD15|FIELD15_debug|FIELD16|FIELD16_debug|FIELD17|FIELD17_debug|FIELD18|FIELD18_debug|FIELD19|FIELD19_debug|FIELD20|FIELD20_debug|FIELD21|FIELD21_debug|FIELD22|FIELD22_debug|FIELD23|FIELD23_debug|FIELD24|FIELD24_debug|FIELD25|FIELD25_debug|FIELD26|FIELD26_debug|FIELD27|FIELD27_debug|
+------+------------+------------------+--------------------+------------------+--------------------+------+------------+------+------------+------+----------------------------------------+------+------------+------+------------+------+------------+-------+-------------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+----------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+
|51 |000000051C |200811251121143139|0200811251121143139C|201111181005012870|0201111181005012870C|null |00000000F5 | |40400000 |? |0000000000000000000000000000000000006F00|TEST |E3C5E2E3 |null |000CF0 | |40404000 |? |0000006F |? |00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000006F|XXX YYY ? ? ?25.11.2008 ? K ZZZ a ABCDEFG a & S|03E7E7E70003E8E8E840404040404040400000006F0000006F00000C00006FF2F54BF1F14BF2F0F0F800000000000000000000006F000004D20003E9E9E90200811251121143690C0007C1C2C3C4C5C6C70201111181005013072C00E2404040404040404040| |404040404040404040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000| |0000000000000000|? |6F |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |
|51 |000000051C |201111181005012870|0201111181005012870C|999912312359599999|0999912312359599999C|null |00000000F5 | |40400000 |? |0000000000000000000000000000000000006F00|TEST |E3C5E2E3 |null |000CF0 | |40404000 |? |0000006F | |000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000|XXX YYY ? ? ?25.11.2008 ? K ZZZ a & * ABCDEFG a & * S|03E7E7E70003E8E8E840404040404040400000006F0000006F00000C00006FF2F54BF1F14BF2F0F0F800000000000000000000006F000004D20003E9E9E90201111181005013095C0007C1C2C3C4C5C6C70201111181005013095C00E2404040404040404040| |404040404040404040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000| |0000000000000000|? |6F |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |
+------+------------+------------------+--------------------+------------------+--------------------+------+------------+------+------------+------+----------------------------------------+------+------------+------+------------+------+------------+-------+-------------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+----------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+-------+-------------+
Here is Cobrix 2.2.0 that embeds ANTLR and can be used with Spark 2.2.0 :)
cobrix_2_2_0_spark_2_2_0.jar.zip
Hi,
thanks for the jar.
The result is not correct. Here is the data on DB2 corresponding to the previously sent dump file.
How did you calculate the value of record_length?
I calculated the record size by dividing the file size by 2, since I knew the file has 2 records (hence record_length = 395).
Looking at your SQL statement, it seems your copybook doesn't match the data. For instance, 00 00 00 00 00 F5 is not a valid COMP-3 number. It is in DISPLAY format (or no 'COMP' usage).
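For context, a sketch of the difference (my own illustration, not Cobrix code): COMP-3 packs two decimal digits per byte with the sign in the last nibble (C = positive, D = negative, F = unsigned), while DISPLAY stores one EBCDIC digit per byte (F1 F2 F3 for 123):
def unpackComp3(bytes: Array[Byte]): BigInt = {
  val nibbles = bytes.flatMap(b => Seq((b >> 4) & 0x0F, b & 0x0F))
  val sign = if (nibbles.last == 0x0D) -1 else 1  // last nibble holds the sign
  sign * nibbles.dropRight(1).foldLeft(BigInt(0))((acc, d) => acc * 10 + d)
}
println(unpackComp3(Array(0x12.toByte, 0x3C.toByte)))  // prints 123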
After I changed the copybook to:
* 10 FILLER PIC X(1).
* 10 FIELD4 PIC S9(9) COMP-3.
10 FIELD4 PIC S9(6) DISPLAY.
(removed the FILLER and changed the type of FIELD4)
You need to go through your fields one by one and make sure the size and the type of each field is defined properly. Otherwise, there is not much Cobrix can do.
If you have difficulties with particular fields, let me know and I can help. But it's up to you to make the copybook match the data.
Hi,
The previously submitted dump was inconsistent with the submitted copybook. For this reason the resulting data was incorrect.
I was able to debug every single column of the dump file with the option debug=true, and I found that fields containing blank and null have different encodings (null --> 0, blank --> 40). But unfortunately, in the dataframe after reading I only see blanks.
The fields that contain blank and null are all of the PIC X() type in the copybook, with FILLER. You said in a previous post that there may be a way to tell them apart if they have different encodings.
Great, that's the difference we can use to distinguish between blanks and nulls.
We'll implement an option, say:
.option("improved_null_detection","true")
I can send you a snapshot version to test before releasing a new version.
Thank you very much. If you can send it, it can be useful to test my data.
Cool, will send as soon as it is ready
Here is the snapshot:
cobrix_2_2_0_spark_2_2_1-SNAPSHOT.zip
Use
.option("pedantic", "true")
.option("improved_null_detection", "true")
Use 'pedantic' to make Cobrix complain if there are redundant options.
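Putting it together, a sketch of the full read with the new option (reusing the path variables from earlier in this thread):
val df = spark.read
  .format("cobol")
  .option("copybook", copybookStagingHDFSPath)
  .option("schema_retention_policy", "collapse_root")
  .option("pedantic", "true")                 // fail on unrecognized options
  .option("improved_null_detection", "true")  // treat all-0x00 PIC X values as null
  .load(sourceFileStagingHDFSPath + file)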
Have you had a chance to check the new option?
Sorry, it seems the fix wasn't included in 2.2.1. It will be included in 2.2.3.