AbsaOSS/cobrix

data in string format apart from native target datatype format

saikumare-a opened this issue · 14 comments

Background

Cobrix converts the data to native types (decimal, integer, etc.) based on the copybook information.

Feature

It would be helpful to have an option that just splits each record into columns and keeps them in string format (as is, without any trimming) instead of converting them to native types. This would provide the following benefits:

  1. If there is a discrepancy between the data and the copybook, having all columns as strings would help in debugging issues.
  2. It would let downstream applications run data quality checks and report issues (currently invalid data becomes null in Spark).

Proposed Solution [Optional]

Solution Ideas

  1. One approach could be to handle this with option("debug","original").

Nice idea, but it can only work for fields with 'DISPLAY' usage, and it also depends on the encoding (ASCII/EBCDIC).
Binary, BCD, and floating point numbers contain bytes that can't be converted to characters.
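
To illustrate the point, here is a minimal Python sketch (not Cobrix code). It assumes code page `cp037` as the EBCDIC variant and uses a hypothetical helper, `unpack_comp3`, to decode a packed decimal (COMP-3) value. DISPLAY digits become readable text only when the right encoding is applied, while the packed bytes carry a number that no character decoding can reveal.

```python
# DISPLAY digits "123" as stored in an EBCDIC file and in an ASCII file.
ebcdic_display = bytes([0xF1, 0xF2, 0xF3])
ascii_display = b"123"

# The text only comes out right when the matching code page is used
# (cp037 is an assumption; real files may use another EBCDIC code page).
ebcdic_text = ebcdic_display.decode("cp037")
ascii_text = ascii_display.decode("ascii")

# COMP-3 (packed decimal) +12345: two digits per byte,
# the low nibble of the last byte is the sign (0xD means negative).
packed = bytes([0x12, 0x34, 0x5C])


def unpack_comp3(data: bytes) -> int:
    """Hypothetical helper: decode a packed decimal field into an integer."""
    digits = []
    for b in data[:-1]:
        digits += [b >> 4, b & 0x0F]
    digits.append(data[-1] >> 4)
    sign = -1 if (data[-1] & 0x0F) == 0x0D else 1
    return sign * int("".join(map(str, digits)))


packed_value = unpack_comp3(packed)

# Treating the same bytes as characters gives a control character, '4' and '\'.
packed_as_text = packed.decode("ascii")
```

So a "string" view is meaningful for the first two fields, but for the packed field it would show unrelated characters instead of 12345.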

.option("debug", "true") aka .option("debug", "hex") works well for investigating copybook discrepancy issues, and can be used for quality control, e.g. expecting nulls to be only for '0x00 0x00...' byte stream.
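
A quality check of that kind could be sketched as below. This is plain Python over already collected rows, not Cobrix API. It assumes each row is a pair of the decoded value and its hex debug string, and that the hex string has no `0x` prefixes (for example `F1C1F3`). The function name `suspicious_nulls` is hypothetical.

```python
def suspicious_nulls(rows):
    """Return rows where the decoded value is None but the raw bytes are not all 0x00."""
    flagged = []
    for value, debug_hex in rows:
        raw = bytes.fromhex(debug_hex)
        # any(raw) is True as soon as one byte is non-zero.
        if value is None and any(raw):
            flagged.append((value, debug_hex))
    return flagged


# Sample data (made up for illustration): decoded value plus its hex debug string.
sample_rows = [
    (123, "F1F2F3"),   # decoded fine
    (None, "000000"),  # null backed by zero bytes: expected
    (None, "F1C1F3"),  # null backed by real data: worth investigating
]
flagged_rows = suspicious_nulls(sample_rows)
```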

.option("debug", "raw") helps preserve the original data, which you can convert to a string yourself if you want.

Can you give a concrete example (field, its PIC, and value) that would help debugging it as a string?

After thinking about it, the above feature makes sense for ASCII files, but not for EBCDIC.
I see how it could be helpful for ASCII.

Thanks for the reply. As you rightly said, this would be very useful for ASCII files.

Please share your thoughts on adding this feature (plan, timeline, etc.). Thank you for the support.

It is hard to say for certain. Maybe end of this year, or Jan next year.

This is done and available in the latest 'master'

Hi @yruslan ,

Thanks a lot. I am from the Python world and have no idea how to create the jar file. Could you help with the steps to build it, or attach the jar file to this issue, so that I can test and let you know?

Sure. Which Spark and Scala version are you using?

Using Spark 3.1.2, Scala 2.12

Currently using the following Cobrix version:

groupId: za.co.absa.cobrix
artifactId: spark-cobol_2.12
version: 2.6.1

Here, you can try this one:
spark-cobol-assembly-2.6.2-SNAPSHOT.zip

Awesome, validated and working as expected. Thanks for the quick turnaround with this enhancement

Hi @yruslan,

With option("debug","string"), we see the string data in the <col_name>_debug fields. How about showing this string data in the actual fields instead of the <col_name>_debug fields? That would put the actual data in the actual columns, and downstream code could take care of the next steps.

One option is to handle this after Cobrix with custom code,
but handling it in Cobrix might help other Cobrix users.
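
For reference, the post-Cobrix workaround could look roughly like the sketch below. It only works on column names and assumes a flat schema where every debug column is named `<col_name>_debug`. The function `promote_debug_columns` is hypothetical and not part of Cobrix.

```python
def promote_debug_columns(columns, suffix="_debug"):
    """Return (source, alias) pairs that replace each field with its debug twin."""
    names = set(columns)
    pairs = []
    for col in columns:
        if col.endswith(suffix) and col[: -len(suffix)] in names:
            # The debug column is emitted under its base name instead.
            continue
        debug = col + suffix
        pairs.append((debug if debug in names else col, col))
    return pairs


# Column names as they might appear after reading with the debug option enabled.
example_columns = ["ID", "ID_debug", "NAME", "NAME_debug", "RECORD_ID"]
example_pairs = promote_debug_columns(example_columns)

# With PySpark the pairs could then be applied like this (not executed here):
#   from pyspark.sql import functions as F
#   df.select([F.col(src).alias(dst) for src, dst in promote_debug_columns(df.columns)])
```

Nested groups would need a recursive variant, which is one reason built-in support would be more convenient.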

So basically what you need is to slice ASCII records based on field lengths from a copybook, with all columns as strings, right?

I think in ASCII files you can only have numbers with usage DISPLAY. So if numbers could be retained as strings, it could help you, right?
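
The slicing described above can be sketched in a few lines of Python. This is an illustration of the idea, not how Cobrix implements it. The layout, field names and record are made up, and in practice the lengths would come from the PIC clauses of the copybook.

```python
def slice_record(record: str, layout):
    """Split a fixed-width ASCII record into string fields, without trimming."""
    fields = {}
    pos = 0
    for name, length in layout:
        fields[name] = record[pos:pos + length]
        pos += length
    return fields


# Hypothetical layout: (field name, length in characters).
layout = [("ID", 5), ("NAME", 10), ("AMOUNT", 7)]

# AMOUNT contains a stray 'A', so a numeric conversion would turn it into null.
record = "00042JOHN DOE  0012A45"
sliced = slice_record(record, layout)
```

Keeping `AMOUNT` as the string `0012A45` is what lets a downstream data quality check report the bad value instead of just seeing a null.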

Here is another feature request related to this: #25

Yes, correct.

Is #25 already available in Cobrix? If yes, please add it to the documentation, as I don't see it there.

No, it is not implemented yet, but it is in the plans to implement it in the future.