data in string format apart from native target datatype format

Question

saikumare-a opened this issue 3 years ago · 14 comments

Background

Cobrix converts the data to native types (decimal, integer, etc.) based on the copybook information.

Feature

An option to simply split each record into columns and keep them as strings (as is, without any trimming), instead of converting them to native types, would be helpful and would provide the following benefits:

If there is a discrepancy between the data and the copybook, having all columns as strings would help with debugging.
It would let downstream applications perform data quality checks and report issues (currently, invalid data becomes null in Spark).

Proposed Solution [Optional]

Solution Ideas

One approach could be to handle this through an option such as option("debug","original").

Answer 1 · 2022-12-12T15:42:32.000Z

Nice idea, but it can only work for fields with 'DISPLAY' usage, and it is also encoding (ASCII/EBCDIC) dependent.
Binary, BCD, and floating-point numbers contain bytes that can't be converted to characters.
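
To illustrate the point about 'DISPLAY' versus BCD fields, here is a minimal Python sketch. It assumes the EBCDIC code page `cp037` (the code page of a real file may differ) and uses a hypothetical helper, `unpack_comp3`, that is not part of Cobrix. The same value, 12345, is readable text when stored as DISPLAY but not when stored as packed decimal (COMP-3).

```python
# EBCDIC bytes for a PIC 9(5) DISPLAY field holding 12345: one byte per digit.
display_bytes = bytes([0xF1, 0xF2, 0xF3, 0xF4, 0xF5])

# The same value as PIC S9(5) COMP-3 (packed decimal): two digits per byte,
# with the sign nibble (0xC = positive, 0xD = negative) in the last half-byte.
packed_bytes = bytes([0x12, 0x34, 0x5C])


def unpack_comp3(data: bytes) -> int:
    """Hypothetical helper: decode packed-decimal bytes into an integer."""
    digits = []
    for byte in data[:-1]:
        digits.append(byte >> 4)
        digits.append(byte & 0x0F)
    digits.append(data[-1] >> 4)          # last byte: one digit + sign nibble
    sign_nibble = data[-1] & 0x0F
    value = int("".join(str(d) for d in digits))
    return -value if sign_nibble == 0x0D else value


# Decoding as characters only makes sense for the DISPLAY field.
display_text = display_bytes.decode("cp037")   # readable digits
packed_text = packed_bytes.decode("cp037")     # control characters and symbols

print(repr(display_text))
print(repr(packed_text))
print(unpack_comp3(packed_bytes))
```

The packed bytes decode without an error, but the resulting characters carry no meaning, which is why a "string" view is only useful for DISPLAY fields.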

.option("debug", "true") aka .option("debug", "hex") works well for investigating copybook discrepancy issues, and can be used for quality control, e.g. expecting nulls to be only for '0x00 0x00...' byte stream.
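
The quality check mentioned above could be sketched in plain Python as follows. This assumes the debug column holds the raw bytes as a hexadecimal string and that the rows have already been collected as `(parsed_value, hex_dump)` pairs; the function names and sample values are hypothetical.

```python
def is_all_zero_bytes(hex_dump: str) -> bool:
    """True if the hex dump is non-empty and every byte is 0x00."""
    raw = bytes.fromhex(hex_dump)
    return len(raw) > 0 and all(b == 0 for b in raw)


def unexpected_nulls(rows):
    """Return hex dumps of fields parsed as null whose bytes are not all 0x00.

    rows: iterable of (parsed_value, hex_dump) pairs (an assumed shape).
    """
    return [
        hex_dump
        for value, hex_dump in rows
        if value is None and not is_all_zero_bytes(hex_dump)
    ]


# Hypothetical sample: a valid number, an expected null, and a suspicious null.
sample = [
    (12345, "F1F2F3F4F5"),
    (None, "0000000000"),
    (None, "F1F2C1F4F5"),
]

print(unexpected_nulls(sample))
```

Any hex dump returned by `unexpected_nulls` points at a field where the data and the copybook probably disagree.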

.option("debug", "raw") helps preserve the original data, which you can convert to strings if you want.

Can you give a concrete example (a field, its PIC, and a value) where having it as a string would help with debugging?

Answer 2 · 2022-12-13T08:15:29.000Z

After thinking about it, the above feature makes sense for ASCII files, but not for EBCDIC.
I see how it could be helpful for ASCII.

Answer 3 · 2022-12-13T08:45:24.000Z

Thanks for the reply. As you rightly said, this would be very useful for ASCII files.

Please share your thoughts on adding this feature (plan, timeline, etc.). Thank you for the support.

Answer 4 · 2022-12-15T15:56:50.000Z

It is hard to say for certain. Maybe end of this year, or Jan next year.

Answer 5 · 2022-12-19T14:39:55.000Z

This is done and available in the latest 'master'.

Answer 6 · 2022-12-19T15:57:51.000Z

Hi @yruslan ,

Thanks a lot. I come from the Python world and have no idea how to build the JAR file. Could you help with the steps to build it, or attach the JAR file to this issue, so that I can test it and let you know?

Answer 7 · 2022-12-19T16:01:51.000Z

Sure. Which Spark and Scala version are you using?

Answer 8 · 2022-12-19T16:09:14.000Z

Using Spark 3.1.2, Scala 2.12

Currently using the Cobrix version below:

groupId: za.co.absa.cobrix
artifactId: spark-cobol_2.12
version: 2.6.1

Answer 9 · 2022-12-20T10:09:52.000Z

Here, you can try this one:
spark-cobol-assembly-2.6.2-SNAPSHOT.zip

Answer 10 · 2022-12-20T10:45:15.000Z

Awesome, validated and working as expected. Thanks for the quick turnaround on this enhancement.

Answer 11 · 2022-12-20T10:54:37.000Z

Hi @yruslan,

With option("debug","string"), we see the string data in the <col_name>_debug fields. How about showing this string data in the actual fields instead of the <col_name>_debug fields? This would put the original data in the actual columns, and downstream code could take care of the next steps.

One option is to handle this after Cobrix with custom code,
but handling it in Cobrix itself might help other Cobrix users.
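
For reference, the custom post-processing route might look like the sketch below. It assumes a flat schema (no nested groups) and the <col_name>_debug naming described above; `debug_column_mapping` is a hypothetical helper, not a Cobrix function.

```python
def debug_column_mapping(columns, suffix="_debug"):
    """Map each output column name to the column that should supply its value.

    A column that has a matching '<name>_debug' companion is fed from that
    companion; columns without one are kept unchanged. Flat schemas only.
    """
    names = set(columns)
    mapping = {}
    for name in columns:
        if name.endswith(suffix) and name[: -len(suffix)] in names:
            continue  # a debug companion; it is consumed by its base column
        debug_name = name + suffix
        mapping[name] = debug_name if debug_name in names else name
    return mapping


# Hypothetical column list, as it might come from df.columns.
columns = ["ID", "ID_debug", "NAME", "NAME_debug", "SEGMENT"]
print(debug_column_mapping(columns))

# With PySpark the mapping could then be applied roughly like this:
#   from pyspark.sql import functions as F
#   mapping = debug_column_mapping(df.columns)
#   df_strings = df.select([F.col(src).alias(dst) for dst, src in mapping.items()])
```

The helper only works on column names, so it can be checked without a Spark session; nested structures would need a recursive walk over the schema instead.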

Answer 12 · 2022-12-21T07:55:56.000Z

So basically what you need is to slice ASCII records based on field lengths from a copybook, with all columns as strings, right?

I think in ASCII files you can only have numbers with usage DISPLAY. So if numbers could be retained as strings, it could help you, right?
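
The slicing described here can be sketched in a few lines of Python. The layout below stands in for field lengths that would come from a copybook; the field names, lengths, and record are hypothetical, and this is not how Cobrix implements it.

```python
def slice_record(record: str, layout):
    """Split a fixed-width ASCII record into string fields without trimming.

    layout: list of (field_name, length) pairs, in record order.
    """
    fields = {}
    offset = 0
    for name, length in layout:
        fields[name] = record[offset:offset + length]
        offset += length
    return fields


# Hypothetical layout: PIC 9(5), PIC X(10), PIC 9(7), all with usage DISPLAY.
layout = [("ID", 5), ("NAME", 10), ("AMOUNT", 7)]

# AMOUNT contains an invalid character ('A'); as a string it is preserved
# as is, so a downstream data quality check can report it.
record = "00042JOHN DOE  0012A45"

print(slice_record(record, layout))
```

Because nothing is parsed or trimmed, padding spaces and invalid digits survive, which is the behaviour the feature request asks for.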

Here is another feature request related to this: #25

Answer 13 · 2022-12-21T08:09:13.000Z

Yes, correct.

Is #25 already available in Cobrix? If so, please add it to the documentation, as I don't see it there.

Answer 14 · 2022-12-21T14:20:18.000Z

No, it is not implemented yet, but it is in the plans to implement it in the future.