AbsaOSS/cobrix

ebcdic_code_page for German character ä,ß,ü

MJames1030 opened this issue · 9 comments

Hi,

I'm using the Cobrix libraries on Databricks to convert EBCDIC files. I now have a file that uses the German alphabet, and I could not find an ebcdic_code_page that reads German characters.

Example: "{u~eren R}cksitzpl{tze in Mikrovlies" is returned instead of "der äußeren Rücksitzplätze in Mikrovlies " ArtVerlours Eco ""

Thank you,
Jamal

Hi @MJames1030, thanks for the feature request!
It is possible to add a custom EBCDIC code page if you know the EBCDIC -> ASCII/Unicode conversion for your characters. See the example here:

But if your code page is one of the standard ones, and you know which code page is used at the source, we can add support for it directly in Cobrix.

Hi @yruslan ,

Thank you for your feedback. We are able to convert the EBCDIC file by using code page 273. If you could add it directly to Cobrix, that would be great.

Thank you in advance,
Jamal
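For readers wondering why the text came out garbled: the reported output matches what happens when bytes written in code page 273 are decoded with a US-style EBCDIC code page such as 037. The sketch below shows this with Python's built-in `cp273` and `cp037` codecs. It does not use Cobrix, and the thread does not say which code page actually produced the garbled text, so the choice of 037 is an assumption made for illustration.

```python
# Illustration only: uses Python's standard codecs, not Cobrix.
text = "der äußeren Rücksitzplätze"

# Bytes as a system using German EBCDIC (code page 273) would store them.
raw = text.encode("cp273")

# Decoding with code page 037 (assumed here) misreads ä, ß and ü,
# because those byte values map to '{', '~' and '}' in that code page.
wrong = raw.decode("cp037")

# Decoding with code page 273 restores the original characters.
right = raw.decode("cp273")

print(wrong)  # der {u~eren R}cksitzpl{tze
print(right)  # der äußeren Rücksitzplätze
```

The unaccented letters and spaces sit at the same byte positions in both code pages, which is why only the umlauts and ß are affected.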

The support is added in this branch: https://github.com/AbsaOSS/cobrix/tree/feature/653-add-ebcdic-codepage-273

If you could test it before we release the new version of Cobrix, that would help ensure it works for you as expected.

You can build a Cobrix bundle JAR using sbt assembly, and then use the snapshot JAR in your Spark environment:

sbt -DSPARK_VERSION="3.4.0" ++2.12.17 assembly

The code page can be selected by passing the option to the Spark reader:

spark.read.format("cobol")
  .option("ebcdic_code_page", "cp273")
  ...
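For PySpark users (for example on Databricks), the same option can be wrapped in a small helper. This is only a sketch: `read_cp273` and its parameter names are hypothetical, and the paths in the usage note are placeholders. `copybook` is the Cobrix option that points to the copybook file, and `ebcdic_code_page` is the option shown above.

```python
def read_cp273(spark, copybook_path, data_path):
    """Read an EBCDIC file whose text fields use code page 273.

    `spark` is an existing SparkSession with the Cobrix bundle JAR
    on its classpath (hypothetical helper, not part of Cobrix).
    """
    return (
        spark.read.format("cobol")
        .option("copybook", copybook_path)     # COBOL copybook describing the records
        .option("ebcdic_code_page", "cp273")   # German EBCDIC code page
        .load(data_path)
    )
```

A call could look like `df = read_cp273(spark, "/path/to/copybook.cpy", "/path/to/data")`, where both paths are placeholders for your own files.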

Hi @MJames1030, use 'spark-cobol-...-SNAPSHOT-bundle.jar', not 'cobol-parser-*'. The COBOL parser is for use cases that do not use Spark.

Hi @yruslan ,

I confirm that's working.

Thank you for the work,

Awesome, this will be released soon.

Do you have a date in mind ? :)

Tomorrow, or in worst case Thursday 😆

Fixed in 2.6.10