ebcdic_code_page for German characters ä, ß, ü
MJames1030 opened this issue · 9 comments
Hi,
I'm using the Cobrix libraries on Databricks to convert EBCDIC files. I now have a file that uses the German alphabet, and I could not find an ebcdic_code_page that reads German characters correctly.
Example: "{u~eren R} cksitzpl {those in Mikrovlies" is returned instead of "deräußeren Rücksitzplätze in Mikrovlies " ArtVerlours Eco ""
Thank you,
Jamal
Hi @MJames1030, thanks for the feature request!
It is possible to add a custom EBCDIC code page if you know the EBCDIC -> ASCII/Unicode conversion for your characters. See the example here:
- Example custom code page: https://github.com/AbsaOSS/cobrix/blob/master/spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/utils/CustomCodePage.scala
- Example code that uses the custom code page:
spark.read.format("cobol")
.option("ebcdic_code_page_class", "za.co.absa.cobrix.spark.cobol.source.utils.CustomCodePage")
But if your code page is one of the standard ones, and you know which code page is used at the source, we can add support for it directly in Cobrix.
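To illustrate what an "EBCDIC -> ASCII/Unicode conversion" amounts to, here is a minimal Python sketch of a 256-entry lookup table. It is not Cobrix code (the linked CustomCodePage.scala is the actual example). It assumes Python's built-in cp037 codec as the base mapping and overrides the three byte values where CP273 places ä, ü and ß. The names OVERRIDES, TABLE and decode_custom are hypothetical.

```python
# Conceptual sketch only: a 256-entry EBCDIC -> Unicode lookup table.
# Assumption: start from Python's built-in cp037 codec and override a few bytes
# with the characters that CP273 assigns to them.
OVERRIDES = {
    0xC0: "ä",  # cp037 decodes this byte as "{"
    0xD0: "ü",  # cp037 decodes this byte as "}"
    0xA1: "ß",  # cp037 decodes this byte as "~"
}

# One output character per possible input byte value.
TABLE = [OVERRIDES.get(b, bytes([b]).decode("cp037")) for b in range(256)]


def decode_custom(data: bytes) -> str:
    """Decode EBCDIC bytes by looking each byte up in TABLE."""
    return "".join(TABLE[b] for b in data)
```

A real code page differs in more positions than these three, so a full table should be taken from the code page specification rather than patched by hand.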
Hi @yruslan ,
Thank you for your feedback. We are able to convert the EBCDIC file by using code page 273. If you could add it directly to Cobrix, that would be great.
Thank you in advance,
Jamal
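For reference, the garbling in the original report is consistent with CP273 bytes being decoded with a different EBCDIC page. The sketch below reproduces the effect using only Python's built-in codecs, not Cobrix. Using cp037 as the "wrong" page is an assumption, since the thread does not say which page was in effect.

```python
# Encode German text as CP273, then decode it with the wrong and the right page.
text = "äußeren Rücksitzplätze"
raw = text.encode("cp273")

garbled = raw.decode("cp037")   # assumption: cp037 stands in for the wrong page
restored = raw.decode("cp273")  # decoding with the matching page round-trips

print(garbled)   # braces and a tilde appear where ä, ü and ß were
print(restored)
```

Plain letters survive either way because they share byte values across both pages; only the national characters move.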
The support is added in this branch: https://github.com/AbsaOSS/cobrix/tree/feature/653-add-ebcdic-codepage-273
If you could test it before we release the new version of Cobrix, that could help to ensure it works for you as expected.
You can build a Cobrix bundle JAR using sbt assembly, and use the snapshot JAR in your Spark environment:
sbt -DSPARK_VERSION="3.4.0" ++2.12.17 assembly
The code page can be selected by passing the option to the Spark reader:
spark.read.format("cobol")
.option("ebcdic_code_page", "cp273")
...
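A slightly fuller PySpark sketch of the same reader call, for Databricks notebooks that use Python. It assumes a Spark session with a Cobrix bundle JAR that includes CP273 support on the classpath. The function name and both path arguments are hypothetical placeholders, and any other options your copybook needs are omitted.

```python
def read_ebcdic_cp273(spark, copybook_path, data_path):
    """Read an EBCDIC file through Cobrix, decoding text fields as CP273.

    `spark` is an existing SparkSession; both paths are placeholders.
    """
    return (
        spark.read.format("cobol")
        .option("copybook", copybook_path)    # copybook describing the record layout
        .option("ebcdic_code_page", "cp273")  # German EBCDIC code page
        .load(data_path)
    )
```

Usage would look like `df = read_ebcdic_cp273(spark, "/path/to/copybook.cpy", "/path/to/data")`, with both paths replaced by real locations.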
Hi @MJames1030, use 'spark-cobol-...-SNAPSHOT-bundle.jar', not 'cobol-parser-*'. The COBOL parser is for use cases that do not use Spark.
Awesome, this will be released soon.
Do you have a date in mind ? :)
Tomorrow, or in worst case Thursday 😆
Fixed in 2.6.10