Spark DataFrame Writer for Cobol datafiles
Background
I work for a credit card company in the retail sector, and we are currently utilizing Cobrix to acquire data from our credit card transaction processor and produce business events to Kafka for our event driven architecture and analytic platform.
Thanks to @yruslan and his work with #338, Cobrix is now fully functional for our data ingest use case; however, our electronic data interchange with this business partner is bidirectional.
For example, we receive mainframe data transmissions for things like customer purchases and account status, but we also have to transmit monetary data to our mainframe-based partner for things like credits and adjustments, and non-monetary data for account configuration changes, including but not limited to change of address.
Additionally, we believe that such a feature could be used to simplify the process of creating test data for our system.
Feature
Implement a Spark DataFrame writer for Cobol data, the feature should:
- Derive a default copybook layout from the Spark Schema
- Support configurable endianness
- Support configurable code page output
- Support writing Cobol output data files in the F, FB, V, VB file types from https://www.ibm.com/docs/en/zos-basic-skills?topic=set-data-record-formats
- Support the writing of a copybook file that matches the output schema as-written
- Provide a declarative configuration option to override individual DataFrame Schema -> Copybook transformation decisions at a field level (a hypothetical sketch follows this list), including:
- specify width for PIC X(n) fields
- specify scale and precision for PIC 9 fields such as S9(11)V99
- specify binary packing options for individual fields such as COMP-3
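To make the field-level overrides concrete, here is a purely hypothetical sketch; none of these option names exist in Cobrix, they only illustrate the kind of declarative configuration we have in mind:

df.write
  .format("cobol")
  // Hypothetical options, for illustration only: per-field overrides of the
  // copybook that would otherwise be derived from the DataFrame schema.
  .option("field.CUSTOMER_NAME.pic", "X(40)")    // width for a PIC X(n) field
  .option("field.TXN_AMOUNT.pic", "S9(11)V99")   // scale and precision
  .option("field.TXN_AMOUNT.usage", "COMP-3")    // binary packing
  .save("/some/output/path")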
Proposed Solution [Optional]
We could contribute development labor to the implementation of this feature; however, we would need assistance with the high-level design should such a feature be accepted. At this point I would like to open a discussion about how such a feature might be implemented.
This sounds great. The demand for the feature seems to exist already, but the feature requires a lot of effort. This could be a good collaboration. As soon as the implementation of VBVR is finished (probably end of next week), I can prepare a design document for a Cobol file writer. We can discuss the features the writer can support and prioritize the ones required for your use case. The features that are useful but not immediately required for you we can implement later from our side.
I think the work can be divided into independent tasks, and with your help the feature can be implemented much faster.
I had a meeting to discuss the first draft of these requirements, and one of my peers suggested that while dynamically creating a copybook from a Spark schema and declarative configuration is a nice feature, it might be complex to implement and isn't really necessary for an MVP.
My colleague suggested that perhaps a better idea would be to require a copybook layout be passed into the data frame writer, since we would have to set static field sizes for every column in the data frame anyway.
Of course we would have to verify that the DF schema can be mapped to the Copybook schema, but that may be an easier lift than programmatically generating a copybook.
In our use case the copybook is defined by our business partner, and we would have to ensure that the DF we generate can map to the service contract (copybook) that they are expecting.
Also on the subject of narrowing the MVP features, our use case only requires a single code page (I believe it is cp037 but will verify with the business partner), and only big endian.
All of our data ingest code uses CodePageCommon which is working adequately so far.
Good. We can start looking into requirements in about 2 weeks.
Actually, generating our own copybook from a Spark dataframe is easier since we can choose output data types. Conforming to an existing copybook would require supporting the plethora of formats that COBOL supports (picture, usage, etc). But conforming to an existing copybook is something that is usually required, so that's something that we should implement at some point anyway. And since it matches your use case we can look into that first.
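As a rough illustration of what choosing output data types could look like, a default Spark-type-to-PICTURE mapping might be something like the sketch below (the clauses and widths are assumptions, not an actual Cobrix mapping):

import org.apache.spark.sql.types._

// Illustrative sketch only: one possible default mapping from Spark types to
// COBOL PICTURE clauses if a copybook were generated from a DataFrame schema.
// The default widths are assumptions.
def defaultPic(dataType: DataType): String = dataType match {
  case StringType     => "PIC X(100)"
  case IntegerType    => "PIC S9(9)"
  case LongType       => "PIC S9(18)"
  case d: DecimalType => s"PIC S9(${d.precision - d.scale})V9(${d.scale})"
  case _              => "PIC X(100)" // fallback for types without a natural mapping
}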
Supporting only cp037 or basic + cp037 is good as well.
What about data formats? Do you need support for F, V, VB (RDW, no RDW, BDW+RDW), or can we just start with basic V (RDW)?
I have a colleague researching this now, but the preliminary answer is that we need FB and VB formats. In a day or two I'll have a final answer and copybooks for you to review.
Mark is leaving Nordstrom and I will be taking over as a contact for Nordstrom
@yruslan as @milehighhokie indicated I have accepted a new position in another company and Bill will be taking over this issue for my former employer. We had a turnover meeting this morning, and I reminded him that you are still waiting on copybook examples for the outbound data transfer use case that I outlined in this issue.
I want to extend my thanks for the excellent support I have received while using Cobrix, and in particular I appreciate the opportunity to collaborate with you on adding the new record format readers.
Thanks for the kind words, Mark! Enjoy the holiday season and the best of luck at the new role!
@milehighhokie , looking forward to future collaboration.
Hi @yruslan, we have a similar requirement for a copybook writer. You have closed this issue. Did you make any progress on the Spark DataFrame writer for copybook data files?
Hi, sorry, the writer would require a lot of effort, and we don't have the capacity nor internal demand for it at the moment.
But it is in our long-term plans to do it.
Any updates on this? Any plans to implement it this year?
@yruslan - How can we collaborate on this feature with you? We are using Cobrix at a bank and are successfully using it for MF-to-cloud data ingestion. However, we have a requirement to enable a bidirectional flow to sync data back to the mainframe.
We have potential use cases for writing EBCDIC files as well, but it is not of high priority at the moment.
The writing feature would be very nice to have. A collaboration would definitely help. From our side, we can implement a basic/skeleton functionality of writing EBCDIC files from Spark dataframes. Then, if you have people at your side willing to contribute, the feature can be extended.
By basic functionality I mean writing EBCDIC mainframe files with the following constraints (an illustrative copybook follows this list):
- The output copybook is provided by the user, not generated by Cobrix.
- The copybook should use only DISPLAY format for fields (no COMP-3 or binary numbers).
- struct types (GROUPs) are okay, but no arrays (OCCURs) initially.
- Only basic Unicode to EBCDIC code page.
- Only batch output, no streaming support.
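For reference, a copybook that fits these constraints might look like the following (the field names are made up): every field uses the default DISPLAY usage, and the only nesting is a plain group.

01  OUTPUT-RECORD.
    05  ACCOUNT-ID       PIC X(10).
    05  CUSTOMER.
        10  FIRST-NAME   PIC X(20).
        10  LAST-NAME    PIC X(30).
    05  BALANCE          PIC S9(7)V99.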
In any case, this is a very big endeavour, as complex as reading mainframe files for general use cases.
We are fine with the basic functionality, including "The copybook should use only DISPLAY format for fields (no COMP-3 or binary numbers)". Let us know how we can proceed.
Hi, @pinakigit
I'm planning to create a skeleton implementation and will let you know when a PR is ready.
The skeleton implementation is going to include the basic functionality. New features can be contributed after that.
Thank you,
Ruslan
Thanks. Eagerly waiting for the skeleton implementation. We will be willing to contribute any way we can to extend the features.
No updates so far.
No updates so far. Unfortunately, too busy this month. Hopefully there will be some progress next month.
Started working on it. It might take about a month to have a first writer with bare minimum features
Thanks. A basic skeleton will be good to start with. Will wait for it.
Hi, @yruslan, I hope you are doing well! I wanted to check in and see how things are progressing with the writer.
It is work in progress. I think the first version should be available sometime in July.
Hi, @yruslan, I hope you are doing well! I wanted to check in and see how things are progressing with the writer.
Still in progress. I think a basic version of a writer is going to be available in the first half of August.
@yruslan, do we have any updates on this? I hope the basic version will handle COMP and COMP-3 fields.
There is a basic writer already in the feature branch. Planning for it to go to master next week.
The usage is as follows:
df.write
  .format("cobol")
  .mode(SaveMode.Overwrite)
  .option("copybook_contents", copybookContents)
  .save("/some/output/path")

It has many limitations (a fuller end-to-end sketch follows this list):
- GROUPs are not supported. Only flat copybooks, like:
  01 RECORD.
     05 FIELD_1 PIC X(1).
     05 FIELD_2 PIC X(5).
- Only 'PIC X(n)' are supported, no numeric types.
- Only fixed record length output
- REDEFINES, OCCURS are not supported
- Only the core EBCDIC encoder is supported, no EBCDIC code pages at the moment.
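Putting the snippet and the limitations together, a minimal end-to-end sketch could look like this (the copybook, column names, and output path are made up, and it assumes the DataFrame columns correspond to the copybook fields):

import org.apache.spark.sql.{SaveMode, SparkSession}

// Minimal sketch under the limitations above: a flat copybook with only
// PIC X(n) fields and a DataFrame of matching string columns.
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val copybookContents =
  """      01  RECORD.
    |          05  CURRENCY  PIC X(3).
    |          05  AMOUNT    PIC X(10).
    |""".stripMargin

val df = Seq(("USD", "125.99"), ("EUR", "1050.00")).toDF("CURRENCY", "AMOUNT")

df.write
  .format("cobol")
  .mode(SaveMode.Overwrite)
  .option("copybook_contents", copybookContents)
  .save("/some/output/path")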
Thanks yruslan. Couple of questions.
- When can we expect the COMP and COMP-3 features?
- Currently we have spark-cobol and cobol-parser version 2.7.7. Do we need to upgrade to a newer version to access this new feature?
- Instead of the copybook contents, can I give the copybook path, which will have the copybook as a text file, and will it work the same way it does when reading binary files?
Hi @pinakigit ,
- I can't give you timelines, but roughly in 2-3 weeks.
- Yes, you'd have to update to a new version. It might be '2.9.x'.
- Yes, as with the reader, you can specify the path to the copybook (a sketch follows below).
Also, please remember, GROUPs are also not supported, so the copybook needs to be flat. GROUPs are going to be supported even later.
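Assuming the writer mirrors the reader here and accepts the same 'copybook' option for a path (this is an assumption about the writer, not confirmed above), it would presumably look something like:

// Assumption: the writer takes a copybook path via the same 'copybook' option
// name the reader uses; the path below is a placeholder.
df.write
  .format("cobol")
  .mode(SaveMode.Overwrite)
  .option("copybook", "/path/to/copybook.cpy")
  .save("/some/output/path")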
Thanks again @yruslan.
- Getting the COMP and COMP-3 changes in 2 to 3 weeks would be amazing.
- Can you please confirm the updated Cobrix version and where we can get it? I see 2.8.4 in Maven, which was updated in June 2025. The Cobrix page also has version 2.8.4.
Yes we are fine with not having GROUPS as of now.
It is going to be 2.9.0, which is not released yet.
Thanks for the quick response. Please let us know when it's released and we will test it out.
Sure, as soon as COMP-3 and COMP support is added and 2.9.0 is released, I will let you know.
The release of spark-cobol version 2.9.0 is planned to be next week.
Thanks for the update
Cobrix 2.9.0 is released with the basic writer features. Details are here:
https://github.com/AbsaOSS/cobrix/tree/master?tab=readme-ov-file#ebcdic-writer-experimental
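Since the writer is experimental, one simple sanity check is to read the written files back with the existing Cobrix reader and compare against the source DataFrame; a rough sketch, reusing the variables and path from the write example above:

// Sketch: read the fixed-length EBCDIC output back with the Cobrix reader
// (fixed record length is the reader's default) and compare record counts.
val readBack = spark.read
  .format("cobol")
  .option("copybook_contents", copybookContents)
  .load("/some/output/path")

assert(readBack.count() == df.count())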
Thanks. I see the COBOL parser in Maven for 2.9.0, but not spark-cobol. spark-cobol is still 2.8.4.
As per the documentation, we will only need spark-cobol and cobol-parser 2.9.0 and won't need scodec and antlr4 anymore. Correct me if I am wrong.
2.9.0 should be in Maven Central. The search index might be lagging.
https://search.maven.org/artifact/za.co.absa.cobrix/spark-cobol_2.12/2.9.0/jar
Yes, scodec and antlr4 are not needed anymore. scodec was removed as a dependency, and antlr4 is shaded together with spark-cobol.
Thanks. Checked a couple of files and they look good. Will check further and let you know in case of any issues.
Appreciate all the efforts you have put in for this !!