Spark DataFrame Writer for Cobol datafiles
Background
I work for a credit card company in the retail sector, and we are currently utilizing Cobrix to acquire data from our credit card transaction processor and produce business events to Kafka for our event driven architecture and analytic platform.
Thanks to @yruslan and his work with #338, Cobrix is now fully functional for our data ingest use case; however, our electronic data interchange with this business partner is bidirectional.
For example, we receive mainframe data transmissions for things like customer purchases and account status, but we also have to transmit monetary data to our mainframe-based partner for things like credits and adjustments, and non-monetary data for account configuration changes, including but not limited to change of address.
Additionally, we believe that such a feature could be used to simplify the process of creating test data for our system.
Feature
Implement a Spark DataFrame writer for Cobol data, the feature should:
- Derive a default copybook layout from the Spark Schema
- Support configurable endianness
- Support configurable code page output
- Support writing Cobol output data files in the F, FB, V, VB file types from https://www.ibm.com/docs/en/zos-basic-skills?topic=set-data-record-formats
- Support the writing of a copybook file that matches the output schema as-written
- Provide a declarative configuration option to override individual DataFrame Schema -> Copybook transformation decisions at a field level (a hypothetical sketch follows this list), including:
- specify width for PIC X(n) fields
- specify scale and precision for PIC 9 fields such as S9(11)V99
- specify binary packing options for individual fields such as COMP-3
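To make the field-level overrides concrete, here is a purely hypothetical sketch; none of these option names exist in Cobrix, they only illustrate the kind of declarative configuration we have in mind:

df.write
  .format("cobol")
  // Hypothetical options, for illustration only: per-field overrides of the
  // copybook that would otherwise be derived from the DataFrame schema.
  .option("field.CUSTOMER_NAME.pic", "X(40)")    // width for a PIC X(n) field
  .option("field.TXN_AMOUNT.pic", "S9(11)V99")   // scale and precision
  .option("field.TXN_AMOUNT.usage", "COMP-3")    // binary packing
  .save("/some/output/path")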
Proposed Solution [Optional]
We could contribute development labor to the implementation of this feature; however, we would need assistance with the high-level design should such a feature be accepted. At this point I would like to open a discussion about how such a feature might be implemented.
This sounds great. The demand for the feature seems to exist already, but the feature requires a lot of effort. This could be a good collaboration. As soon as the implementation of VBVR is finished (probably end of next week), I can prepare a design document for a Cobol file writer. We can discuss the features the writer can support and prioritize the ones required for your use case. The features that are useful but not immediately required for you we can implement later from our side.
I think the work can be divided into independent tasks, and with your help the feature can be implemented much faster.
I had a meeting to discuss the first draft of these requirements, and one of my peers suggested that while dynamically creating a copybook from a Spark schema and declarative configuration is a nice feature, it might be complex to implement and isn't really necessary for an MVP.
My colleague suggested that perhaps a better idea would be to require a copybook layout be passed into the data frame writer, since we would have to set static field sizes for every column in the data frame anyway.
Of course we would have to verify that the DF schema can be mapped to the Copybook schema, but that may be an easier lift than programmatically generating a copybook.
In our use case the copybook is defined by our business partner, and we would have to ensure that the DF we generate can map to the service contract (copybook) that they are expecting.
Also on the subject of narrowing the MVP features, our use case only requires a single code page (I believe it is cp037 but will verify with the business partner), and only big endian.
All of our data ingest code uses CodePageCommon which is working adequately so far.
Good. We can start looking into requirements in about 2 weeks.
Actually, generating our own copybook from a Spark dataframe is easier since we can choose output data types. Conforming to an existing copybook would require supporting the plethora of formats that COBOL supports (picture, usage, etc). But conforming to an existing copybook is something that is usually required, so that's something that we should implement at some point anyway. And since it matches your use case we can look into that first.
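As a rough illustration of what choosing output data types could look like, a default Spark-type-to-PICTURE mapping might be something like the sketch below (the clauses and widths are assumptions, not an actual Cobrix mapping):

import org.apache.spark.sql.types._

// Illustrative sketch only: one possible default mapping from Spark types to
// COBOL PICTURE clauses if a copybook were generated from a DataFrame schema.
// The default widths are assumptions.
def defaultPic(dataType: DataType): String = dataType match {
  case StringType     => "PIC X(100)"
  case IntegerType    => "PIC S9(9)"
  case LongType       => "PIC S9(18)"
  case d: DecimalType => s"PIC S9(${d.precision - d.scale})V9(${d.scale})"
  case _              => "PIC X(100)" // fallback for types without a natural mapping
}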
Supporting only cp037 or basic + cp037 is good as well.
What about data formats? Do you need support for F, V, VB (RDW, no RDW, BDW+RDW), or can we just start with basic V (RDW)?
I have a colleague researching this now, but the preliminary answer is that we need FB and VB formats. In a day or two I'll have a final answer and copybooks for you to review.
Mark is leaving Nordstrom and I will be taking over as a contact for Nordstrom
@yruslan as @milehighhokie indicated I have accepted a new position in another company and Bill will be taking over this issue for my former employer. We had a turnover meeting this morning, and I reminded him that you are still waiting on copybook examples for the outbound data transfer use case that I outlined in this issue.
I want to extend my thanks for the excellent support I have received while using Cobrix, and in particular I appreciate the opportunity to collaborate with you on adding the new record format readers.
Thanks for the kind words, Mark! Enjoy the holiday season and the best of luck at the new role!
@milehighhokie , looking forward to future collaboration.
Hi @yruslan, we have a similar requirement for a copybook writer. You have closed this issue. Did you make any progress on the Spark DataFrame writer for copybook data files?
Hi, sorry, the writer would require a lot of effort, and we don't have the capacity nor internal demand for it at the moment.
But it is in our long-term plans to do it.
Any updates on this? Any plans to implement it this year?
@yruslan - How can we collaborate on this feature with you? We are using Cobrix at a bank and are successfully using it for MF-to-cloud data ingestion. However, we have a requirement to enable a bidirectional flow to sync data back to the mainframe.
We have potential use cases for writing EBCDIC files as well, but it is not of high priority at the moment.
The writing feature would be very nice to have. A collaboration would definitely help. From our side, we can implement a basic/skeleton functionality of writing EBCDIC files from Spark dataframes. Then, if you have people at your side willing to contribute, the feature can be extended.
By basic functionality I mean writing EBCDIC mainframe files with the following constraints (an illustrative copybook follows this list):
- The output copybook is provided by the user, not generated by Cobrix.
- The copybook should use only DISPLAY format for fields (no COMP-3 or binary numbers).
- struct types (GROUPs) are okay, but no arrays (OCCURs) initially.
- Only basic Unicode to EBCDIC code page.
- Only batch output, no streaming support.
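For reference, a copybook that fits these constraints might look like the following (the field names are made up): every field uses the default DISPLAY usage, and the only nesting is a plain group.

01  OUTPUT-RECORD.
    05  ACCOUNT-ID       PIC X(10).
    05  CUSTOMER.
        10  FIRST-NAME   PIC X(20).
        10  LAST-NAME    PIC X(30).
    05  BALANCE          PIC S9(7)V99.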
In any case, this is a very big endeavour, as complex as reading mainframe files for general use cases.
We are fine with the basic functionality, including "The copybook should use only DISPLAY format for fields (no COMP-3 or binary numbers)". Let us know how we can proceed.
Hi, @pinakigit
I'm planning to create a skeleton implementation and will let you know when a PR is ready.
The skeleton implementation is going to include the basic functionality. New features can be contributed after that.
Thank you,
Ruslan
Thanks. Eagerly waiting for the skeleton implementation. We will be willing to contribute any way we can to extend the features.
No updates so far.
No updates so far. Unfortunately, too busy this month. Hopefully there will be some progress next month.
Started working on it. It might take about a month to have a first writer with bare minimum features
Thanks. A basic skeleton will be good to start with. Will wait for it.
Hi, @yruslan, I hope you are doing well! I wanted to check in and see how things are progressing with the writer.
It is work in progress. I think the first version should be available sometime in July.
Hi, @yruslan, I hope you are doing well! I wanted to check in and see how things are progressing with the writer.
Still in progress. I think a basic version of a writer is going to be available in the first half of August.
@yruslan, do we have any updates on this? I hope the basic version will handle COMP and COMP-3 fields.
There is a basic writer already in the feature branch. Planning for it to go to master next week.
The usage is as follows:
df.write
  .format("cobol")
  .mode(SaveMode.Overwrite)
  .option("copybook_contents", copybookContents)
  .save("/some/output/path")

It has many limitations (a fuller end-to-end sketch follows this list):
- GROUPs are not supported. Only flat copybooks, like:
  01 RECORD.
     05 FIELD_1 PIC X(1).
     05 FIELD_2 PIC X(5).
- Only 'PIC X(n)' are supported, no numeric types.
- Only fixed record length output
- REDEFINES, OCCURS are not supported
- Only the core EBCDIC encoder is supported, no EBCDIC code pages at the moment.
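Putting the snippet and the limitations together, a minimal end-to-end sketch could look like this (the copybook, column names, and output path are made up, and it assumes the DataFrame columns correspond to the copybook fields):

import org.apache.spark.sql.{SaveMode, SparkSession}

// Minimal sketch under the limitations above: a flat copybook with only
// PIC X(n) fields and a DataFrame of matching string columns.
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val copybookContents =
  """      01  RECORD.
    |          05  CURRENCY  PIC X(3).
    |          05  AMOUNT    PIC X(10).
    |""".stripMargin

val df = Seq(("USD", "125.99"), ("EUR", "1050.00")).toDF("CURRENCY", "AMOUNT")

df.write
  .format("cobol")
  .mode(SaveMode.Overwrite)
  .option("copybook_contents", copybookContents)
  .save("/some/output/path")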
Thanks yruslan. Couple of questions.
- When can we expect the COMP and COMP-3 features?
- Currently we have spark-cobol and cobol-parser version 2.7.7. Do we need to upgrade to a newer version to access this new feature?
- Instead of the copybook contents, can I give the copybook path, which will have the copybook as a text file, and will it work the same way it does when reading binary files?
Hi @pinakigit ,
- I can't give you timelines, but roughly in 2-3 weeks.
- Yes, you'd have to update to a new version. It might be '2.9.x'.
- Yes, as with the reader, you can specify the path to the copybook (a sketch follows below).
Also, please remember, GROUPs are also not supported, so the copybook needs to be flat. GROUPs are going to be supported even later.
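Assuming the writer mirrors the reader here and accepts the same 'copybook' option for a path (this is an assumption about the writer, not confirmed above), it would presumably look something like:

// Assumption: the writer takes a copybook path via the same 'copybook' option
// name the reader uses; the path below is a placeholder.
df.write
  .format("cobol")
  .mode(SaveMode.Overwrite)
  .option("copybook", "/path/to/copybook.cpy")
  .save("/some/output/path")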
Thanks again @yruslan.
- Getting the COMP and COMP-3 changes in 2 to 3 weeks would be amazing.
- Can you please confirm the updated Cobrix version and where we can get it? I see 2.8.4 in Maven, which was updated in June 2025. The Cobrix page also has version 2.8.4.
Yes we are fine with not having GROUPS as of now.
It is going to be 2.9.0, which is not released yet.
Thanks for the quick response. Please let us know when it's released and we will test it out.
Sure, as soon as COMP-3 and COMP support is added and 2.9.0 is released, I will let you know.
The release of spark-cobol version 2.9.0 is planned to be next week.
Thanks for the update
Cobrix 2.9.0 is released with the basic writer features. Details are here:
https://github.com/AbsaOSS/cobrix/tree/master?tab=readme-ov-file#ebcdic-writer-experimental
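Since the writer is experimental, one simple sanity check is to read the written files back with the existing Cobrix reader and compare against the source DataFrame; a rough sketch, reusing the variables and path from the write example above:

// Sketch: read the fixed-length EBCDIC output back with the Cobrix reader
// (fixed record length is the reader's default) and compare record counts.
val readBack = spark.read
  .format("cobol")
  .option("copybook_contents", copybookContents)
  .load("/some/output/path")

assert(readBack.count() == df.count())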
Thanks. I see the COBOL parser in Maven for 2.9.0, but not spark-cobol. spark-cobol is still 2.8.4.
As per the documentation, we will only need spark-cobol and cobol-parser 2.9.0 and won't need scodec and antlr4 anymore. Correct me if I am wrong.
2.9.0 should be in Maven Central. The search index might be lagging.
https://search.maven.org/artifact/za.co.absa.cobrix/spark-cobol_2.12/2.9.0/jar
Yes, scodec and antlr4 are not needed anymore. scodec was removed as a dependency, and antlr4 is shaded together with spark-cobol.
Thanks. Checked a couple of files and they look good. Will check further and let you know in case of any issues.
Appreciate all the efforts you have put in for this !!