
Processing .gz files


Hi Team,
While processing .gz files with Cobrix we are getting the following error:

There are some files in abc.gz that are NOT DIVISIBLE by the RECORD SIZE calculated from the copybook (3018 bytes per record). Check the logs for the names of the files.

However, my abc.gz contains only one file. Does Cobrix support .gz file processing? If not, can we pass an InputStream to Cobrix instead of a file path?

Hi @srinubabuin ,

No, compression is not supported, and neither are InputStreams (although I'm not 100% sure what you mean there). The error appears because Cobrix reads the raw compressed bytes of abc.gz, and their length is not a multiple of the 3018-byte record size calculated from the copybook.

The best option is to unpack the file first.
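For illustration, here is a minimal sketch of what unpacking could look like on HDFS using Hadoop's codec factory before reading the file with spark-cobol. The paths and copybook location are assumptions, not from this thread:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils
import org.apache.hadoop.io.compress.CompressionCodecFactory
import org.apache.spark.sql.SparkSession

object UnpackGzThenRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("UnpackGzThenRead").getOrCreate()
    val conf = spark.sparkContext.hadoopConfiguration
    val fs = FileSystem.get(conf)

    // Hypothetical paths; substitute your own
    val compressed = new Path("/data/abc.gz")
    val unpacked = new Path("/data/abc.dat")

    // Resolve the codec from the file extension (.gz -> GzipCodec)
    val codec = new CompressionCodecFactory(conf).getCodec(compressed)
    val in = codec.createInputStream(fs.open(compressed))
    val out = fs.create(unpacked, true)
    IOUtils.copyBytes(in, out, 65536, true) // copies and closes both streams

    // Read the unpacked fixed-length records with spark-cobol as usual
    val df = spark.read
      .format("cobol")
      .option("copybook", "/copybooks/abc.cpy")
      .load(unpacked.toString)

    df.show()
  }
}
```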

Hi Yruslan,
https://github.com/AbsaOSS/cobrix/blob/master/spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala
In this code we are ultimately constructing a BufferedFSDataInputStream from filePath, so can I pass a BufferedFSDataInputStream directly instead of a filePath?

private var bufferedStream = new BufferedFSDataInputStream(getHadoopPath(filePath), fileSystem, startOffset, Constants.defaultStreamBufferInMB, maximumBytes)

Sorry, I'm not sure I understand. Keep in mind that the files will be read on the executors, not on the driver node, and you cannot pass a stream from the driver to an executor. You need to create the stream on the executor, but you can create it there from the file path.

Alternatively, you can use RDDs to read and uncompress the input files, and then apply a record extractor to them, as sketched below. The example is called "Working example 3 - Using RDDs and record parsers directly" in the README at https://github.com/AbsaOSS/cobrix
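A minimal sketch of the decompression half of that approach, assuming the 3018-byte fixed record length from the error above and a hypothetical input path. The resulting RDD[Array[Byte]] of records is what the Cobrix record parser shown in the README's "Working example 3" would then consume:

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPInputStream

import org.apache.spark.sql.SparkSession

object GzRddExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("GzRddExample").getOrCreate()
    val sc = spark.sparkContext

    val recordSize = 3018 // record size from the copybook, per the error message

    // binaryFiles yields (path, PortableDataStream) pairs; each stream is
    // opened on an executor, so decompression happens where the data is used.
    val records = sc.binaryFiles("/data/*.gz").flatMap { case (_, stream) =>
      val in = new GZIPInputStream(stream.open())
      val out = new ByteArrayOutputStream()
      val buffer = new Array[Byte](65536)
      var n = in.read(buffer)
      while (n >= 0) {
        out.write(buffer, 0, n)
        n = in.read(buffer)
      }
      in.close()
      // Split the uncompressed bytes into fixed-length records
      out.toByteArray.grouped(recordSize)
    }

    // From here, apply the record parser as in "Working example 3"
    println(s"Record count: ${records.count()}")
  }
}
```

Note that .gz is not a splittable format, so each file is decompressed in full by a single task; this is fine for modestly sized files but will not parallelize within one archive.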