Empty line at the end causes cobrix to create 1 more record

Question

MaksymFedorchuk opened this issue 4 years ago · 4 comments

For example, we have a file like this, with an empty line at the end:
fdhfhdsfsdff
dfdhjfwdsdd
dsfddkkfkgk

If I read it with the following options:
spark.read
  .format("cobol")
  .option("is_record_sequence", "true")
  .option("is_text", "true")
  .option("encoding", "ascii")
  .option("copybook", path_to_copybook)
  .load(path_to_file)

I get 4 records instead of 3. Is this a bug, or can it be fixed with some option?

Answer 1 · 2021-07-08T08:23:47.000Z

Could you attach the test file?
I just want to check whether the empty line contains no characters or at least one character.
When reading text files, Cobrix filters out empty lines. But since Windows uses CR LF line endings, while Linux and macOS use just LF, it is possible that a single character ends up in the last record.
I'll check the file and determine whether it is a bug or a feature. It's more likely to be a bug, though.
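
To illustrate the line-ending concern, here is a minimal sketch in plain Scala. It is not Cobrix's actual implementation: the helper `nonEmptyLines` is hypothetical and simply assumes a reader that splits on LF only and drops lines containing no characters. Under that assumption, a trailing "empty" line in a CR LF file still holds one CR character and is therefore not filtered out.

```scala
object LineEndingSketch {
  // Hypothetical helper (not part of Cobrix): split on LF only,
  // then drop lines that contain no characters at all.
  def nonEmptyLines(content: String): Seq[String] =
    content.split("\n", -1).toSeq.filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    // Same three records, followed by an empty line, with LF endings.
    val unixFile = "fdhfhdsfsdff\ndfdhjfwdsdd\ndsfddkkfkgk\n\n"
    // The same content with CR LF endings.
    val windowsFile = "fdhfhdsfsdff\r\ndfdhjfwdsdd\r\ndsfddkkfkgk\r\n\r\n"

    println(nonEmptyLines(unixFile).length)    // 3
    println(nonEmptyLines(windowsFile).length) // 4: the last "line" is a lone CR
  }
}
```

Whether this is what happens with the attached file depends on its actual line endings, which is why the test file is needed.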

Answer 2 · 2021-07-08T08:59:07.000Z

testfile.txt

Answer 3 · 2021-07-08T09:58:40.000Z

I've noticed something interesting. Try removing option("is_record_sequence", "true") and please let me know whether it works as expected.

Answer 4 · 2021-07-08T12:12:47.000Z

Bug confirmed. It happens when is_text = true and is_record_sequence = true.