twitter/hadoop-lzo

Potential thread safety issue with LzoDecompressor


The problem occurs when trying to read LZO-compressed files with Spark using sc.textFile(...).
It works fine when using LzoTextInputFormat with the same dataset and job config.
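For reference, here is a minimal sketch of the failing read path (the path, app name, and core count are illustrative, not our exact job config):

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative settings; the failures only show up once several
    // tasks decompress concurrently inside the same executor JVM.
    val conf = new SparkConf()
      .setAppName("lzo-read-repro")
      .set("spark.executor.cores", "4")
    val sc = new SparkContext(conf)

    // Reading .lzo files through the default TextInputFormat path.
    // This is the call that intermittently throws the InternalError below.
    sc.textFile("hdfs:///data/events/*.lzo").count()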

I encounter multiple instances of

java.lang.InternalError: lzo1x_decompress_safe returned: -6
    at com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native Method)
    at com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:315)
    at com.hadoop.compression.lzo.LzopDecompressor.decompress(LzopDecompressor.java:122)
    at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:252)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
    at java.io.InputStream.read(InputStream.java:101)
    at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)

And sometimes a few

Compressed length 892154724 exceeds max block size 67108864 (probably corrupt file)
  at com.hadoop.compression.lzo.LzopInputStream.getCompressedData(LzopInputStream.java:291)

These errors happen only when there are multiple threads per JVM (multiple executor cores).
We are using a 0.4.20 snapshot built starting from this commit.

Thanks

I think this was fixed in #103

Just tried with the latest commit, but the problem remains.

Too bad. Does each thread read from a different file, or do multiple threads read from the same file? Anything you can add here to make this easy to reproduce would be very useful.

So it looks like this is due to something that changed in Hadoop 2: when using the basic textFile method from Spark, the input is expected to be splittable (in my case the files are not indexed).

This was discussed on Stack Overflow. In any case, using the input format avoids the problem.
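For what it's worth, hadoop-lzo ships an indexer that writes the .index files needed to split large .lzo files at block boundaries. A rough sketch of invoking it programmatically, assuming the files live on HDFS under a hypothetical path (the indexer can also be run from the command line via its main class):

    import com.hadoop.compression.lzo.LzoIndexer
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    // Writes foo.lzo.index next to each foo.lzo; with the index present,
    // LzoTextInputFormat can split files at compressed-block boundaries.
    val indexer = new LzoIndexer(new Configuration())
    indexer.index(new Path("hdfs:///data/events"))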

Should I close this issue?

So this was in fact because the reader was trying to read from an arbitrary offset, right? Thanks for the update.

So will it be fixed in a future version? I hope sc.textFile can decompress and split any input files correctly and automatically.

I don't know if it has been fixed, but you can avoid the problem by using LzoTextInputFormat through the lower-level API methods that let you specify the input format; see the sketch below.
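Something along these lines, as a sketch (the path is hypothetical, sc is an existing SparkContext, and this assumes the new mapreduce API variant of the input format):

    import com.hadoop.mapreduce.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}

    // LzoTextInputFormat computes the splits itself: one split per file
    // when there is no .index file, block-aligned splits when there is,
    // so no task ever starts decompressing at an arbitrary offset.
    val lines = sc
      .newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat]("hdfs:///data/events/*.lzo")
      .map { case (_, text) => text.toString } // copy out of the reused Text object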

@rangadi yeah, this is the problem. The reader thinks the input is splittable and tries to read at an arbitrary offset, which yields an invalid stream; presumably that also produces the bogus "Compressed length ... exceeds max block size" errors, since arbitrary bytes get parsed as a block header. For small files that don't need to be split, the problem should in theory not happen.

I'm hitting the same problem. Has it been fixed in a later version? I am currently using hadoop-lzo-0.4.19.jar (it looks like it was published in 2011 or 2013, which is quite old).