Unexpected format when saving to S3 from EMR

Question

kyleiwaniec opened this issue 2 years ago · 3 comments

I am using spark-nlp to preprocess some text on AWS EMR. As long as I'm in the EMR environment everything works as expected. I have a pipeline that looks like this:

I then save the result to S3:

I can open the saved files in spark on EMR without issue:

So far so good!

The problem

The problem is that when I try to download the processed text from S3 and open it on my computer I get something that looks like HEX:

If I try to read it with spark on my local computer it also looks wrong:

I have tried to save it using UTF8 encoding and compression 'none', as well as CSV format. None of these options made any difference.
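One way to narrow this down is to look at the first few bytes of a downloaded file and compare them with well-known file signatures. The snippet below is only a minimal sketch using the Python standard library; the function name `sniff_format` and the short signature table are illustrative assumptions, not part of Spark, spark-nlp, or the AWS tooling. An "unknown" result only means the bytes match none of the listed signatures.

```python
# Hypothetical helper: guess a file's container format from its leading bytes.
# The signature table is intentionally small and illustrative.
SIGNATURES = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"PK\x03\x04": "zip",
    b"PAR1": "parquet",
}


def sniff_format(path, probe_len=8):
    """Return (guessed_format, hex_dump_of_first_bytes) for the file at path."""
    with open(path, "rb") as f:
        head = f.read(probe_len)
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name, head.hex(" ")
    # No known signature matched; the hex dump is returned for manual inspection.
    return "unknown", head.hex(" ")
```

Running it on one of the downloaded part files, for example `sniff_format("part-00000.txt")` (a placeholder file name), returns the guess together with a hex dump of the leading bytes, which can be compared against the same file read on EMR.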

I don't know if the issue is with Spark or S3, but for the life of me I cannot find anything on Google that explains what's happening.

I need this text as plain text to feed to another (non-spark) model downstream. Please help :)

Thank you in advance,
Kyle

Your Environment

EMR:

S3

Encryption is disabled
The bucket is public

Local environment:

Answer 1 · 2022-12-03T10:24:38.000Z

I tried saving it as Parquet as well, just to see if I could open it locally that way. Again, if I open the Parquet file in EMR there is no issue. However, when I try to open the Parquet file locally I get this error:

data/part-00000-a5034bfc-1413-40c3-be91-8ee84a714016-c000.snappy.parquet is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [-91, -65, -120, -38]
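For context, the expected values `[80, 65, 82, 49]` in that message are the ASCII bytes of `PAR1`, the 4-byte magic number that a Parquet file carries at both its start and its end. The check below is a minimal sketch, using only the standard library, of how the downloaded file could be tested for that marker without Spark; the function name `parquet_magic_check` is a hypothetical helper introduced here for illustration.

```python
import os

# list(PARQUET_MAGIC) == [80, 65, 82, 49], the values named in the error message.
PARQUET_MAGIC = b"PAR1"


def parquet_magic_check(path):
    """Return (head_ok, tail_ok): whether the file starts and ends with PAR1."""
    size = os.path.getsize(path)
    if size < 2 * len(PARQUET_MAGIC):
        # Too small to hold both the leading and trailing magic number.
        return False, False
    with open(path, "rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC, tail == PARQUET_MAGIC
```

If both values come back `False` for the local copy, the bytes on disk are not a Parquet file as written, which points at something changing the object between the write on EMR and the local download rather than at the reader.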

Answer 2 · 2022-12-03T18:21:29.000Z

I also tried Spark NLP version 3.4.4, with the same results.
I also tried EMR version 6.9.0, with the same results.

Answer 3 · 2022-12-04T19:20:59.000Z

This turned out to be an AWS configuration issue specific to my account, so I am closing this.