JohnSnowLabs/spark-nlp-workshop

Unexpected format when saving to S3 from EMR

kyleiwaniec opened this issue · 3 comments

I am using spark-nlp to preprocess some text on AWS EMR. As long as I'm in the EMR environment everything works as expected. I have a pipeline that looks like this:
image

I then save the result to S3:
image

I can open the saved files in spark on EMR without issue:
image

So far so good!

The problem

When I download the processed text from S3 and open it on my computer, I get something that looks like hex:
image

If I try to read it with spark on my local computer it also looks wrong:
image

I have tried saving it with UTF-8 encoding and with compression set to 'none', as well as in CSV format. None of these options made any difference.

I don't know if the issue is with Spark or S3, but for the life of me I cannot find anything on Google to figure out what's happening.
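One way to narrow this down is to look at the raw bytes of a downloaded file rather than opening it in an editor. The following is only a rough diagnostic sketch: `classify_bytes` is a hypothetical helper name, and it recognizes just a few signatures (gzip, Parquet, valid UTF-8), so "unknown binary" only means the content is none of those.

```python
def classify_bytes(data: bytes) -> str:
    """Guess what a downloaded file contains from its raw bytes.

    Hypothetical helper for illustration; it checks only a few well-known
    signatures and is not an exhaustive file-type detector.
    """
    if data[:2] == b"\x1f\x8b":      # gzip streams start with 0x1f 0x8b
        return "gzip"
    if data[:4] == b"PAR1":          # Parquet files start with ASCII "PAR1"
        return "parquet"
    try:
        data.decode("utf-8")         # plain text should decode cleanly
        return "utf-8 text"
    except UnicodeDecodeError:
        return "unknown binary"


def classify_file(path: str) -> str:
    """Read a whole local file and classify its contents."""
    with open(path, "rb") as f:
        return classify_bytes(f.read())
```

If a file that was written as plain text or CSV comes back as "unknown binary", the bytes were probably transformed somewhere between the Spark writer and the local download, not merely mis-decoded by the editor.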

I need this text as plain text to feed to another (non-spark) model downstream. Please help :)

Thank you in advance,
Kyle

Your Environment

EMR:

image

image

image

S3

  • Encryption is disabled
  • The bucket is public

Local environment:

image

I also tried saving it as Parquet, just to see if I could open it locally this way. Again, if I open the Parquet file in EMR there is no issue. However, when I try to open the Parquet file locally I get this error:

data/part-00000-a5034bfc-1413-40c3-be91-8ee84a714016-c000.snappy.parquet is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [-91, -65, -120, -38]
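For reference, the expected values in that error, [80, 65, 82, 49], are the ASCII codes for "PAR1", the marker that an unencrypted Parquet file carries at both its start and its end. The "found" values are printed as signed Java bytes. The sketch below converts them to unsigned bytes and checks a local file for the marker; `signed_to_bytes` and `has_parquet_magic` are hypothetical helper names used only for illustration.

```python
import os

PARQUET_MAGIC = b"PAR1"  # ASCII codes 80, 65, 82, 49


def signed_to_bytes(values):
    """Convert signed Java-style byte values (-128..127) to a bytes object."""
    return bytes(v & 0xFF for v in values)


def has_parquet_magic(path):
    """Return True if the file starts and ends with the Parquet magic number."""
    if os.path.getsize(path) < 2 * len(PARQUET_MAGIC):
        return False  # too small to hold both markers
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)  # jump to the last four bytes
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


print(signed_to_bytes([80, 65, 82, 49]))            # the expected tail
print(signed_to_bytes([-91, -65, -120, -38]).hex())  # the tail actually found
```

A tail that is not "PAR1" means the bytes on local disk are not the bytes Spark wrote, which points at something in the storage or download path and not at the Parquet reader.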

I also tried Spark NLP version 3.4.4, with the same results.
I also tried EMR version 6.9.0, with the same results.

This is an AWS config issue specific to my account. Hence closing.