dacort/faker-cli

Saving Parquet to S3 on M1 mac results in a segfault

dacort opened this issue · 5 comments

dacort commented

A simple fake command to output parquet data to S3 results in a segfault.

❯ fake -n 1000 pyint,user_name,date_this_year -c id,awesome_name,last_attention_at -f parquet -o s3://<BUCKET>/data/sample.parquet
zsh: segmentation fault  fake -n 1000 pyint,user_name,date_this_year -c  -f parquet -o 

Some more details on the stack trace when trying to write without being authenticated:

Traceback (most recent call last):
  File "/private/tmp/faker/.venv/bin/fake", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/private/tmp/faker/.venv/lib/python3.11/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/tmp/faker/.venv/lib/python3.11/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/private/tmp/faker/.venv/lib/python3.11/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/tmp/faker/.venv/lib/python3.11/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/tmp/faker/.venv/lib/python3.11/site-packages/faker_cli/cli.py", line 84, in main
    writer.close()
  File "/private/tmp/faker/.venv/lib/python3.11/site-packages/faker_cli/writer.py", line 59, in close
    pq.write_table(self.table, self.filename)
  File "/private/tmp/faker/.venv/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 3084, in write_table
    with ParquetWriter(
         ^^^^^^^^^^^^^^
  File "/private/tmp/faker/.venv/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 995, in __init__
    sink = self.file_handle = filesystem.open_output_stream(
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_fs.pyx", line 868, in pyarrow._fs.FileSystem.open_output_stream
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: When initiating multiple part upload for key 'data/sample.parquet' in bucket '<BUCKET>': AWS Error ACCESS_DENIED during CreateMultipartUpload operation: Anonymous users cannot initiate multipart uploads.  Please authenticate.
zsh: segmentation fault  fake -n 1000 pyint,user_name,date_this_year -c  -f parquet -o 
dacort commented

Note that the parquet file actually gets written...

dacort commented

And even just running a simple script in the REPL, then exiting, results in a segfault...

from faker import Faker
from faker_cli.writer import ParquetWriter

fake = Faker()
num_rows = 10
col_types = "pyint,user_name,date_this_year".split(",")
headers = "id,awesome_name,last_attention_at".split(",")
output = "s3://dcortesi-demo-code-us-west-2/data/sample.parquet"

writer = ParquetWriter(None, headers, output)
for _ in range(num_rows):
    row = [fake.format(ctype) for ctype in col_types]
    writer.write(row)

writer.close()

zsh: segmentation fault  python3

Replaced sys.stdout with None to see if that was the issue.

dacort commented

Upgrading to pyarrow==13.0.0 seems to resolve the issue.

Oddly enough, this also happens on my Intel Mac. When running from the source code in a poetry shell it works fine; when installed via pip it doesn't, even after upgrading pyarrow.

Fixed in #7