bmschmidt/quadfeather

Dictionary unification fails across multiple matches in large files

bmschmidt opened this issue · 3 comments

When reading from a CSV where dictionary types are inferred, successive batches seem to produce dictionaries that can't be unified when a later batch contains entries not present in the first batch (or something like that).

I thought this was addressed by tile.remap_all_dicts, but it is not.

Not yet reproduced, but the log trace is below. In this case it was fixable by increasing csv_batch_size to float("inf") or equivalent; that won't be possible for larger-than-memory data, though.

DEBUG:quadtiler:Opening overflow on (1, 0, 0)
INFO:quadtiler:Done inserting block 4 of 7
INFO:quadtiler:15 partially filled tiles buffered in memory and 2 flushing overflow directly to disk.
INFO:quadtiler:Inserting block 5 of 7
Traceback (most recent call last):
  File "/Users/ayelton/.virtualenvs/lc_etl-BGe5voTg/bin/quadfeather", line 8, in <module>
    sys.exit(main())
  File "/Users/ayelton/.virtualenvs/lc_etl-BGe5voTg/lib/python3.9/site-packages/quadfeather/tiler.py", line 264, in main
    tiler.insert(tab, remaining_tiles)
  File "/Users/ayelton/.virtualenvs/lc_etl-BGe5voTg/lib/python3.9/site-packages/quadfeather/tiler.py", line 608, in insert
    child_tile.insert(subset, tiles_allowed - tiles_allowed_overflow)
  File "/Users/ayelton/.virtualenvs/lc_etl-BGe5voTg/lib/python3.9/site-packages/quadfeather/tiler.py", line 612, in insert
    self.overflow_buffer.write_batch(
  File "pyarrow/ipc.pxi", line 408, in pyarrow.lib._CRecordBatchWriter.write_batch
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single dictionary for a given field across all batches.
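For reference, the failure mode can be triggered directly in pyarrow with a minimal sketch (this is not quadfeather's actual code path): the IPC file writer refuses any batch whose dictionary differs from the one recorded with the first batch.

import pyarrow as pa

# First batch: dictionary contains only "a" and "b".
batch1 = pa.record_batch(
    [pa.array(["a", "b", "a"]).dictionary_encode()], names=["cat"]
)
# Second batch: "c" appears, so the encoded dictionary differs.
batch2 = pa.record_batch(
    [pa.array(["a", "c"]).dictionary_encode()], names=["cat"]
)

with pa.ipc.new_file("repro.arrow", batch1.schema) as writer:
    writer.write_batch(batch1)
    # Raises ArrowInvalid: dictionary replacement detected.
    writer.write_batch(batch2)

The schemas of the two batches are identical (the dictionary *type* is the same), so the error only surfaces at write time when the second dictionary's contents turn out to differ.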

Hi @bmschmidt,

I just ran into what seems to be the exact same issue with my own CSV file. It's a 4.4GB CSV with three columns (x, y, and map_id, a numeric identifier) and ~94 million rows. Any thoughts on the best workaround for now?

Posting my error message below for posterity:

Traceback (most recent call last):
  File "/home/rainer/Software/miniconda3/bin/quadfeather", line 8, in <module>
    sys.exit(main())
  File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 286, in main
    tiler.insert_files(files = rewritten_files, schema = schema, recoders = recoders)
  File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 356, in insert_files
    self.insert_table(tab, tile_budget = self.args.max_files)
  File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 594, in insert_table
    child_tile.insert_table(subset, tile_budget = tiles_allowed - tiles_allowed_overflow)
  File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 594, in insert_table
    child_tile.insert_table(subset, tile_budget = tiles_allowed - tiles_allowed_overflow)
  File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 594, in insert_table
    child_tile.insert_table(subset, tile_budget = tiles_allowed - tiles_allowed_overflow)
  [Previous line repeated 2 more times]
  File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 598, in insert_table
    self.overflow_buffer.write_batch(
  File "pyarrow/ipc.pxi", line 503, in pyarrow.lib._CRecordBatchWriter.write_batch
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single non-delta dictionary for a given field across all batches.

FWIW: the conversion works if I use only half of the dataset.
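One possible workaround, sketched below in plain pyarrow (this is not something quadfeather does today), is to unify the dictionaries across all batches before writing, or to sidestep dictionary encoding entirely by casting the column to plain strings; either way every batch then shares a single dictionary (or none), which the IPC file format accepts. Table.unify_dictionaries assumes a reasonably recent pyarrow.

import pyarrow as pa

batch1 = pa.record_batch(
    [pa.array(["a", "b"]).dictionary_encode()], names=["cat"]
)
batch2 = pa.record_batch(
    [pa.array(["a", "c"]).dictionary_encode()], names=["cat"]
)

# Option 1: rewrite all chunks against one shared dictionary, then write.
table = pa.Table.from_batches([batch1, batch2]).unify_dictionaries()
with pa.ipc.new_file("ok.arrow", table.schema) as writer:
    for batch in table.to_batches():
        writer.write_batch(batch)

# Option 2: drop dictionary encoding by casting to plain strings.
plain = table.cast(pa.schema([pa.field("cat", pa.string())]))

Note the catch: option 1 needs all batches in hand at once, so it runs into the same larger-than-memory limit mentioned above. For a file this size, forcing the offending column to a plain string type when the CSV is read (rather than letting a dictionary type be inferred) may be the more practical route.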