Dictionary unification fails across multiple matches in large files
bmschmidt opened this issue · 3 comments
Reading from a CSV where dictionary types are inferred, multiple batches seem to produce dictionaries that can't be unified when a later batch contains entries not present in the first batch (or something like that).
I thought this was addressed by tile.remap_all_dicts, but it is not.
Not yet reproduced, but log trace below. In this case it is fixable by increasing csv_batch_size to float("inf") or equivalent, so the whole file is read as a single batch; that won't be possible for larger-than-memory data, though.
DEBUG:quadtiler:Opening overflow on (1, 0, 0)
INFO:quadtiler:Done inserting block 4 of 7
INFO:quadtiler:15 partially filled tiles buffered in memory and 2 flushing overflow directly to disk.
INFO:quadtiler:Inserting block 5 of 7
Traceback (most recent call last):
File "/Users/ayelton/.virtualenvs/lc_etl-BGe5voTg/bin/quadfeather", line 8, in <module>
sys.exit(main())
File "/Users/ayelton/.virtualenvs/lc_etl-BGe5voTg/lib/python3.9/site-packages/quadfeather/tiler.py", line 264, in main
tiler.insert(tab, remaining_tiles)
File "/Users/ayelton/.virtualenvs/lc_etl-BGe5voTg/lib/python3.9/site-packages/quadfeather/tiler.py", line 608, in insert
child_tile.insert(subset, tiles_allowed - tiles_allowed_overflow)
File "/Users/ayelton/.virtualenvs/lc_etl-BGe5voTg/lib/python3.9/site-packages/quadfeather/tiler.py", line 612, in insert
self.overflow_buffer.write_batch(
File "pyarrow/ipc.pxi", line 408, in pyarrow.lib._CRecordBatchWriter.write_batch
File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single dictionary for a given field across all batches.
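The constraint the error refers to can be demonstrated with pyarrow alone; a minimal sketch (hypothetical column names and data, not a confirmed reproduction of the quadfeather code path):

```python
import pyarrow as pa

# Two record batches whose dictionary-encoded column ends up with different
# dictionaries: the second batch introduces a value ("c") that was absent
# from the first batch's dictionary.
batch1 = pa.record_batch(
    [pa.array(["a", "b", "a"]).dictionary_encode()], names=["category"]
)
batch2 = pa.record_batch(
    [pa.array(["a", "c"]).dictionary_encode()], names=["category"]
)

sink = pa.BufferOutputStream()
with pa.ipc.new_file(sink, batch1.schema) as writer:
    writer.write_batch(batch1)
    # Raises pyarrow.lib.ArrowInvalid: Dictionary replacement detected ...
    writer.write_batch(batch2)
```

As the error message says, the IPC file format requires every batch to reference a single dictionary for a given field, so a later batch that introduces a new dictionary entry triggers exactly this ArrowInvalid.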
Hi @bmschmidt,
I just ran into what seems to be the exact same issue with my own CSV file. It's a 4.4 GB CSV with three columns (x, y, and map_id, a numeric identifier) and ~94 million rows. Any thoughts on the best workaround for now?
Posting my error message below for posterity:
Traceback (most recent call last):
File "/home/rainer/Software/miniconda3/bin/quadfeather", line 8, in <module>
sys.exit(main())
File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 286, in main
tiler.insert_files(files = rewritten_files, schema = schema, recoders = recoders)
File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 356, in insert_files
self.insert_table(tab, tile_budget = self.args.max_files)
File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 594, in insert_table
child_tile.insert_table(subset, tile_budget = tiles_allowed - tiles_allowed_overflow)
File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 594, in insert_table
child_tile.insert_table(subset, tile_budget = tiles_allowed - tiles_allowed_overflow)
File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 594, in insert_table
child_tile.insert_table(subset, tile_budget = tiles_allowed - tiles_allowed_overflow)
[Previous line repeated 2 more times]
File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 598, in insert_table
self.overflow_buffer.write_batch(
File "pyarrow/ipc.pxi", line 503, in pyarrow.lib._CRecordBatchWriter.write_batch
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single non-delta dictionary for a given field across all batches.
FWIW: the conversion works if I use only half of the dataset.
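For posterity, two possible pyarrow-level workaround sketches (assuming a recent pyarrow that provides Table.unify_dictionaries; the column name and data are hypothetical): unify the dictionaries across batches before writing when everything fits in memory, or drop dictionary encoding per batch when it doesn't.

```python
import pyarrow as pa

batches = [
    pa.record_batch([pa.array(["a", "b"]).dictionary_encode()], names=["category"]),
    pa.record_batch([pa.array(["a", "c"]).dictionary_encode()], names=["category"]),
]

# Option 1: unify dictionaries across all batches (requires them in memory),
# so every batch written to the IPC file shares one dictionary.
table = pa.Table.from_batches(batches).unify_dictionaries()
sink = pa.BufferOutputStream()
with pa.ipc.new_file(sink, table.schema) as writer:
    for batch in table.to_batches():
        writer.write_batch(batch)

# Option 2: cast the dictionary column to plain strings per batch; this
# streams fine for larger-than-memory data, at the cost of storing
# unencoded values on disk.
plain_schema = pa.schema([("category", pa.string())])
sink = pa.BufferOutputStream()
with pa.ipc.new_file(sink, plain_schema) as writer:
    for batch in batches:
        writer.write_batch(
            pa.record_batch([batch.column(0).cast(pa.string())], names=["category"])
        )
```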