bmschmidt/quadfeather

Dictionary unification fails across multiple matches in large files

bmschmidt opened this issue · 3 comments

When reading from a CSV where dictionary types are inferred, successive batches seem to produce dictionaries that can't be unified when a later batch contains entries not present in the first batch (or something like that).

I thought this was addressed by tile.remap_all_dicts, but it is not.

Not yet reproduced, but the log trace is below. In this case it was fixable by increasing csv_batch_size to float("inf") or equivalent; that won't be possible for larger-than-memory data, though.

DEBUG:quadtiler:Opening overflow on (1, 0, 0)
INFO:quadtiler:Done inserting block 4 of 7
INFO:quadtiler:15 partially filled tiles buffered in memory and 2 flushing overflow directly to disk.
INFO:quadtiler:Inserting block 5 of 7
Traceback (most recent call last):
  File "/Users/ayelton/.virtualenvs/lc_etl-BGe5voTg/bin/quadfeather", line 8, in <module>
    sys.exit(main())
  File "/Users/ayelton/.virtualenvs/lc_etl-BGe5voTg/lib/python3.9/site-packages/quadfeather/tiler.py", line 264, in main
    tiler.insert(tab, remaining_tiles)
  File "/Users/ayelton/.virtualenvs/lc_etl-BGe5voTg/lib/python3.9/site-packages/quadfeather/tiler.py", line 608, in insert
    child_tile.insert(subset, tiles_allowed - tiles_allowed_overflow)
  File "/Users/ayelton/.virtualenvs/lc_etl-BGe5voTg/lib/python3.9/site-packages/quadfeather/tiler.py", line 612, in insert
    self.overflow_buffer.write_batch(
  File "pyarrow/ipc.pxi", line 408, in pyarrow.lib._CRecordBatchWriter.write_batch
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single dictionary for a given field across all batches.
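For reference, the failure mode can be triggered directly in pyarrow with a minimal sketch (this is not quadfeather's actual code path): the IPC file writer refuses any batch whose dictionary differs from the one recorded with the first batch.

import pyarrow as pa

# First batch: dictionary contains only "a" and "b".
batch1 = pa.record_batch(
    [pa.array(["a", "b", "a"]).dictionary_encode()], names=["cat"]
)
# Second batch: "c" appears, so the encoded dictionary differs.
batch2 = pa.record_batch(
    [pa.array(["a", "c"]).dictionary_encode()], names=["cat"]
)

with pa.ipc.new_file("repro.arrow", batch1.schema) as writer:
    writer.write_batch(batch1)
    # Raises ArrowInvalid: dictionary replacement detected.
    writer.write_batch(batch2)

The schemas of the two batches are identical (the dictionary *type* is the same), so the error only surfaces at write time when the second dictionary's contents turn out to differ.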

Hi @bmschmidt,

I just ran into what seems to be the exact same issue with my own CSV file. It's a 4.4GB CSV with three columns (x, y, and map_id, a numeric identifier) and ~94 million rows. Any thoughts on the best workaround for now?

Posting my error message below for posterity:

Traceback (most recent call last):
  File "/home/rainer/Software/miniconda3/bin/quadfeather", line 8, in <module>
    sys.exit(main())
  File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 286, in main
    tiler.insert_files(files = rewritten_files, schema = schema, recoders = recoders)
  File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 356, in insert_files
    self.insert_table(tab, tile_budget = self.args.max_files)
  File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 594, in insert_table
    child_tile.insert_table(subset, tile_budget = tiles_allowed - tiles_allowed_overflow)
  File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 594, in insert_table
    child_tile.insert_table(subset, tile_budget = tiles_allowed - tiles_allowed_overflow)
  File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 594, in insert_table
    child_tile.insert_table(subset, tile_budget = tiles_allowed - tiles_allowed_overflow)
  [Previous line repeated 2 more times]
  File "/home/rainer/Software/miniconda3/lib/python3.9/site-packages/quadfeather/tiler.py", line 598, in insert_table
    self.overflow_buffer.write_batch(
  File "pyarrow/ipc.pxi", line 503, in pyarrow.lib._CRecordBatchWriter.write_batch
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single non-delta dictionary for a given field across all batches.

FWIW: the conversion works if I use only half of the dataset.
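One possible workaround, sketched below in plain pyarrow (this is not something quadfeather does today), is to unify the dictionaries across all batches before writing, or to sidestep dictionary encoding entirely by casting the column to plain strings; either way every batch then shares a single dictionary (or none), which the IPC file format accepts. Table.unify_dictionaries assumes a reasonably recent pyarrow.

import pyarrow as pa

batch1 = pa.record_batch(
    [pa.array(["a", "b"]).dictionary_encode()], names=["cat"]
)
batch2 = pa.record_batch(
    [pa.array(["a", "c"]).dictionary_encode()], names=["cat"]
)

# Option 1: rewrite all chunks against one shared dictionary, then write.
table = pa.Table.from_batches([batch1, batch2]).unify_dictionaries()
with pa.ipc.new_file("ok.arrow", table.schema) as writer:
    for batch in table.to_batches():
        writer.write_batch(batch)

# Option 2: drop dictionary encoding by casting to plain strings.
plain = table.cast(pa.schema([pa.field("cat", pa.string())]))

Note the catch: option 1 needs all batches in hand at once, so it runs into the same larger-than-memory limit mentioned above. For a file this size, forcing the offending column to a plain string type when the CSV is read (rather than letting a dictionary type be inferred) may be the more practical route.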