pipelinedb/pipeline_kafka

Rework Kafka -> COPY path

derekjn opened this issue · 2 comments

After pipelinedb/pipelinedb#1596 we can't rely on copy_iter_hook to pass messages into COPY. Some approaches we can use instead:

  • Write batches to temp file(s) and pass paths to COPY (simple but potentially slow)
  • Project rows and write directly to queues via ZMQ (fast but duplicates parsing/deserialization logic that COPY already performs)
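For reference, a minimal sketch of the first option. Python is used only for brevity, and `write_batch_to_tempfile`, `copy_statement`, and the `kafka_batch_` prefix are hypothetical names, not part of pipeline_kafka. It assumes each message is already one row in COPY text format and that the server process can read the temp file.

```python
import os
import tempfile
from typing import Iterable, Optional


def write_batch_to_tempfile(messages: Iterable[bytes],
                            directory: Optional[str] = None) -> str:
    """Write one message per line to a new temp file and return its path."""
    fd, path = tempfile.mkstemp(prefix="kafka_batch_", dir=directory)
    with os.fdopen(fd, "wb") as f:
        for msg in messages:
            # Normalize so every message ends with exactly one newline.
            f.write(msg.rstrip(b"\n") + b"\n")
    return path


def copy_statement(relation: str, path: str) -> str:
    """Build a server-side COPY statement that reads the temp file.

    Assumption: `relation` is a trusted identifier; only the path is quoted.
    """
    quoted_path = path.replace("'", "''")
    return f"COPY {relation} FROM '{quoted_path}'"
```

The caller would execute the returned statement and then delete the file, so each batch costs a full write plus a full read of the same data.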

The second approach is probably ideal, as COPY deserialization logic is fairly straightforward and not likely to ever change.
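To give a sense of how much logic would be duplicated, here is a rough sketch of parsing one line of COPY text format. It is not the actual COPY implementation: `parse_copy_text_line` is a hypothetical name, only the default tab delimiter and `\N` null marker are assumed, and octal and hex escapes are deliberately left out.

```python
from typing import List, Optional

# Single-character backslash escapes recognized in COPY text format.
_ESCAPES = {"b": "\b", "f": "\f", "n": "\n", "r": "\r",
            "t": "\t", "v": "\v", "\\": "\\"}


def parse_copy_text_line(line: str, delimiter: str = "\t",
                         null: str = "\\N") -> List[Optional[str]]:
    """Split one COPY text-format line into fields (None for NULL)."""
    line = line.rstrip("\n")
    fields: List[Optional[str]] = []
    raw: List[str] = []      # field as it appears on the wire (for the NULL check)
    decoded: List[str] = []  # field with escapes resolved

    def finish_field() -> None:
        is_null = "".join(raw) == null
        fields.append(None if is_null else "".join(decoded))
        raw.clear()
        decoded.clear()

    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            nxt = line[i + 1]
            raw.append(ch + nxt)
            # Unknown escapes stand for the escaped character itself.
            decoded.append(_ESCAPES.get(nxt, nxt))
            i += 2
            continue
        if ch == delimiter:
            finish_field()
        else:
            raw.append(ch)
            decoded.append(ch)
        i += 1
    finish_field()
    return fields
```

For example, `parse_copy_text_line("1\tfoo\\tbar\t\\N\n")` returns `["1", "foo\tbar", None]`. Converting each field to its column type would still have to happen afterwards, which this sketch does not cover.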

What about mmap'd files?

I don't think there's any guarantee that they'd be faster than regular disk-backed files. Unless I'm mistaken, mmap maps a file's contents into the process's address space, but pages are loaded on demand and the kernel can write them back or evict them, so there's no guarantee that all of the file's contents stay in memory.

That being said, if we go with the first option we'd want to use mmap. I just don't think it's fundamentally different performance-wise, especially since we'd just be doing sequential writes to the temp file.
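A small sketch of what the mmap variant of the first option might look like, again in Python for brevity. `write_batch_mmap` is a hypothetical helper. It only illustrates the mechanics (the file has to be sized up front, then written sequentially through the mapping) and makes no claim about relative performance.

```python
import mmap
import os
import tempfile
from typing import List, Optional


def write_batch_mmap(messages: List[bytes],
                     directory: Optional[str] = None) -> str:
    """Write newline-terminated messages to a temp file through mmap."""
    size = sum(len(m) + 1 for m in messages)  # +1 for each trailing newline
    fd, path = tempfile.mkstemp(prefix="kafka_batch_", dir=directory)
    try:
        if size == 0:
            return path  # an empty file cannot be mapped
        # The mapping has a fixed length, so grow the file before mapping it.
        os.ftruncate(fd, size)
        with mmap.mmap(fd, size) as mm:
            for m in messages:
                mm.write(m + b"\n")  # sequential write at the current offset
            mm.flush()  # ask the OS to write dirty pages back to the file
    finally:
        os.close(fd)
    return path
```

The data still ends up in an ordinary file that COPY reads by path, so the mapping mainly changes how the bytes get there rather than where they live.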