scylladb/seastar

`dma_write` silently drops `iovec`s exceeding `IOV_MAX`

sukhodolin opened this issue · 1 comments

The program at the bottom reproduces the problem: when the number of iovecs passed to the dma_write function exceeds IOV_MAX, the remaining entries are silently dropped.

The file is expected to contain 8 megabytes of the letter 'A'. On my machine, however, it contains only
4 megabytes of 'A', and the rest is zeroes. To confirm, run:

> head -c 4194304 ./output.file | tail -c -1
A
> head -c 4194305 ./output.file | tail -c -1
<no output>

The reason is that the writev documentation says
the number of iovecs must not exceed the IOV_MAX value (which is 1024 on Linux), while we clearly
pass more than IOV_MAX iovecs to the dma_write function (which I believe is implemented in
terms of writev).
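
For illustration, one way a caller could stay under the limit is to split the iovec list into batches before writing. This is only a sketch: `split_iovecs` is a hypothetical helper of mine, not a Seastar function, and the batch size is passed in by the caller rather than taken from the system.

```cpp
#include <sys/uio.h>

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Seastar): splits `iov` into consecutive
// batches, each holding at most `max_per_batch` entries.
std::vector<std::vector<iovec>> split_iovecs(const std::vector<iovec>& iov,
                                             std::size_t max_per_batch) {
    std::vector<std::vector<iovec>> batches;
    if (max_per_batch == 0) {
        return batches;  // nothing sensible to do with a zero-sized batch
    }
    for (std::size_t i = 0; i < iov.size(); i += max_per_batch) {
        const std::size_t end = std::min(iov.size(), i + max_per_batch);
        batches.emplace_back(iov.begin() + i, iov.begin() + end);
    }
    return batches;
}
```

Each batch would then need its own write call, with the file offset advanced by the number of bytes covered by the earlier batches.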

So Seastar's behavior is correct in the sense that it matches the documentation, and it is the
program's fault for violating the limit on iovec entries. But the problem is extra hard to
debug, so maybe it's worth going the extra mile and adding a check (perhaps only in debug builds of
Seastar if performance is a concern) that throws an exception here? That would make it much easier for
a developer to spot the error.
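
Roughly, the kind of check I have in mind looks like the sketch below. The names `iov_limit` and `check_iovec_count` are made up for this example and are not existing Seastar APIs; the limit is queried with `sysconf(_SC_IOV_MAX)`, and the fallback of 1024 assumes the Linux value mentioned above.

```cpp
#include <sys/uio.h>
#include <unistd.h>

#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: the per-call iovec limit reported by the system.
// Falls back to 1024 (the Linux value) if sysconf cannot report a limit.
inline std::size_t iov_limit() {
    const long v = ::sysconf(_SC_IOV_MAX);
    return v > 0 ? static_cast<std::size_t>(v) : 1024;
}

// Hypothetical check: throws instead of letting extra entries be dropped.
inline void check_iovec_count(const std::vector<iovec>& iov) {
    if (iov.size() > iov_limit()) {
        throw std::invalid_argument(
            "iovec count " + std::to_string(iov.size()) +
            " exceeds IOV_MAX (" + std::to_string(iov_limit()) + ")");
    }
}
```

With something like this called before the write, the reproduction program would fail right away with a clear message instead of producing a half-written file.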

The source code for the reproduction of the issue:

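#include <sys/uio.h>

#include <vector>

#include <seastar/core/aligned_buffer.hh>
#include <seastar/core/app-template.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/file.hh>
#include <seastar/core/reactor.hh>
#include <seastar/core/seastar.hh>
#include <seastar/util/log.hh>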

seastar::logger lg("iov-max");

const char *OUTPUT_FILE_NAME = "output.file";

int main(int argc, char **argv) {
  seastar::app_template app;

  return app.run(argc, argv, [&]() -> seastar::future<int> {
    auto output_file = co_await seastar::open_file_dma(
        OUTPUT_FILE_NAME,
        seastar::open_flags::wo | seastar::open_flags::create);

    // We're going to have 2048 blocks of 4096 bytes each.
    constexpr size_t BUFFER_BLOCKS = 2048;
    constexpr size_t BUFFER_BLOCK_SIZE = 4096;
    constexpr size_t BUFFER_SIZE = BUFFER_BLOCKS * BUFFER_BLOCK_SIZE;

    co_await output_file.truncate(BUFFER_SIZE);
    co_await output_file.allocate(0, BUFFER_SIZE);

    auto blocks = seastar::allocate_aligned_buffer<char>(
        BUFFER_SIZE, output_file.memory_dma_alignment());
    for (size_t i = 0; i < BUFFER_SIZE; ++i) {
      blocks[i] = 'A';
    }

    std::vector<iovec> iovecs;
    for (size_t i = 0; i < BUFFER_BLOCKS; ++i) {
      const char *current_block = blocks.get() + BUFFER_BLOCK_SIZE * i;
      iovecs.emplace_back(iovec{(void *)current_block, BUFFER_BLOCK_SIZE});
    }

    lg.info("Writing {} iovecs", iovecs.size());

    co_await output_file.dma_write(0, iovecs);
    co_await output_file.flush();

    co_return 0;
  });
}

Did you check the return value of dma_write()? If it returned the amount of data actually written, then it didn't silently drop anything.
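
A minimal sketch of such a check, assuming the write reports the number of bytes written as a `size_t`: add up the lengths of all iovecs and compare. `total_iov_bytes` and `is_short_write` are hypothetical names used only for this example.

```cpp
#include <sys/uio.h>

#include <cstddef>
#include <vector>

// Hypothetical helper: total number of bytes described by the iovec list.
std::size_t total_iov_bytes(const std::vector<iovec>& iov) {
    std::size_t total = 0;
    for (const iovec& v : iov) {
        total += v.iov_len;
    }
    return total;
}

// Hypothetical helper: true if fewer bytes were written than requested.
bool is_short_write(std::size_t written, const std::vector<iovec>& iov) {
    return written < total_iov_bytes(iov);
}
```

In the reproduction above, 2048 iovecs of 4096 bytes add up to 8388608 bytes, so a reported count of 4194304 would show up as a short write.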