apache/datasketches-cpp

merge_sorted_blocks_direct

Closed this issue · 4 comments

While testing kll sketch, I found that VS complains about an invalid iterator in method merge_sorted_blocks_direct.

The main problem is:

  const auto chunk_begin = temp.begin() + temp.size(); // this is the same as temp.end(), or at least, IS_TRUE(temp.end() == chunk_begin)
  merge_sorted_blocks_reversed(orig, temp, levels, starting_level_1, num_levels_1); // these lines add elements without reallocation
  merge_sorted_blocks_reversed(orig, temp, levels, starting_level_2, num_levels_2);
  const uint32_t num_items_1 = levels[starting_level_1 + num_levels_1] - levels[starting_level_1];
  std::merge( // This line makes VS in Debug crash
    std::make_move_iterator(chunk_begin), std::make_move_iterator(chunk_begin + num_items_1),
    std::make_move_iterator(chunk_begin + num_items_1), std::make_move_iterator(temp.end()),
    orig.begin() + levels[starting_level], compare_pair_by_first<C>()
  );
  temp.erase(chunk_begin, temp.end());

I don't have a quote from the standard, but I think VS is right (and that the code in the function is UB): there is no guarantee that temp.end() is actually a pointer to the element following the last. Also, I found an equivalent question on stackoverflow that seems to agree to my interpretation.

So while in practice it doesn't make any difference as vector iterators are implemented through pointers (in release mode with VS and in any mode with gcc 9.3, including with -D_GLIBCXX_DEBUG), I think it could be improved.

A possible (while probably not really elegant) work-around could be:

diff --git a/kll/include/kll_quantile_calculator_impl.hpp b/kll/include/kll_quantile_calculator_impl.hpp
index 23efa4d..ff6e547 100644
--- a/kll/include/kll_quantile_calculator_impl.hpp
+++ b/kll/include/kll_quantile_calculator_impl.hpp
@@ -129,10 +129,11 @@ void kll_quantile_calculator<T, C, A>::merge_sorted_blocks_direct(Container& ori
   const uint8_t num_levels_2 = num_levels - num_levels_1;
   const uint8_t starting_level_1 = starting_level;
   const uint8_t starting_level_2 = starting_level + num_levels_1;
-  const auto chunk_begin = temp.begin() + temp.size();
+  const auto initial_size = temp.size();
   merge_sorted_blocks_reversed(orig, temp, levels, starting_level_1, num_levels_1);
   merge_sorted_blocks_reversed(orig, temp, levels, starting_level_2, num_levels_2);
   const uint32_t num_items_1 = levels[starting_level_1 + num_levels_1] - levels[starting_level_1];
+  const auto chunk_begin = temp.begin() + initial_size;
   std::merge(
     std::make_move_iterator(chunk_begin), std::make_move_iterator(chunk_begin + num_items_1),
     std::make_move_iterator(chunk_begin + num_items_1), std::make_move_iterator(temp.end()),

Please, correct me if I'm wrong about this.

Yes, I believe you are right. The chunk_begin iterator must be invalidated when container is modified. And yes, it does not matter if iterators are pointers. Apparently in debug mode this might not be the case, and can lead to undefined behavior. And yes, I believe your proposed fix is fine. We have to instantiate this iterator after modifying the container.

I think this is a simple enough fix, but I can make a PR if you need.

If you wish. I will get to this in a day or two.