StanfordLegion/legion

Large Volume Overflow

mpokorny opened this issue · 15 comments

The following program:

#include <cassert>
#include <legion.h>
#include <limits>
#include <vector>

Legion::coord_t min;
Legion::coord_t max;

void
main_task(
  const Legion::Task* task,
  const std::vector<Legion::PhysicalRegion>& /*regions*/,
  Legion::Context ctx,
  Legion::Runtime* rt) {

  std::vector<Legion::Rect<3>> rects{
    Legion::Rect<3>{{12, min, min}, {12, max, max}},
    Legion::Rect<3>{{14, min, min}, {14, max, max}}};
  auto is = rt->create_index_space(ctx, rects);
  Legion::Domain dom = rt->get_index_space_domain(is);
  assert(!dom.contains(Legion::Point<3>{13, 0, 0}));
}

int
main(int argc, char* argv[]) {
  assert(argc >= 3);
  min = std::strtoll(argv[1], nullptr, 0);
  max = std::strtoll(argv[2], nullptr, 0);
  Legion::TaskVariantRegistrar registrar(0, "DomainTest");
  registrar.add_constraint(
    Legion::ProcessorConstraint(Legion::Processor::LOC_PROC));
  Legion::Runtime::preregister_task_variant<main_task>(registrar, "DomainTest");

  Legion::Runtime::set_top_level_task_id(0);
  return Legion::Runtime::start(argc, argv);
}

Succeeds or fails depending on the values of min and max (well within the limits of the index coordinate type) provided on the command line. For example

$ test/DomainTest -1000000000 1000000000
$ test/DomainTest -10000000000 10000000000
...DomainTest.cpp:19: void main_task(const Legion::Task*, const std::vector<Legion::PhysicalRegion>&, Legion::Context, Legion::Runtime*): Assertion `!dom.contains(Legion::Point<3>{13, 0, 0})' failed.
Aborted (core dumped)

This is actually a Realm bug. Here's an even smaller Realm-only reproducer:

#include <cassert>
#include <realm.h>
#include <vector>

using namespace Realm;

typedef long long coord_t;

int main(int argc, char **argv) {

  Runtime rt;
  rt.init(&argc, &argv);
  
  std::vector<Rect<3,coord_t> > rects{
    Rect<3,coord_t>{{12, -10000000000, -10000000000}, {12, 10000000000, 10000000000}},
    Rect<3,coord_t>{{14, -10000000000, -10000000000}, {14, 10000000000, 10000000000}}};

  IndexSpace<3,coord_t> is(rects);
  IndexSpace<3,coord_t> tight = is.tighten();

  Point<3,coord_t> point{13, 0, 0};
  assert(!is.contains(point));
  assert(!tight.contains(point));

  rt.shutdown();
  return rt.wait_for_shutdown();
}

Something is going wrong in the tighten function call of the index space because the first assertion in this program succeeds, but the second assertion will fail. The tightening operation shouldn't change the set of points in the index space so both assertions should succeed. If I look at the two index spaces in the debugger, the first one correctly has a sparsity map, but the second one does not.

(gdb) p is
$1 = {
  bounds = {
    lo = {
      x = 12,
      y = -10000000000,
      z = -10000000000
    },
    hi = {
      x = 14,
      y = 10000000000,
      z = 10000000000
    }
  },
  sparsity = {
    id = 3458764513820540929
  }
}
(gdb) p tight
$2 = {
  bounds = {
    lo = {
      x = 12,
      y = -10000000000,
      z = -10000000000
    },
    hi = {
      x = 14,
      y = 10000000000,
      z = 10000000000
    }
  },
  sparsity = {
    id = 0
  }
}

It shouldn't be possible for Realm to get rid of the sparsity map here as the two rectangles are not touching in space, so clearly the tighten call has a bug in it.

@eddy16112 @artempriakhin @muraj: Can one of you guys take a look at this failure?

FYI, although this is clearly a bug, there's no urgency to fix this issue from my immediate perspective. I identified a reasonable workaround in my application where the failure occurred originally while creating the reproducer I provided here.

I identified the issue in Realm. It's actually the code that computes the volume of a rectangle that is at "fault" here:

https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/point.inl?ref_type=heads#L498-520

The volume of this rectangle is larger than what can be represented by a 64-bit unsigned integer (size_t) so we get an unsigned integer overflow which is defined, but does cause the tighten logic to fail when it gets bad rectangle volumes.

Okay, so I meant to write new unit tests for IndexSpace/Rect that includes tighten and volume anyway, so am I going to do that and part of the PR will be having test that reproduces and fixes the bug.

I suspect this will actually go deeper than just fixing volume (conceptually there's nothing wrong with the logic in tighten, it just doesn't pick a big enough integral type to represent the volume of the rectangles). There are lots of places in Realm and Legion that we compute the volume of rectangles/index spaces and right now we assume those volumes can be stored in a size_t (usually a 64-bit unsigned integer). However, especially in higher dimensions, those volumes can get much bigger than size_t, e.g. the max volume of a rectangle expressed mathematically in terms of size_t (for long long coordinate types) can be:

DIM 1: std::numeric_limits<size_t>+1
DIM 2: (std::numeric_limits<size_t>+1)^2
DIM 3: (std::numeric_limits<size_t>+1)^3

So clearly we need a bigger type for representing those potential volumes, and that has to be true both inside the volume method and on the caller side as well. Unfortunately c++ doesn't provide a built in type large enough for that...

Perhaps for now just detecting the overflow and issuing an error will be enough to not give silently bad results.

Here's a diff that would catch two different kinds of overflow in the volume computation:

diff --git a/runtime/realm/point.inl b/runtime/realm/point.inl
index 1ce9defef..3df98967f 100644
--- a/runtime/realm/point.inl
+++ b/runtime/realm/point.inl
@@ -504,8 +504,14 @@ namespace Realm {
       else {
         // have to convert both 'hi' and 'lo' to size_t before subtracting
         //  to avoid potential signed integer overflow
-       v *= (static_cast<size_t>(hi[i]) -
-              static_cast<size_t>(lo[i]) + 1);
+        size_t range = static_cast<size_t>(hi[i]) - static_cast<size_t>(lo[i]);
+        // Watch for overflow on the range since we have to increment by one
+        assert(range < SIZE_MAX);
+        range++;
+        size_t vnew = v * range;
+        // Watch for overflow on the multiplication
+        assert((vnew/range) == v);
+        v = vnew;
       }
     return v;
   }

@lightsighter do you want to send out the patch? I think for now it's okay to acknowledge and check-fail in volume(). We can discuss on how to actually do a proper fix as we go..with a custom type to hold the sizes...etc.

There are probably multiple places where this need attention

unless this is actually blocking @mpokorny now in some ways?

@apryakhin, this issue isn't blocking me, I have a very reasonable workaround in my application code in place, and I don't foresee having any problems with it for a while.

@mpokorny Do you expect to have rectangles like this often that have volumes that are bigger than 2^64?

No. Initially I chose a volume that large to be as conservative as possible in the construction of that index space, which ultimately gets intersected with another index space. But in my application the bounds on the latter index space are known to be significantly smaller -- even smaller than the range I used in the passing test case above -- so in practice a volume of 2^64 is more than sufficient.

That's actually very helpful to hear. I was pretty worried there for a moment because there's going to be a lot of code in both Legion and Realm that will need to be updated to deal with this. 😅 I think what we'll probably do is put in some detection code for now to detect these overflows as we gradually move to handling the large volume cases. For now keeping your rects under 2^64 total points should be a good rule of thumb to avoiding these issues, but thanks for finding it and reporting it.

As discussed offline we are going to work on initial planning for a long-term fix to handle large volume sizes..across realm. This won't be P0 though at the moment. The immediate action would be to add overflow-checkers across the code base.

Adding myself also since Legion also needs to check for large volume overflow.

The draft is out:

Once I fix CI etc...will send it out for the full review