tc39/proposal-math-sum

Deterministic sum

Opened this issue · 12 comments

I propose making the sum be the deterministic mathematical sum of the inputs, rounded to the nearest double, breaking ties to even in the typical IEEE double fashion. The special cases are as for regular addition:

  • If any of the addends are NaN, the result is NaN.
  • Otherwise, if +∞ and -∞ both appear in the addends, the result is NaN.
  • Otherwise, if +∞ appears in the addends, the result is +∞.
  • Otherwise, if -∞ appears in the addends, the result is -∞.
  • Otherwise, if all of the addends are -0 (including the case where there are no addends), the result is -0.
  • Otherwise all the addends are finite. The result is the exact mathematical sum of the mathematical values of the addends, converted to a Number (note that the final conversion might overflow to ±∞).
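The special-case rules above can be sketched in JS as follows. This is a hedged illustration, not the proposal's implementation; `sumFinite` is a hypothetical stand-in for the exact-sum algorithm on finite addends.

```javascript
// Sketch of the special-case rules; `sumFinite` is a hypothetical
// stand-in for the exact summation of all-finite addends.
function sumSpecialCases(addends, sumFinite) {
  let sawPosInf = false, sawNegInf = false, allNegZero = true;
  for (const a of addends) {
    if (Number.isNaN(a)) return NaN;          // any NaN => NaN
    if (a === Infinity) sawPosInf = true;
    else if (a === -Infinity) sawNegInf = true;
    if (!Object.is(a, -0)) allNegZero = false;
  }
  if (sawPosInf && sawNegInf) return NaN;     // +inf and -inf => NaN
  if (sawPosInf) return Infinity;
  if (sawNegInf) return -Infinity;
  if (allNegZero) return -0;                  // includes no addends at all
  return sumFinite(addends);                  // may overflow to ±inf
}
```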

This can be implemented using the exact intermediate sum technique, which takes an arbitrary number of finite IEEE double addends a0, a1, …, an-1 and, barring intermediate or final overflows, represents their exact mathematical sum a0 + a1 + … + an-1 as s = p0 + p1 + … + pk-1, where the p's are IEEE doubles that all have the same sign, are in descending magnitude order, are non-overlapping (meaning that each successive p's exponent is at least 53 less than the previous one), and are nonzero (except possibly p0).

  1. If there are no addends, the result is -0.
  2. If there is one addend a0, the result is a0.
  3. If there are two addends a0 and a1, the result is a0 + a1 using ordinary IEEE double arithmetic.
  4. If any of the addends are non-finite (±∞ or NaN), ignore all of the finite addends, and return the sum of the non-finite addends using ordinary IEEE double arithmetic done in any order.
  5. If all of the addends are -0, return -0 (depending on the implementation of exact intermediate sums, this case might fall out of the implementation of step 6 without needing to explicitly check for it.)
  6. Initialize an exact intermediate sum t = «p0, p1, …, pk-1» to the empty sequence t = «». Successively add addends into t using the exact intermediate sum technique above. To avoid false intermediate overflows, when picking the next addend to add into t, prefer an addend with the opposite sign from t whenever one exists. This way any intermediate overflow that happens is a true overflow, and the mathematical sum cannot be cancelled by later addends back into the finite IEEE double range.
  7. Round p0 + p1 + … + pk-1 to an IEEE double by adding them using double arithmetic and return that result, taking special care to adjust the lsb in the round-to-even case where the mathematical result lies exactly in between two representable doubles (see Python's math_fsum code for an explanation of this).

This is the simplest approach. The algorithm is asymptotically linear in the number of addends n and efficient, typically having k no more than 2.
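As a rough illustration of the partials technique (hypothetical code, not the proposal's implementation), here is a JS sketch in the style of Python's msum recipe. It assumes finite addends, keeps the partials in ascending magnitude order as Python does, and omits both overflow handling and step 7's round-half-even adjustment, so it is not fully correct on the hardest cases.

```javascript
// Sketch of exact summation via nonoverlapping partials, following
// the shape of Python's msum recipe. Assumes finite addends; overflow
// handling and the round-half-even tie adjustment are omitted.
function fsumSketch(addends) {
  let partials = []; // ascending in magnitude, nonoverlapping
  for (let x of addends) {
    let i = 0;
    for (let y of partials) {
      // fast two-sum of x and y; the swap ensures |x| >= |y|
      if (Math.abs(x) < Math.abs(y)) [x, y] = [y, x];
      const hi = x + y;
      const lo = y - (hi - x); // exact rounding error of hi
      if (lo !== 0) partials[i++] = lo;
      x = hi;
    }
    partials.length = i; // drop partials consumed into x
    partials.push(x);
  }
  // naive final rounding; can be off by one ulp in rare tie cases
  return partials.reduce((a, b) => a + b, 0);
}
```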

The steps can be folded into a mostly or fully single-pass algorithm. For example:

  • Step 4 can be done by checking addends as they arrive and abandoning any finite addend sum the first time a non-finite addend is seen.
  • Step 6 scans for addends of the opposite sign from the running total. This can be done lazily by only doing this scan if an intermediate overflow actually happens.

If one wishes to do a fully single-pass algorithm, one can avoid the scan for addends of the opposite sign from the running total by scaling down the most significant partial p0, and the values added into it, by 52 powers of 2 if it would overflow, while leaving the other partials p1, …, pk-1 unscaled. This is a bit tedious, requiring scaling when working across the first and second partials, but can be done efficiently.

If the partial values «p0, p1, …, pk-1» can differ in sign, then it's possible to get obscure cases where the mathematical value about to be stored into p0 is on a knife edge that rounds to ±∞, while p1 has the opposite sign, which should instead cause the rounding to prefer the largest-magnitude finite value. This would be a problem for the pick-next-addend-with-opposite-sign approach, but is handled by the single-pass scaling-down-p0 approach.

Here is an implementation using the single-pass scaling approach. This was indeed a bit tedious, but I think it's correct - I also wrote code to convert doubles to/from BigInts (scaled up by 2**1023), in which domain we can do addition/subtraction precisely, and used this to fuzz my implementation. This caught a few nasty edge cases. It's possible I might still have missed a case somewhere.

Note that, following Python, the partials are in ascending order of magnitude, so the "overflow" partial is conceptually last in this implementation, not first.

In the original issue for adding fsum to python, there is attached a version based on a different approach:

In brief, the accumulated sum-so-far is represented in the form

huge_integer * 2**(smallest_possible_exponent)

and the huge_integer is stored in base 2**30, with a signed-digit representation (digits in the range [-2**29, 2**29)).

Because this approach doesn't use 2Sum, it doesn't have issues with overflow.
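The signed-digit idea can be illustrated on its own. This is a hedged JS/BigInt sketch (not extracted from the patch below) of converting an integer to and from signed digits in base 2**30, with every digit in [-2**29, 2**29).

```javascript
// Signed-digit base-2**30 representation (illustrative only): every
// integer has a unique expansion
//   a0 + a1*2**30 + a2*2**60 + ...  with each ai in [-2**29, 2**29).
const BASE = 1n << 30n;
const HALF = 1n << 29n;

function toSignedDigits(n) {
  const digits = [];
  while (n !== 0n) {
    let d = n % BASE;              // remainder in (-BASE, BASE)
    if (d >= HALF) d -= BASE;      // fold into [-HALF, HALF)
    else if (d < -HALF) d += BASE;
    digits.push(d);
    n = (n - d) / BASE;            // exact: n - d is divisible by BASE
  }
  return digits;                   // least-significant digit first
}

function fromSignedDigits(digits) {
  let n = 0n;
  for (let i = digits.length - 1; i >= 0; i--) n = n * BASE + digits[i];
  return n;
}
```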

I've extracted it from the diff in the thread. I'm probably not going to bother implementing this in JS as well, but it does prove the possibility.

lsum.c
/* Full precision summation of a sequence of floats.  Based on the
   function lsum by Raymond Hettinger at
    <http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/393090>,
   and following implementation suggestions from Tim Peters.
 
   Method: maintain the sum so far in the form
 
     huge_integer * 2**FSUM_MIN_EXP
 
   where FSUM_MIN_EXP is such that every finite double can be
   expressed as an integral multiple of 2**FSUM_MIN_EXP.  huge_integer
   is stored in base FSUM_BASE.  Following a suggestion from Tim
   Peters, we use a signed-digit representation: the digits for the
   base FSUM_BASE representation all lie in the range [-FSUM_BASE/2,
   FSUM_BASE/2).  It's not hard to show that every integer (positive
   or negative), can be expressed uniquely in the form
 
     a0 + a1*FSUM_BASE + a2*FSUM_BASE**2 + ...
 
   This choice of representation makes it easy to deal with both
   positive and negative summands.

   Note: The signature of math.fsum() differs from __builtin__.sum()
    because the start argument doesn't make sense in the context of
   accurate summation.  sum(seq2, start=sum(seq1)) may not equal the
    accurate result returned by sum(itertools.chain(seq1, seq2)).
 */

#define FSUM_MIN_EXP (DBL_MIN_EXP - DBL_MANT_DIG)
#define FSUM_SIZE 30
#define FSUM_BASE ((long) 1 << FSUM_SIZE)
/* allow at least 60 extra bits at the top end, to cope with intermediate
   overflow.  On an IEEE 754 machine, FSUM_ACC_SIZE is 72. */
#define FSUM_ACC_SIZE ((DBL_MAX_EXP - FSUM_MIN_EXP + 60) / FSUM_SIZE + 1)

/* _fsum_twopower[i] is 2.0**(i+1) */
static
const double _fsum_twopower[FSUM_SIZE] = {
  2.0,
  4.0,
  8.0,
  16.0,
  32.0,
  64.0,
  128.0,
  256.0,
  512.0,
  1024.0,
  2048.0,
  4096.0,
  8192.0,
  16384.0,
  32768.0,
  65536.0,
  131072.0,
  262144.0,
  524288.0,
  1048576.0,
  2097152.0,
  4194304.0,
  8388608.0,
  16777216.0,
  33554432.0,
  67108864.0,
  134217728.0,
  268435456.0,
  536870912.0,
  1073741824.0,
};

static PyObject * math_fsum(PyObject * self, PyObject * seq) {
  PyObject * item, * iter;
  double x, m, inf_sum = 0.0, special_sum = 0.0;
  int e, q, q_top, i, round_up, ndigits, top_exp;
  long digit, half_eps, acc[FSUM_ACC_SIZE], carry, sign;

  iter = PyObject_GetIter(seq);
  if (iter == NULL)
    return NULL;

  /* initialize accumulator */
  for (i = 0; i < FSUM_ACC_SIZE; i++)
    acc[i] = 0;

  /********************************************
   * Stage 1: accumulate values from iterable *
   ********************************************/

  for (;;) {
    item = PyIter_Next(iter);
    if (item == NULL) {
      Py_DECREF(iter);
      if (PyErr_Occurred())
        return NULL;
      break;
    }
    x = PyFloat_AsDouble(item);
    Py_DECREF(item);
    if (PyErr_Occurred()) {
      Py_DECREF(iter);
      return NULL;
    }

    /* accumulate specials in special_sum, infs in inf_sum */
    if (!Py_IS_FINITE(x)) {
      if (Py_IS_INFINITY(x))
        inf_sum += x;
      special_sum += x;
      continue;
    }

    m = frexp(x, & e);
    e -= (FSUM_MIN_EXP + 1);

    /* add digits of m into accumulator */
    m *= _fsum_twopower[e % FSUM_SIZE];
    q_top = q = e / FSUM_SIZE;
    for (;;) {
      digit = (long) m;
      m -= digit;
      acc[q] += digit;
      if (m == 0.0)
        break;
      m *= (double) FSUM_BASE;
      q--;
    }

    /* normalize accumulator */
    for (;;) {
      if (acc[q] < -FSUM_BASE / 2) {
        acc[q] += FSUM_BASE;
        acc[++q]--;
      } else if (acc[q] >= FSUM_BASE / 2) {
        acc[q] -= FSUM_BASE;
        acc[++q]++;
      } else if (++q > q_top)
        break;
    }
  }

  /***************************************************
   * Stage 2: compute correctly rounded return value *
   ***************************************************/

  /* deal with any special values that occurred.  See note above. */
  if (special_sum != 0.0) {
    if (Py_IS_NAN(inf_sum)) {
      PyErr_SetString(PyExc_ValueError,
        "-inf + inf in fsum");
      return NULL;
    }
    return PyFloat_FromDouble(special_sum);
  }

  /* strip zero limbs from top */
  ndigits = FSUM_ACC_SIZE;
  while (ndigits > 0 && acc[ndigits - 1] == 0)
    ndigits--;
  if (ndigits == 0)
    return PyFloat_FromDouble(0.0);

  /* sign of result == sign of topmost nonzero limb */
  sign = acc[ndigits - 1] > 0 ? 1 : -1;

  /* take absolute value of accumulator, renormalizing digits
    to lie in the range [0, FSUM_BASE) */
  carry = 0;
  for (i = 0; i < ndigits; i++) {
    /* FSUM_BASE/2-1 <= acc[i]*sign + carry <= FSUM_BASE/2 */
    digit = acc[i] * sign + carry;
    if (digit < 0) {
      acc[i] = digit + FSUM_BASE;
      carry = -1;
    } else {
      acc[i] = digit;
      carry = 0;
    }
  }
  assert(carry == 0);
  /* it's possible that the normalization leads to a zero top digit */
  if (acc[ndigits - 1] == 0)
    ndigits--;
  assert(ndigits > 0 && acc[ndigits - 1] != 0);

  /* Round acc * 2**FSUM_MIN_EXP to the nearest double. */

  /* 2**(top_exp-1) <= |sum| < 2**top_exp */
  top_exp = FSUM_MIN_EXP + FSUM_SIZE * (ndigits - 1);
  for (digit = acc[ndigits - 1]; digit != 0; digit /= 2)
    top_exp++;
  /* catch almost all overflows here */
  if (top_exp > DBL_MAX_EXP)
    goto overflow_error;

  m = 0.0;
  if (top_exp <= DBL_MIN_EXP) {
    /* no need for rounding */
    for (i = ndigits - 1; i >= 0; i--)
      m = FSUM_BASE * m + acc[i];
    return PyFloat_FromDouble((double) sign *
      ldexp(m, FSUM_MIN_EXP));
  }

  /* round: e is the position of the first bit to be rounded away. */
  e = top_exp - DBL_MIN_EXP - 1;
  half_eps = (long) 1 << (e % FSUM_SIZE);
  q = e / FSUM_SIZE;
  for (i = ndigits - 1; i > q; i--)
    m = FSUM_BASE * m + acc[i];
  digit = acc[q];
  m = FSUM_BASE * m + (digit & (FSUM_BASE - 2 * half_eps));
  if ((digit & half_eps) != 0) {
    round_up = 0;
    if ((digit & (3 * half_eps - 1)) != 0 ||
      (half_eps == FSUM_BASE / 2 && (acc[q + 1] & 1) != 0))
      round_up = 1;
    else
      for (i = q - 1; i >= 0; i--)
        if (acc[i] != 0) {
          round_up = 1;
          break;
        }
    if (round_up == 1) {
      m += 2 * half_eps;
      if (top_exp == DBL_MAX_EXP &&
        m == ldexp((double)(2 * half_eps), DBL_MANT_DIG))
        /* overflow corner case: result before
           rounding is within range, but rounded
           result overflows. */
        goto overflow_error;
    }
  }
  return PyFloat_FromDouble((double) sign *
    ldexp(m, FSUM_MIN_EXP + FSUM_SIZE * q));

  overflow_error:
    PyErr_SetString(PyExc_OverflowError,
      "fsum result too large to represent as a float");
  return NULL;
}

I also wrote code to convert doubles to/from BigInts (scaled up by 2**1023)

This actually scales up the double values by 2**1075, which is one factor of two too many — every resulting BigInt is even.
The smallest representable nonzero double is 2**-1074, so it would be less confusing to scale up the values by 2**1074 unless for some reason you need every BigInt to be even.
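For reference, here is a hedged sketch (not the code under discussion) of the 2**1074 scaling: it maps a finite double x to the BigInt x * 2**1074, so the smallest subnormal maps to 1n and odd results are possible.

```javascript
// Convert a finite double to the BigInt equal to value * 2**1074,
// i.e. the smallest positive subnormal (2**-1074) maps to 1n.
function doubleToScaledBigInt(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x); // big-endian; matched by getBigUint64 below
  const bits = view.getBigUint64(0);
  const sign = bits >> 63n ? -1n : 1n;
  const biasedExp = (bits >> 52n) & 0x7ffn;
  const fraction = bits & ((1n << 52n) - 1n);
  if (biasedExp === 0x7ffn) throw new RangeError('not finite');
  // Normals carry an implicit leading bit; subnormals share exponent 1.
  const significand = biasedExp === 0n ? fraction : fraction | (1n << 52n);
  const exp = biasedExp === 0n ? 1n : biasedExp;
  // value = sign * significand * 2**(exp - 1075), so
  // value * 2**1074 = sign * significand * 2**(exp - 1).
  return sign * significand * (1n << (exp - 1n));
}
```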

I also don't understand this snippet:

  let binary = magnitude.toString(2);
  if (binary === 0) {…}

The if condition compares a string to a number using ===, which can never be true.

This actually scales up the double values by 2**1075, which is one factor of two too many — every resulting BigInt is even. The smallest representable nonzero double is 2**-1074, so it would be less confusing to scale up the values by 2**1074 unless for some reason you need every BigInt to be even.

Sorry, yes, 2**1075. I'd previously left the last bit as 0 because it let me write 2n ** (BigInt(exponent)) * significand, which was easier for me to think about than having to correct the off-by-one, but I've tweaked it now in 5b69773.

I also don't understand this snippet:

Whoops, that's a leftover from an earlier version. You're right that it doesn't do anything now (and would be redundant with the binary.length <= 53 case immediately below it in any case). Removed.

I've simplified the main implementation following this code from the original Python issue.

The implementation I had earlier was conceptually quite close to Shewchuk's algorithm: it added x into each partial in order, culminating in the "overflow" partial, handling overflows by allowing x to be biased and doing arithmetic on potentially-biased values. This is very easy to reason about, but kind of annoying to implement.

But it doesn't have to work like that. Instead, when overflow occurs during one of the "x + a partial" steps, you can reduce that sum by 2 ** 1024 and add 2 ** 1024 into the "overflow" partial immediately. That preserves all of the relevant properties (partials are non-overlapping and ascending in magnitude, etc), and means you don't need to worry about arithmetic on biased values.
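For concreteness, the "x + a partial" steps are built on the standard Two-Sum primitive, which recovers the exact rounding error of a double addition. A sketch (it assumes no intermediate overflow, which is exactly the case the comment above is about handling):

```javascript
// Branch-free Two-Sum (Knuth): hi is fl(a + b) and lo is the exact
// rounding error, so hi + lo equals a + b as mathematical values,
// provided no intermediate result overflows.
function twoSum(a, b) {
  const hi = a + b;
  const aApprox = hi - b;
  const bApprox = hi - aApprox;
  const lo = (a - aApprox) + (b - bApprox);
  return [hi, lo];
}
```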

I'm trying to put together disparate sources to convince myself of the correctness of the theorems, specifically to make sure that the invariants assumed by a subsequent step actually match what the previous step claims to produce. Can you document the precise invariant that the partials are required to satisfy in the implementation? The details matter here.

The Python issue assumes that the invariant is that they're nonoverlapping and in increasing magnitude.

In addition to edge cases around overflows, here is some of what I'm looking at: "nonoverlapping" does not mean that the partials are necessarily 53 bits apart; the binary values 11010000, 1000, 100, and 1.101 are nonoverlapping doubles, but some of the theorems in the papers require stronger input criteria which a sequence like that would violate. It's probably ok in this usage, but I need to check.


I've been relying on Theorem 10 of the Robust Arithmetic paper, which is fairly straightforward and which needs only the assumption that the partials are finite, nonoverlapping, and increasing in magnitude (and implicitly that two-sum doesn't overflow). Including a single final "overflow" partial whose low bit is 2**1024 obviously preserves the finite, nonoverlapping, increasing properties, and we can handle overflow explicitly. The algorithm fails once the "overflow" partial loses precision, at which point we throw.

The later theorems are much more complicated and require other properties, but aren't actually necessary for this implementation.

From what I can tell the paper doesn't cover producing a single sum from the expansion once you've added all the values in, but I convinced myself of the correctness of the procedure used in msum (namely, add the partials in descending order of magnitude, stopping once that loses precision) as follows. Because the partials are descending in magnitude, once you've started losing precision you know that the lowest bit of the value you've just added is at least a factor of 2**53 smaller in magnitude than the high bit of the current value of the sum. By the nonoverlapping property, the sum of all the remaining partials is strictly smaller in magnitude than the lowest bit of the value you've just added, so adding in all the remaining partials can't affect the sum except in the case that it would affect rounding, which is handled explicitly.
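That finishing procedure looks roughly like this in JS. This is a sketch modeled on the end of CPython's math_fsum; the partials are assumed finite, nonoverlapping, and ascending in magnitude.

```javascript
// Round a list of partials (nonoverlapping, ascending in magnitude)
// to a single double: add from the top down, stop at the first
// inexact addition, then fix up the round-half-even tie case.
function roundPartials(partials) {
  let n = partials.length;
  let hi = 0, lo = 0;
  if (n > 0) {
    hi = partials[--n];
    while (n > 0) {
      const x = hi;
      const y = partials[--n];
      hi = x + y;
      lo = y - (hi - x);   // exact error, since |x| >= |y|
      if (lo !== 0) break; // precision lost: lower partials only matter for ties
    }
    // If lo is exactly half an ulp and the next partial down has the
    // same sign, the true sum lies past the midpoint: round away.
    if (n > 0 && ((lo < 0 && partials[n - 1] < 0) ||
                  (lo > 0 && partials[n - 1] > 0))) {
      const y = lo * 2;
      const x = hi + y;
      if (y === x - hi) hi = x; // y added exactly: hi + 2*lo representable
    }
  }
  return hi;
}
```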

Just to double-check: the invariant is that the partials are nonzero, finite, nonoverlapping, and increasing in magnitude? (Theorem 10 allows zero-valued partials in input and output, while crsum assumes that the partials are nonzero and can produce incorrect results if zero-valued partials exist).

Yes, though strictly speaking that's covered by "increasing in magnitude" (except for possibly the first partial, for which it doesn't matter if it's zero). Theorem 10 does not guarantee that you end up with a list of partials with the "nonzero" property, but that's easily achieved by discarding partials which are 0.

To be precise: I've been relying on a slightly modified version of Theorem 10, which looks like:

If you have an expansion e where all the partials are finite, nonoverlapping, and increasing in magnitude, and the Two-Sum algorithm doesn't overflow, then applying the Grow-Expansion algorithm given e and b, followed by a step which discards any partials which are 0, will produce a new expansion e' where all the partials are finite, nonoverlapping, and increasing in magnitude, and such that the mathematical sum of e' is equal to the mathematical sum of e and b.

Incidentally there's been some research into faster approaches since Shewchuk '96; the fastest I've found is described in this post and given in code here.

Per its claims there, on medium-size and larger lists of doubles (~1k+ elements), you can get exact summation with a 4x slowdown over naive summation by using 67 64-bit values (536 bytes). If you have a larger list (10k+ elements) and are willing to spend more memory, you can get exact summation with less than a 2x slowdown over naive summation at the cost of using 33 KB of memory.

Hi. I notice that my xsum algorithm and software is mentioned here. You may be interested that it's used by the xsum package for Julia (https://juliapackages.com/p/xsum).

You may also be interested in the following recent paper (which I haven't read in full detail, but which looks interesting):

Lange, M., Towards accurate and fast summation, https://web.archive.org/web/20220616031105id_/https://dl.acm.org/doi/pdf/10.1145/3544488

One advantage of superaccumulators, as used in my method, is that one can easily give the correct result when it's a finite double even if naive summation would produce overflow for an intermediate result. One just needs enough bits in the superaccumulator to handle summation of the largest possible double the maximum conceivable number of times. In my code, I think I assume a maximum of 2^32 operands, but one can allow for more by just adding a few more bits to the superaccumulator.