parsingdata/metal

Potential stack overflow with ConcatenatedValueSource

Opened this issue ยท 4 comments

While working on a specific token, I triggered a stack overflow due to a deep recursion with a concatenated value source. The problem is the following piece from its implementation:

private Trampoline<byte[]> getData(final ImmutableList<Value> values, final BigInteger currentOffset, final BigInteger currentDest, final BigInteger offset, final BigInteger length, final byte[] output) {
    if (length.compareTo(ZERO) <= 0) {
        return complete(() -> output);
    }
    if (currentOffset.add(values.head.slice.length).compareTo(offset) <= 0) {
        return intermediate(() -> getData(values.tail, currentOffset.add(values.head.slice.length), currentDest, offset, length, output));
    }
    final BigInteger localOffset = offset.subtract(currentOffset).compareTo(ZERO) < 0 ? ZERO : offset.subtract(currentOffset);
    final BigInteger toCopy = length.compareTo(values.head.slice.length.subtract(localOffset)) > 0 ? values.head.slice.length.subtract(localOffset) : length;
    System.arraycopy(values.head.slice.getData(), localOffset.intValueExact(), output, currentDest.intValueExact(), toCopy.intValueExact());
    return intermediate(() -> getData(values.tail, currentOffset.add(values.head.slice.length), currentDest.add(toCopy), offset, length.subtract(toCopy), output));
}

In the second to last line, a call to getData is done. This may very well be another concatenated value source. The problem then is that this function is not tail recursive. I made a test for proof, together with a quick fix so I could use it locally in my branch https://github.com/ccreeten/metal/tree/concatenated-overflow

Now, I am thinking this problem may appear in other places as well. For example, the Seq implementation parses the head, and recursively calls the iterate for the tail. However, if I am not mistaken, we can again trigger an overflow by having a big recursive definition with sequences in the head side.

Of course, this is probably never going to become a problem. The concatenated value source happened with a real world example though. It can probably be solved by creating a different token implementation however.

jvdb commented

Interesting find! I haven't looked at this in-depth yet, but I would think that the values.head.slice.getData() call should just be wrapped in a trampoline?

With regards to implementations such as Seq, I think it was a conscious decision that we don't expect definitions to be so big that they will cause a stack overflow. At the same time, if we start generating/deriving token definitions, anything is possible...

I think it would not solve the problem, because it is not tail recursive. The way I see it, with a tail recursive trampoline implementation, a call to a trampoline pushes a new stack frame, and a return of a trampoline then pops it. When you call another trampoline inside this function, you are already pushing a next frame, and it being a recursive function, will make this happen again and again. At least, that is what I think ๐Ÿ˜„

With tokens, it seems like a reasonable trade-off. The most likely way I would see it happen with a manually created token would be with a TokenRef I would think.

jvdb commented

I will debug your example to see if I can come up with a good solution! But I agree the non-tail-recursiveness is a deal-breaker for just wrapping in a trampoline.

Note: if you get a java heap space error when using fold(..., Shorthand::cat), then use cat(...) instead, that uses FoldCat which is an optimized version and will create a ConcatenatedValueSource with a single long list.