WebAssembly/stringref

Clamping versus trapping for invalid offsets

wingo opened this issue · 2 comments

wingo commented

stringview_wtf16.get_codeunit will trap on an invalid offset. However, stringview_wtf16.encode and stringview_wtf16.slice will clamp their operands to within range. Should we make these more consistent?
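
Roughly, in the text format (a sketch only; names and operand order follow my reading of the current draft and may not match it exactly):

```wat
(module
  (memory $mem 1)
  (func $demo (param $view stringview_wtf16) (param $ptr i32)
              (result i32 i32)
    ;; Traps if offset 1000 is not less than the view's length.
    (stringview_wtf16.get_codeunit (local.get $view) (i32.const 1000))
    ;; Clamps the requested range [1000, 1000+8) to the view's bounds;
    ;; writing zero code units is a valid outcome. The result is the
    ;; number of code units actually written.
    (stringview_wtf16.encode $mem (local.get $view) (local.get $ptr)
                             (i32.const 1000) (i32.const 8))))
```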

One thing is that for stringview_wtf16.encode, there is a reasonable "null" answer that overlaps with the normal behavior: encoding nothing. The same goes for slices. Whereas for stringview_wtf16.get_codeunit, I guess you would define -1 as the exceptional answer, which could then "leak" to other parts of your program, e.g. if you mindlessly store it via i32.store16 and thereby write U+FFFF. Also, the compiler can use the fact that out-of-range accesses trap to make an inference about the string length. But these are relatively minor concerns, I think.
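
Concretely, the leak could look like this (hypothetical non-trapping semantics; the current draft traps):

```wat
(module
  (memory 1)
  ;; Hypothetical: if get_codeunit returned -1 on an out-of-bounds $i
  ;; instead of trapping, storing the result without a check would
  ;; silently truncate it to 0xFFFF, i.e. write U+FFFF.
  (func $copy_unit (param $view stringview_wtf16) (param $i i32) (param $ptr i32)
    (i32.store16 (local.get $ptr)
      (stringview_wtf16.get_codeunit (local.get $view) (local.get $i)))))
```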

In general I have the impression that being able to tell, from an operation, whether an access was performed out of bounds would provide the most utility, and perhaps even help with efficiency and code size.

To give an example of the utility: if stringview_wtf16.get_codeunit is performed in bounds, it could return the code unit in the range [0, 65535] as a 32-bit integer, while if the access is out of bounds, it could return a sentinel of -1 (along the lines of the "null" answer outlined above). This way, the producer can decide what to do with the OOB access (e.g. producing a RangeError) without performing any prior checks, which would otherwise add to code size.
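
Sketched in the text format (hypothetical semantics, since the current draft traps; $throw_range_error is an imaginary host import standing in for whatever the producer would do):

```wat
(module
  ;; Hypothetical host import, purely for illustration.
  (import "env" "throw_range_error" (func $throw_range_error))
  (func $get_or_throw (param $view stringview_wtf16) (param $i i32) (result i32)
    (local $unit i32)
    ;; Hypothetical semantics: code unit in [0, 65535], or -1 if out of bounds.
    (local.set $unit
      (stringview_wtf16.get_codeunit (local.get $view) (local.get $i)))
    (if (i32.eq (local.get $unit) (i32.const -1))
      ;; The producer decides what OOB means, e.g. surface a RangeError.
      (then (call $throw_range_error)))
    (local.get $unit)))
```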

And doing so might aid efficiency, in that the instruction has to perform a bounds check anyway to be safe, so if there were an additional check around the instruction generated by the producer (say, to avoid a trap), one of the checks would be redundant. Now, the expectation might be that a sufficiently smart VM can eliminate redundant bounds checks, but perhaps it's even better if no magic is necessary?
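
For contrast, this is roughly the pattern a producer has to emit today under trapping semantics, where the explicit check duplicates the one the engine performs internally (same hypothetical import as above):

```wat
(module
  ;; Hypothetical host import, purely for illustration.
  (import "env" "throw_range_error" (func $throw_range_error))
  (func $get_checked (param $view stringview_wtf16) (param $i i32) (result i32)
    ;; Producer-emitted bounds check, so the trapping path is never reached.
    (if (i32.ge_u (local.get $i)
                  (stringview_wtf16.length (local.get $view)))
      (then (call $throw_range_error)))
    ;; The engine still performs its own bounds check here before trapping,
    ;; unless it can prove that the check above makes it redundant.
    (stringview_wtf16.get_codeunit (local.get $view) (local.get $i))))
```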

Now, there might be instructions where a simple sentinel cannot be used, which would likely motivate a solution with multiple return values instead, which in turn might motivate a unified solution for all instructions producing additional values from side effects. But I digress. Anyway, it seems to me that trapping is undesirable most of the time, because there's nothing useful a producer can do with a trap; the most value lies in being able to decide, after the fact, what should be done in response :)
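
Purely as a strawman for the multi-value shape, something like:

```wat
;; Strawman only, not a proposed instruction: a variant that returns the
;; code unit plus an i32 flag indicating whether the access was in bounds.
;;
;;   stringview_wtf16.get_codeunit_try : [stringview_wtf16 i32] -> [i32 i32]
```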

@dcodeIO: good point.

I came here to say that I agree about the dangers of returning -1 from get_codeunit (which calling code might forget to check for), so consequently, if we wanted consistency, that would mean making all instructions trap on invalid parameters.

But the argument that traps are hard to deal with (modules would want to avoid them, so they'd have to emit manual checks before any trapping instruction, which often means duplicated work unless the engine is sufficiently clever) does hold a lot of water: fallible operations are good.

Maybe some of it comes down to seeing a representative set of concrete use cases for the various instructions. For instance, we might find that get_codeunit is nearly always used in a loop (over some part or all of the stringview), where the loop's end condition is a bounds-checked index anyway (think for (i = 0; i < string.length; i++)). In that case, it wouldn't matter whether get_codeunit traps or returns a sentinel on OOB, because OOB won't happen anyway.
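
In text-format terms, that common shape looks roughly like this (a sketch, using the view instructions as I understand the current draft):

```wat
(module
  (func $sum_codeunits (param $s stringref) (result i32)
    (local $view stringview_wtf16)
    (local $i i32)
    (local $len i32)
    (local $sum i32)
    (local.set $view (string.as_wtf16 (local.get $s)))
    (local.set $len (stringview_wtf16.length (local.get $view)))
    (block $done
      (loop $next
        (br_if $done (i32.ge_u (local.get $i) (local.get $len)))
        ;; $i < $len on this path, so the access is always in bounds and
        ;; the trap-vs-sentinel question never comes up.
        (local.set $sum
          (i32.add (local.get $sum)
                   (stringview_wtf16.get_codeunit (local.get $view)
                                                  (local.get $i))))
        (local.set $i (i32.add (local.get $i) (i32.const 1)))
        (br $next)))
    (local.get $sum)))
```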

Also, if we see that different instructions are usually used in different contexts (with respect to whether their inputs are naturally pre-checked or not), we may decide that making each of them behave in a way that maximizes utility for its primary use case is better than aiming for any form of "consistency".