Do source maps represent original location or current location?
Opened this issue · 2 comments
In JS, source maps are meant to help debugging by connecting the original source locations to the (converted/minified) JS locations, so they contain original user programs’ source locations and functions.
But our source maps for Wasm files are currently in an ambiguous state. For example, for inlined functions, our source maps contain not the original function/line/column but the function/line/column that the original location is inlined into. Because we create source maps using DWARF information, we can in theory do better by parsing all inlined callsite information. Source maps don’t have the capability of storing multiple inlined locations for a single location, but at least it can store the most original, before-inliing location. But that would take more efforts and require a full parsing of llvm-dwarfdump results, making #9580 unusable.
I’m thinking of adding support for names field with function names (#25044), and this is the issue here again. It was suggested that I use the name section (instead of the full llvm-dwarfdump) info, which makes sense, but this means the function names/indices the names field and what the mappings field will contain are not the original function names but the current function names. So if inlining or outlining happened, the function names will reflect the final inlined/outlined state, not the original function names.
This has some advantages. This is a lot easier to generate. With this maybe we can replace the name section with the source map’s names field, because they will be the same. But it may not be the most helpful thing for debugging, and the meaning of source maps will be different from that of JS, where source maps contain the original source information.
Currently in Binaryen we don’t update the names field, which is not the worst thing in the world because at least it preserves input function names that each Expression is associated with: https://github.com/WebAssembly/binaryen/blob/600ccd0d3892670648c274ef24be7c673b8854f5/src/wasm/wasm-binary.cpp#L1268-L1272
We can do better; we can remove function names deleted after optimizations, which will in some cases significantly reduce the size of the names field.
But actually this is a weird middle ground already, because the names field we read from the input file wouldn’t be the original function names anyway, because they are generated from the LLVM’s optimized output, unless we generate them using DWARF’s full inlining information.
If we want to be simpler and just reflect the final status of Wasm functions, we don’t even need to read the names field and write that back with the output. We can just generate the names field from scratch using the final status of the optimized Wasm file.
What do you think the source maps should contain?
To sum up,
- Source maps contain the final Wasm function/line/column information
- Pros: Easy to generate, and
namesfield can serve as the name section - Cons: May not be helpful for debugging when inlining/outlining occurs
- Source maps contain the original source’s function/line/column information
- Pros: Helpful for debugging. Consistent with the original meaning of source map for JS
- Cons: Hard to generate. Need to read the full DWARF.
If we decide to use the name section to generate DWARF in #25044, that means we choose 1. And in Binaryen we wouldn't even need to read names field from the input; we should just regenerate it from the output.
But our source maps for Wasm files are currently in an ambiguous state. For example, for inlined functions, our source maps contain not the original function/line/column but the function/line/column that the original location is inlined into.
At least the line info seems to be ok? Or maybe I am not measuring it right. Here is what I did:
// a.cpp
int test(int x) {
return x + x * 1337 + (x % 12345678) + (x & 999);
}
int main(int argc, char **argv) {
return test(argc) + test(argc + 1);
}LLVM inlines test when doing emcc a.cpp -gsource-map -O1 (I also tested -Os to see wasm-opt doesn't mess things up). Then wasm-opt a.out.wasm -ism a.out.wasm.map --print shows
(func $main (param $0 i32) (param $1 i32) (result i32)
(local $2 i32)
;;@ a.cpp:7:20
(i32.add
;;@ a.cpp:3:39
(i32.add
;;@ a.cpp:3:22
(i32.add
;;@ a.cpp:3:39
(i32.add
;;@ a.cpp:3:44
(i32.and
;;@ a.cpp:6:0
(local.get $0)
;;@ a.cpp:6:0
(i32.const 999)
)So a bunch of that code is marked as coming from line 3, the original location before inlining, as expected. Some seems wrong, like the 999 has 6 and not 3... not sure why that is, but maybe those are minor LLVM inaccuracies with tracking locations during optimization?
after reading the examples at #20462 (comment) it just occurred to me that if we always encode the source for a piece of code as the inlined function, we can maybe infer inlining information in emsymbolizer, even if it's not explicitly encoded in the source map. For example given an address, we figure out which function index (in the compiled file) it's in, find that function's name (perhaps by finding the first name entry starting after the function's starting address) and then compare that to the name encoded in the sourcemap. If they are different, the address has been inlined. Perhaps we could handle multiple levels of inlining by walking the function from the start and tracking how many nestings we see. That nesting would be imperfect because code gets moved around, you could have 2 levels of inlining getting "popped" at once, a function could get inlined multiple times into the same outer function, and probably other complications, so maybe it wouldn't work very well. But even if we guessed at just one level of inlining (and only had names rather than line numbers for the inline site), it might still be an improvement over what we have now, and it wouldn't need any proprietary encoding of inline info in sourcemaps, or any rewrites of Binaryen's DWARF support.