rustwasm/twiggy

Twiggy doesn't account for 100% of a binary's bytes


Currently twiggy does an excellent job of profiling things like function sizes, but its accounting doesn't always add up to 100% of the bytes in a file. For example, on WebAssembly the 8-byte header at the start of every wasm file is unaccounted for, and so are each section's id byte and the LEB128 length that says how big the section is.
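For concreteness, here's a minimal sketch (not twiggy code; the function names and structure are hypothetical) that walks a wasm binary and tallies exactly those unaccounted bytes: the 8-byte header (4-byte magic plus 4-byte version), plus each section's one-byte id and its LEB128-encoded size field:

// Tally the bytes twiggy currently ignores: file header, section ids,
// and LEB128 section sizes. Purely illustrative.
fn read_leb128_u32(bytes: &[u8], pos: &mut usize) -> (u32, usize) {
    let (mut value, mut shift, mut len) = (0u32, 0u32, 0usize);
    loop {
        let byte = bytes[*pos];
        *pos += 1;
        len += 1;
        value |= u32::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return (value, len);
        }
        shift += 7;
    }
}

fn unaccounted_overhead(wasm: &[u8]) -> usize {
    assert_eq!(&wasm[0..4], b"\0asm"); // 4-byte magic
    let mut overhead = 8; // magic + 4-byte version
    let mut pos = 8;
    while pos < wasm.len() {
        pos += 1; // section id byte
        let (size, leb_len) = read_leb128_u32(wasm, &mut pos);
        overhead += 1 + leb_len;
        pos += size as usize; // skip the section's contents
    }
    overhead
}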

It'd be neat if internally we could refactor this to perhaps have overlapping nodes in a file. For example, I'd say that the code section (leading id byte, LEB128 size, and all) is 10 bytes large. Two functions inside it would then be, perhaps, 5 and 3 bytes large. That means the code section's own overhead is 2 bytes, and the other 8 of its 10 bytes are accounted for by the elements inside the section.
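In code form, the arithmetic from that example (all numbers invented in the paragraph above) would be:

// Hypothetical sizes from the example above.
let code_section_total = 10; // id byte + LEB128 size + contents
let function_sizes = [5, 3];
let section_own_bytes = code_section_total - function_sizes.iter().sum::<i32>();
assert_eq!(section_own_bytes, 2); // overhead attributable to the section itself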

I'm not sure that twiggy's internals are quite ready for this sort of nesting relationship, though, but it'd be neat to account for 100% of the bytes.

> perhaps have overlapping nodes in a file.

Anything shared should be its own node in the graph with multiple incoming edges (which is what it means to be shared).
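A tiny sketch of that invariant (hypothetical node/edge representation, not twiggy's actual IR):

// Nodes are indices; an edge (from, to) means `from` references `to`.
// A node is shared exactly when it has more than one incoming edge.
fn shared_nodes(num_nodes: usize, edges: &[(usize, usize)]) -> Vec<usize> {
    let mut incoming = vec![0u32; num_nodes];
    for &(_from, to) in edges {
        incoming[to] += 1;
    }
    (0..num_nodes).filter(|&n| incoming[n] > 1).collect()
}

// func_a (0) and func_b (1) both reference a helper (2), so it is shared:
// shared_nodes(3, &[(0, 2), (1, 2)]) == vec![2]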

I should also mention that I haven't actually tested this out yet, so it may already work! I can try to give this a stab sometime soon.

Ok, so I've tested this out and I'm not sure it quite works the way I'm thinking, but @fitzgen perhaps you can double-check my thinking?

I'm imagining something like:

// One node for the section itself, sized with its id byte and LEB128
// length included.
let id = Id::section(idx);
items.add_root(Item::new(id, name, full_section_size, ir::Misc::new()));

// One node per entry inside the section, each with its own size.
for (i, _entry) in iter.enumerate() {
    let id = Id::entry(idx, i);
    items.add_root(Item::new(id, name, entry_size, ir::Misc::new()));
}

Doing that locally, I get output that looks like:

 Shallow Bytes │ Shallow % │ Item
───────────────┼───────────┼───────────────────────────────────
           556 ┊    71.28% ┊ data section headers
           529 ┊    67.82% ┊ data[0]
           104 ┊    13.33% ┊ custom section 'producers'
           104 ┊    13.33% ┊ custom section 'producers' headers
            26 ┊     3.33% ┊ data[1]
            16 ┊     2.05% ┊ export section headers
            12 ┊     1.54% ┊ "function names" subsection
            12 ┊     1.54% ┊ custom section 'name' headers
             9 ┊     1.15% ┊ export "memory"
             9 ┊     1.15% ┊ code section headers
             8 ┊     1.03% ┊ add
             7 ┊     0.90% ┊ type section headers
             6 ┊     0.77% ┊ type[0]: (i32, i32) -> i32
             6 ┊     0.77% ┊ export "add"
             3 ┊     0.38% ┊ memory section headers
             2 ┊     0.26% ┊ function section headers
             2 ┊     0.26% ┊ memory[0]
             1 ┊     0.13% ┊ func[0]
          1412 ┊   181.03% ┊ Σ [18 Total Rows]

which looks like the data is being double-counted, because the sizes are overlapping. What I'm ideally looking for is for the size of data section headers to be sizeof(data section headers) - sizeof(data[0]) - sizeof(data[1])
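Plugging the numbers from the table above into that formula (purely to illustrate the desired accounting), the data section node would shrink to just the bytes its entries don't cover:

let data_section_total = 556; // the "data section headers" row above
let data_entries = [529, 26]; // data[0] and data[1]
let header_only_bytes = data_section_total - data_entries.iter().sum::<i32>();
assert_eq!(header_only_bytes, 1);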

> What I'm ideally looking for is for the size of data section headers to be sizeof(data section headers) - sizeof(data[0]) - sizeof(data[1])

If we kept track of which section every item was from, we could do this computation somewhat directly.
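Something like the following sketch (hypothetical types; twiggy's real IR differs): each item records its parent section, and a section's own size is its total minus the sum of its children:

// Hypothetical flat representation: sections have a total size, and
// each item optionally points back at the section that contains it.
struct Item {
    section: Option<usize>, // index into `section_totals`
    size: u32,
}

fn section_self_sizes(section_totals: &[u32], items: &[Item]) -> Vec<u32> {
    let mut self_sizes = section_totals.to_vec();
    for item in items {
        if let Some(s) = item.section {
            self_sizes[s] -= item.size; // subtract children from their section
        }
    }
    self_sizes
}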

But in general, the nested-intervals-of-memory approach you seem to be conceptualizing is a bit different from the graph-based approach that twiggy takes right now. And I think we can certainly improve what we have and bridge the gap, but I also think we should be deliberate and careful not to accidentally stumble into a neither-tree-nor-graph representation. Like, I wouldn't want to end up with some analyses assuming a tree with overlapping size counts and others assuming non-overlapping nodes and expecting the sizes to sum to 100%.

Perhaps if we changed the id/node representation to be more tree/interval-based and explicitly represented the shared vs. unshared sizes?
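One way to picture that (illustrative only, not a proposal for twiggy's actual types): ids form a tree of byte intervals, and each node distinguishes the bytes it spans from the bytes it owns outright:

// A node spans a half-open byte interval [start, end) of the file and
// may contain child intervals nested inside it.
struct Node {
    start: usize,
    end: usize,
    children: Vec<Node>,
}

impl Node {
    // Total bytes spanned, children included (the overlapping view).
    fn total_size(&self) -> usize {
        self.end - self.start
    }

    // Bytes covered by no child (the unshared view); summing this over
    // every node in the tree accounts for exactly 100% of the file.
    fn self_size(&self) -> usize {
        let children: usize = self.children.iter().map(Node::total_size).sum();
        self.total_size() - children
    }
}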

I think that'd work, yeah, but after some more thought I think we may be able to implement this, at least for wasm, in a simpler fashion.