MolecularMatters/raw_pdb

What is the equivalent of IDiaSymbol::get_length?

pierricgimmig opened this issue · 11 comments

To get the length in bytes of a function from its corresponding DIA symbol, we can use the get_length accessor. How could we retrieve the same information in raw_pdb?

Since you specifically asked about function symbols, I guess this is for Optick, so I assume you are only interested in function symbols per se.

The fastest way I know of how to do that would be the following:

  1. Walk the module symbol streams first and fetch everything that is a function. Keep track of which RVAs you have already found; this is needed later. The size of the function is stored in any of the S_*PROC.codeSize members.
  2. At this point, you already know ~90% of all function symbols and are done. However, with stripped PDBs or certain PDBs from middleware providers, there will be public symbols that are not stored in any of the module streams.
  3. Walk the public symbol stream, ignoring anything that is not a function. This can be done with a simple bit-check against S_PUB32.flags.
  4. For each public symbol that is a function, consult the previously stored table of RVAs from step 1. If this is a new symbol, you still need to get its size. This can be done by computing the distance between this and the next function symbol.
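The four steps above can be sketched roughly as follows. This is a minimal, self-contained model and not raw_pdb code: `ModuleProc`, `PublicSym`, `FunctionInfo` and `CollectFunctions` are hypothetical names, and the records are assumed to have already been read out of the module streams and the public symbol stream (with the S_PUB32.flags check reduced to a plain `isFunction` bool).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified records; these are NOT raw_pdb types.
struct ModuleProc { uint32_t rva; uint32_t codeSize; };  // stands in for an S_*PROC record
struct PublicSym  { uint32_t rva; bool isFunction; };    // stands in for S_PUB32 after the flags check
struct FunctionInfo { uint32_t rva; uint32_t size; bool sizeIsEstimate; };

std::vector<FunctionInfo> CollectFunctions(const std::vector<ModuleProc>& procs,
                                           const std::vector<PublicSym>& publics)
{
    std::vector<FunctionInfo> result;
    std::unordered_set<uint32_t> seenRvas;

    // Steps 1 and 2: functions from the module streams carry their exact size.
    for (const ModuleProc& p : procs)
    {
        if (seenRvas.insert(p.rva).second)
            result.push_back({ p.rva, p.codeSize, false });
    }

    // Steps 3 and 4: add public function symbols that were not seen before.
    // Their size is unknown at this point and is estimated further down.
    for (const PublicSym& s : publics)
    {
        if (s.isFunction && seenRvas.insert(s.rva).second)
            result.push_back({ s.rva, 0u, true });
    }

    // Sort by RVA so that "the next function symbol" is simply the next element.
    std::sort(result.begin(), result.end(),
              [](const FunctionInfo& a, const FunctionInfo& b) { return a.rva < b.rva; });

    // Estimate missing sizes as the distance to the next function symbol.
    // A public-only symbol at the very end has no successor and keeps size 0.
    for (std::size_t i = 0; i + 1 < result.size(); ++i)
    {
        if (result[i].sizeIsEstimate)
            result[i].size = result[i + 1].rva - result[i].rva;
    }

    return result;
}
```

Note that the estimated size includes any padding between two functions, so it is an upper bound rather than the exact code size.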

Since function symbols & sizes are what most profilers are interested in, I will provide an example for that.

I added an example that demonstrates how to do this.
Let me know if that works for you.

Thanks a lot for the detailed example! I'll try it out today and let you know how it goes.

Since you specifically asked about function symbols, I guess this is for Optick

It's for Orbit :-)

We already have two pdb parser implementations, one using LLVM and the other one using the DIA SDK. We're interested in seeing how raw_pdb compares.

It's for Orbit :-)

Ah yes, I always mix up Orbit and Optick :).
I know Orbit from when you first presented it, it seems to have grown a lot during the last years!

Quick update on this. I integrated the sample code you provided and it seems to work as expected. The raw_pdb implementation is much faster than its DIA counterpart. It also found more symbols than both our DIA and LLVM versions; I need to dig a bit more to understand exactly what the difference is. I did notice quite a bit of "ILT" symbols which I'm discarding, but I still have more symbols than before, which is good but surprising. It's probably the "public symbols that are not stored in any of the module streams". Many thanks @MolecularMatters for your help and for the great project!

I did notice quite a bit of "ILT" symbols which I'm discarding, but I still have more symbols than before, which is good but surprising.

The ILT symbols are "incremental linking thunks" stored in the "* Linker *" module stream. They are produced when compiling with /INCREMENTAL, and are 5-byte jmp thunks. DIA will also return them during enumeration when using SymTagThunk. Are you missing those in your DIA implementation?

It's probably the "public symbols that are not stored in any of the module streams".

Most likely not, because DIA will also return those symbols when enumerating with SymTagPublicSymbol. However, what you might be missing is the fact that with DIA, you have to recurse into returned IDiaSymbol* with ::findChildren. If you don't do that, you will certainly be missing symbols.

Internally, in the module streams, there are symbols which open a scope (e.g. S_LPROC32) and others which close a scope (e.g. S_END). DIA seems to follow this parent-child relationship when enumerating symbols, hence you have to ask for children of returned symbols as well.
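To illustrate how a flat record stream can encode such a parent-child relationship, here is a small sketch that assigns a nesting depth to each record. It is a simplified model under stated assumptions: `Kind`, `Record`, `Scoped` and `AssignDepths` are hypothetical names, not raw_pdb or DIA types, and real module streams contain many more record kinds than the four modelled here.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical record kinds: ProcOpen stands in for scope-opening procedure
// records such as S_LPROC32, BlockOpen for S_BLOCK32, End for S_END.
enum class Kind { ProcOpen, BlockOpen, End, Other };

struct Record { Kind kind; uint32_t id; };
struct Scoped { uint32_t id; int depth; };

// Walks the flat stream once and records how deeply each symbol is nested.
std::vector<Scoped> AssignDepths(const std::vector<Record>& stream)
{
    std::vector<Scoped> out;
    int depth = 0;

    for (const Record& r : stream)
    {
        if (r.kind == Kind::End)
        {
            // Closes the innermost open scope; guard against malformed input.
            if (depth > 0)
                --depth;
            continue;
        }

        out.push_back({ r.id, depth });

        // Everything up to the matching End is a child of this record.
        if (r.kind == Kind::ProcOpen || r.kind == Kind::BlockOpen)
            ++depth;
    }

    return out;
}
```

Records with depth 0 are what a non-recursive enumeration would see; everything with a larger depth only shows up when asking for children.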

In order to find all function symbols in a PDB using DIA, you have to:

  • enumerate all SymTagPublicSymbol
  • enumerate all SymTagFunction and SymTagBlock (!)
  • for each SymTagFunction and SymTagBlock, use findChildren(), recursively, until there are no more symbols returned

Once you do that in DIA, you should be able to get the same number of symbols, but the performance gap between DIA and raw_pdb will become even bigger.
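The shape of that recursion, sketched on a plain in-memory tree instead of the real DIA interfaces: `Tag`, `Node`, `CollectRecursive` and `CollectAllFunctions` are hypothetical names, and iterating `Node::children` merely stands in for what a findChildren() call would return. It is a model of the traversal order, not DIA code.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for SymTagFunction, SymTagBlock, SymTagPublicSymbol and SymTagData.
enum class Tag { Function, Block, PublicSymbol, Data };

struct Node
{
    Tag tag;
    std::string name;
    std::vector<Node> children;  // models the result of findChildren()
};

void CollectRecursive(const Node& node, std::vector<std::string>& out)
{
    // Public symbols and functions are the symbols we want to report.
    if (node.tag == Tag::Function || node.tag == Tag::PublicSymbol)
        out.push_back(node.name);

    // Functions and blocks can open a scope, so ask them for children as well,
    // recursively, until nothing more is returned.
    if (node.tag == Tag::Function || node.tag == Tag::Block)
    {
        for (const Node& child : node.children)
            CollectRecursive(child, out);
    }
}

std::vector<std::string> CollectAllFunctions(const std::vector<Node>& topLevel)
{
    std::vector<std::string> out;
    for (const Node& node : topLevel)
        CollectRecursive(node, out);
    return out;
}
```

A non-recursive enumeration of the same tree would only report the top-level entries, which is one way to end up with fewer symbols.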

The raw_pdb implementation is much faster than its DIA counterpart.

It would be great if you could provide some numbers for comparison, once you figure out which symbols you are missing in DIA.

Again, thanks @MolecularMatters for the detailed answer. This is great information, I'll double check our LLVM and DIA implementations.

It would be great if you could provide some numbers for comparison, once you figure out which symbols you are missing in DIA.

Absolutely!!

However, what you might be missing is the fact that with DIA, you have to recurse into returned IDiaSymbol* with ::findChildren. If you don't do that, you will certainly be missing symbols.

Is this only true when iterating over the different compilands/modules and their children (with the filter for SymTagFunction), or do I also need to take care of this when getting all children (that have SymTagFunction) from the global scope, right away?

In my experiments, the results were the same.

enumerate all SymTagFunction and SymTagBlock (!)

Also, as far as I understand the documentation, blocks should usually not have a name, and there should be a function surrounding them, right? So for Orbit, it would be fine to ignore those.

Is this only true when iterating over the different compilands/modules and their children (with the filter for SymTagFunction), or do I also need to take care of this when getting all children (that have SymTagFunction) from the global scope, right away?

In my experiments, the results were the same.

If I remember correctly, Clang likes to store data symbols as children of the global scope sometimes (e.g. function static variables), which MSVC never does.

Also, as far as I understand the documentation, blocks should usually not have a name, and there should be a function surrounding them, right? So for Orbit, it would be fine to ignore those.

I think that is mostly true, but I encountered blocks that don't seem to belong to any other function, and had to be matched against address ranges from other function symbols. That only seemed to be the case for certain kernel symbol PDBs though.
Maybe @rovarma from Superluminal can comment, since I believe he also ran into this.
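Matching such an orphaned block against the address ranges of known functions could look roughly like this. It is a sketch under assumptions: `Range` and `FindContaining` are hypothetical names, and the function table is assumed to be sorted by RVA and non-overlapping.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical address range of a function: [rva, rva + size).
struct Range { uint32_t rva; uint32_t size; };

// Returns the index of the function whose range contains 'rva', or -1 if none does.
// 'sortedFunctions' must be sorted by ascending rva.
int FindContaining(const std::vector<Range>& sortedFunctions, uint32_t rva)
{
    // Find the first function that starts after 'rva' ...
    auto it = std::upper_bound(sortedFunctions.begin(), sortedFunctions.end(), rva,
                               [](uint32_t value, const Range& f) { return value < f.rva; });
    if (it == sortedFunctions.begin())
        return -1;  // 'rva' lies before the first function

    // ... then step back to the last function starting at or before 'rva'
    // and check whether its range actually covers the address.
    --it;
    if (rva - it->rva < it->size)
        return static_cast<int>(it - sortedFunctions.begin());

    return -1;  // 'rva' falls into a gap between functions
}
```

Calling this with the start RVA of a block yields the candidate parent function, or -1 if the block lies in a gap and really is on its own.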

If I remember correctly, Clang likes to store data symbols as children of the global scope sometimes (e.g. function static variables), which MSVC never does.

As I was only looking into function symbols and not data symbols, that seems to be fine. Thanks for the explanation!

I think that is mostly true, but I encountered blocks that don't seem to belong to any other function, and had to be matched against address ranges from other function symbols. That only seemed to be the case for certain kernel symbol PDBs though.

Thanks for the clarification. If you remember the PDBs in question, that would be great.