
A C++11 library for reading Microsoft Program DataBase PDB files

Primary language: C++. License: BSD 2-Clause "Simplified" (BSD-2-Clause).

RawPDB

RawPDB is a C++11 library that directly reads Microsoft Program DataBase PDB files. The code is extracted almost directly from Live++ 2, a battle-tested hot-reload tool for C++.

Design

RawPDB gives you direct access to the stream data contained in a PDB file. It does not attempt to offer abstractions for iterating symbols, translation units, contributions, etc.

Building a high-level abstraction over the provided low-level data is ill-fated: it can never really be performant for everybody, because different tools such as debuggers, hot-reload tools (e.g. Live++), and profilers (e.g. Superluminal) need to perform different queries against the stored data.

We therefore believe the best solution is to offer direct access to the underlying data, with applications bringing that data into their own structures.
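As a rough illustration of this philosophy (not RawPDB's actual API), the sketch below walks a buffer of length-prefixed records once and keeps only lightweight views that one particular tool might need. The record layout (a 16-bit size followed by a 16-bit kind and a payload, with the size counting the bytes after the size field) mirrors how CodeView symbol records are commonly described. `RecordHeader`, `MyRecord`, and `CollectRecords` are hypothetical names, and the code assumes the host byte order matches the data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical raw header: 'size' counts the bytes that follow the size field.
struct RecordHeader
{
    uint16_t size;
    uint16_t kind;
};

// The application's own representation: only what this particular tool needs.
struct MyRecord
{
    uint16_t kind;
    const uint8_t* payload; // points into the original buffer, nothing is copied
    size_t payloadSize;
};

// Walks the raw bytes once and stores lightweight views in the caller's container.
inline std::vector<MyRecord> CollectRecords(const uint8_t* data, size_t size)
{
    assert(data != nullptr || size == 0);

    std::vector<MyRecord> records;
    size_t offset = 0;
    while (offset + sizeof(RecordHeader) <= size)
    {
        RecordHeader header;
        std::memcpy(&header, data + offset, sizeof(header)); // avoids unaligned reads

        // 'size' must at least cover the kind field, and the record must fit in the buffer.
        const size_t recordEnd = offset + sizeof(uint16_t) + header.size;
        if (header.size < sizeof(uint16_t) || recordEnd > size)
            break; // malformed or truncated record, stop walking

        records.push_back({ header.kind,
                            data + offset + sizeof(RecordHeader),
                            header.size - sizeof(uint16_t) });
        offset = recordEnd;
    }
    return records;
}
```

A debugger might keep every record, while a profiler could filter on `kind` inside the loop and store only function records. That decision stays with the application.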

Goal

Eventually, we want RawPDB to become the de facto replacement for Microsoft's DIA SDK that most C++ developers (have to) use.

Features

  • Fast - RawPDB works directly with memory-mapped data, so only the data from the streams you touch affects performance. It is orders of magnitude faster than the DIA SDK, and faster than comparable LLVM code
  • Scalable - RawPDB's API gives you access to individual streams that can all be read concurrently in a trivial fashion, since all returned data structures are immutable. There are no locks or waits anywhere inside the library
  • Lightweight - RawPDB is small and compiles in roughly 1 second
  • Allocation-friendly - RawPDB performs only a few allocations, and those can be overridden easily by changing the underlying macro
  • No STL - RawPDB does not need any STL containers or algorithms
  • No exceptions - RawPDB does not use exceptions
  • No RTTI - RawPDB does not need RTTI or use class hierarchies
  • High-quality code - RawPDB compiles clean under -Wall
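The "Allocation-friendly" point above relies on a common pattern: the library allocates only through macros that the application can redefine. The sketch below shows that general pattern under assumed names. `MYLIB_ALLOC`, `MYLIB_FREE`, `CountingAlloc`, and `AllocateOffsets` are hypothetical and are not RawPDB's actual macro or function names, so check the library source for the real ones.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Application side: a custom allocator that counts how often it is called.
static size_t g_allocationCount = 0;

inline void* CountingAlloc(size_t size)
{
    ++g_allocationCount;
    return std::malloc(size);
}

inline void CountingFree(void* ptr)
{
    std::free(ptr);
}

// Application side: define the hooks before the library code is compiled.
#define MYLIB_ALLOC(size) CountingAlloc(size)
#define MYLIB_FREE(ptr) CountingFree(ptr)

// Library side: defaults that apply only if the application defined nothing.
#ifndef MYLIB_ALLOC
#define MYLIB_ALLOC(size) std::malloc(size)
#endif
#ifndef MYLIB_FREE
#define MYLIB_FREE(ptr) std::free(ptr)
#endif

// Library side: every allocation goes through the macros above.
inline unsigned int* AllocateOffsets(size_t count)
{
    assert(count > 0);
    return static_cast<unsigned int*>(MYLIB_ALLOC(count * sizeof(unsigned int)));
}

inline void FreeOffsets(unsigned int* offsets)
{
    MYLIB_FREE(offsets);
}
```

Because the hooks are resolved at compile time, redirecting allocations to an arena or a tracking allocator costs nothing at runtime beyond the allocator itself.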

Building

The code compiles clean under Visual Studio 2015, 2017, 2019, or 2022. A solution for Visual Studio 2019 is included.

Performance

Running the Symbols and Contributions examples on a 1GiB PDB yields the following output:

Opening PDB file C:\Development\llvm-project\build\tools\clang\unittests\Tooling\RelWithDebInfo\ToolingTests.pdb

Running example "Symbols"
| Reading image section stream
| ---> done in 0.066ms
| Reading module info stream
| ---> done in 0.562ms
| Reading symbol record stream
| ---> done in 25.185ms
| Reading public symbol stream
| ---> done in 1.133ms
| Storing public symbols
| ---> done in 46.171ms (212023 elements)
| Reading global symbol stream
| ---> done in 1.381ms
| Storing global symbols
| ---> done in 12.769ms (448957 elements)
| Storing symbols from modules
| ---> done in 145.849ms (2243 elements)
---> done in 233.694ms (539611 elements)
Opening PDB file C:\Development\llvm-project\build\tools\clang\unittests\Tooling\RelWithDebInfo\ToolingTests.pdb

Running example "Contributions"
| Reading image section stream
| ---> done in 0.066ms
| Reading module info stream
| ---> done in 0.594ms
| Reading section contribution stream
| ---> done in 9.839ms
| Storing contributions
| ---> done in 67.346ms (630924 elements)
| std::sort contributions
| ---> done in 19.218ms
---> done in 97.283ms
20 largest contributions:
1: 1896496 bytes from LLVMAMDGPUCodeGen.dir\RelWithDebInfo\AMDGPUInstructionSelector.obj
2: 1700720 bytes from LLVMHexagonCodeGen.dir\RelWithDebInfo\HexagonInstrInfo.obj
3: 1536470 bytes from LLVMRISCVCodeGen.dir\RelWithDebInfo\RISCVISelDAGToDAG.obj
4: 1441408 bytes from LLVMAArch64CodeGen.dir\RelWithDebInfo\AArch64InstructionSelector.obj
5: 1187048 bytes from LLVMRISCVCodeGen.dir\RelWithDebInfo\RISCVInstructionSelector.obj
6: 1026504 bytes from LLVMARMCodeGen.dir\RelWithDebInfo\ARMInstructionSelector.obj
7: 952080 bytes from LLVMAMDGPUDesc.dir\RelWithDebInfo\AMDGPUMCTargetDesc.obj
8: 849888 bytes from LLVMX86Desc.dir\RelWithDebInfo\X86MCTargetDesc.obj
9: 712176 bytes from LLVMHexagonCodeGen.dir\RelWithDebInfo\HexagonInstrInfo.obj
10: 679035 bytes from LLVMX86CodeGen.dir\RelWithDebInfo\X86ISelDAGToDAG.obj
11: 525174 bytes from LLVMAMDGPUDesc.dir\RelWithDebInfo\AMDGPUMCTargetDesc.obj
12: 523035 bytes from * Linker *
13: 519312 bytes from LLVMRISCVDesc.dir\RelWithDebInfo\RISCVMCTargetDesc.obj
14: 512496 bytes from LLVMVEDesc.dir\RelWithDebInfo\VEMCTargetDesc.obj
15: 498768 bytes from LLVMX86CodeGen.dir\RelWithDebInfo\X86InstructionSelector.obj
16: 483528 bytes from LLVMMipsCodeGen.dir\RelWithDebInfo\MipsInstructionSelector.obj
17: 449472 bytes from LLVMAMDGPUCodeGen.dir\RelWithDebInfo\AMDGPUISelDAGToDAG.obj
18: 444246 bytes from C:\Development\llvm-project\build\tools\clang\lib\Basic\obj.clangBasic.dir\RelWithDebInfo\DiagnosticIDs.obj
19: 371584 bytes from LLVMAArch64CodeGen.dir\RelWithDebInfo\AArch64ISelDAGToDAG.obj
20: 370272 bytes from LLVMNVPTXDesc.dir\RelWithDebInfo\NVPTXMCTargetDesc.obj

This is at least an order of magnitude faster than DIA, even though the example code is completely serial and uses std::vector, std::string, and std::sort, which are used for illustration purposes only.

When reading streams in a concurrent fashion, you will most likely be limited by the speed at which the OS can bring the data into your process.
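To make the concurrency claim concrete, here is a minimal sketch of lock-free parallel processing over immutable data. It does not use RawPDB: `StreamView`, `Checksum`, and `ProcessConcurrently` are hypothetical stand-ins, with a byte sum taking the place of real stream parsing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical read-only view over one stream's bytes. Nothing mutates shared state.
struct StreamView
{
    const uint8_t* data;
    size_t size;
};

// Stand-in for per-stream parsing work: sums the bytes of the view.
inline uint64_t Checksum(const StreamView& view)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < view.size; ++i)
        sum += view.data[i];
    return sum;
}

// Processes every view on its own thread. Each thread writes only to its own
// result slot, so no locks are required.
inline std::vector<uint64_t> ProcessConcurrently(const std::vector<StreamView>& views)
{
    std::vector<uint64_t> results(views.size(), 0);
    std::vector<std::thread> threads;
    threads.reserve(views.size());

    for (size_t i = 0; i < views.size(); ++i)
        threads.emplace_back([&views, &results, i] { results[i] = Checksum(views[i]); });

    for (std::thread& thread : threads)
        thread.join();

    assert(results.size() == views.size());
    return results;
}
```

A real tool would more likely hand the views to an existing job system than spawn one thread per stream. The point is only that read-only inputs and disjoint outputs need no synchronization beyond the final join.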

Running the Lines example on a 1.37 GiB PDB yields the following output:

Opening PDB file C:\pdb-test-files\clang-debug.pdb
Version 20000404, signature 1658696914, age 1, GUID 563dd8f1-f32b-459b-8c2beae0e70bc19b

Running example "Lines"
| Reading image section stream
| ---> done in 0.313ms
| Reading module info stream
| ---> done in 0.403ms
| Reading names stream
| ---> done in 0.126ms
| Storing lines from modules
| ---> done in 306.720ms (1847 elements)
| std::sort sections
| ---> done in 103.090ms (4023680 elements)

Supported streams

RawPDB gives you access to the following PDB stream data:

  • DBI stream data

    • Public symbols
    • Global symbols
    • Modules
    • Module symbols
    • Module lines (C13 line information)
    • Image sections
    • Info stream
      • "/names" stream
    • Section contributions
    • Source files
  • IPI stream data

  • TPI stream data

PDBs linked using /DEBUG:FASTLINK are not supported. These PDBs do not contain much information, since private symbol information is distributed among object files and library files.

Documentation

If you are unfamiliar with the basic structure of a PDB file, the LLVM documentation serves as a good introduction.

Consult the example code to see how to read and parse the PDB streams.

Directory structure

  • bin: contains final binary output files (.exe and .pdb)
  • build: contains Visual Studio 2019 solution and project files
  • lib: contains the RawPDB library output files (.lib and .pdb)
  • src: contains the RawPDB source code, as well as example code
  • temp: contains intermediate build artefacts

Examples

Symbols (ExampleSymbols.cpp)

A basic example that shows how to load symbols from public, global, and module streams.

Contributions (ExampleContributions.cpp)

A basic example that shows how to load contributions, sort them by size, and output the 20 largest ones along with the object file they originated from.
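The sorting step can be sketched independently of the library. The code below assumes the contributions have already been copied into an application-side `Contribution` struct, which is a hypothetical type and not RawPDB's own. It uses `std::partial_sort` rather than the full `std::sort` that appears in the log output above, since only the largest entries are needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical application-side record; RawPDB's own contribution type may differ.
struct Contribution
{
    std::string objectFile;
    uint32_t size;
};

// Returns up to 'count' contributions, largest first. Takes the vector by value
// so the caller's data stays untouched.
inline std::vector<Contribution> LargestContributions(std::vector<Contribution> contributions, size_t count)
{
    count = std::min(count, contributions.size());
    assert(count <= contributions.size());

    const std::ptrdiff_t middle = static_cast<std::ptrdiff_t>(count);
    std::partial_sort(contributions.begin(), contributions.begin() + middle, contributions.end(),
        [](const Contribution& lhs, const Contribution& rhs) { return lhs.size > rhs.size; });

    // Drop everything after the first 'count' sorted entries.
    contributions.erase(contributions.begin() + middle, contributions.end());
    return contributions;
}
```

Calling `LargestContributions(all, 20)` would yield the data behind a "20 largest contributions" listing like the one in the Performance section.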

Function symbols (ExampleFunctionSymbols.cpp)

An example intended for profiler developers that shows how to enumerate all function symbols and retrieve or compute their code size.
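When a symbol carries no explicit size, one common fallback is to sort the functions by address and treat the distance to the next function as an upper bound. The sketch below shows that idea under stated assumptions: `FunctionSymbol` and `ComputeMissingSizes` are hypothetical names, a size of 0 means "unknown", and this is not necessarily how ExampleFunctionSymbols.cpp does it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical application-side record. A size of 0 means "not stored in the PDB".
struct FunctionSymbol
{
    uint32_t rva;
    uint32_t size;
};

// Sorts by RVA, then fills in unknown sizes using the start of the next function.
// 'endRva' bounds the last function (e.g. the end of the code section).
inline void ComputeMissingSizes(std::vector<FunctionSymbol>& functions, uint32_t endRva)
{
    std::sort(functions.begin(), functions.end(),
        [](const FunctionSymbol& lhs, const FunctionSymbol& rhs) { return lhs.rva < rhs.rva; });

    for (size_t i = 0; i < functions.size(); ++i)
    {
        if (functions[i].size != 0)
            continue; // the record already carried its size

        const uint32_t nextRva = (i + 1 < functions.size()) ? functions[i + 1].rva : endRva;
        assert(nextRva >= functions[i].rva);
        functions[i].size = nextRva - functions[i].rva;
    }
}
```

Sizes computed this way can include alignment padding between functions, so they are best treated as estimates.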

Function variables (ExampleFunctionVariables.cpp)

An example intended for debugger developers that shows how to enumerate all function records needed for displaying function variables.

Lines (ExampleLines.cpp)

An example that shows how to load line information for all modules.

Types (ExampleTypes.cpp)

An example that prints all type records.

PDBSize (ExamplePDBSize.cpp)

An example that could serve as a starting point for people wanting to investigate and optimize the size of their PDBs.

Sponsoring or supporting RawPDB

We have chosen a very liberal license to let RawPDB be used in as many scenarios as possible, including commercial applications. If you would like to support its development, consider licensing Live++. Not only do you give something back, but you also get a great productivity enhancement on top!