microsoft/CLRInstrumentationEngine

Raw profiler hook doesn't work with a profiler that adds methods

robertpi opened this issue · 22 comments

I'm testing my Instrumentation Method by running it along side a profiler provided by a popular APM vendor. The program (a web server) fails to start with the following error:

Unhandled exception. System.TypeLoadException: Could not load type '__DDVoidMethodType__' from assembly 'CoreAspnetMvcWebApp, Version=0.2.7.0, Culture=neutral, PublicKeyToken=null' because the method '__DDVoidMethodCall__' has no implementation (no RVA).

I think this is because MicrosoftInstrumentationEngine::CCorProfilerInfoWrapper::SetILFunctionBody fails to find a CMethodInfo for the newly added method. I've tried to add fix this by calling m_pRealCorProfilerInfo->SetILFunctionBody, under the assumption instrumentation methods won't need see the newly added method, but this does not seem to work. Here is the fix I was testing: robertpi@22b96b5#diff-4ddd686c604ac947d1ae58bc2b3cdedd0b15be84d9432579d6b1f6abc1acd4bcR474

However, the IL seems to get corrupted in some way when I do this, because now when I run the program I see one of these two errors:

Unhandled exception. System.InvalidProgramException: Common Language Runtime detected an invalid program.
    at __DDVoidMethodType__.__DDVoidMethodCall__()

or

Unhandled exception. System.BadImageFormatException: Bad IL format.
    at __DDVoidMethodType__.__DDVoidMethodCall__()

So, I think that you are running up against a problem that has been an issue since the beginning of ICorProfiler, and is difficult to work around. Maybe we should supply a better API for it.

The problem is that all functions that are added to a module need to have a valid RVA. You can't set it to null when you create it. That won't work. The CLR will barf.

There is a bit of a chicken/egg problem. You are supposed to get an RVA for a new method using IMethodMalloc::Alloc, which requires that you know the size of the method, which means that you need to have already generated all of the IL for the method. In order to do that, you typically need a method token to associate with the IL.

To get a new method token for the method, you use IMetaDataEmit::DefineMethod. But, it requires an RVA, which can't be null.

Using IMethodMalloc naturally leads to memory leaks because the you can't deallocate the memory for the original function body, and it will likely be replaced later anyway. To avoid this, CLRIE doesn't give you a real copy of IMethodMalloc when you ask for it. Instead, it give you the ability to allocate memory, but it will not have a valid RVA inside the CLR.

So, if you are using IMethodMalloc, that might be the root of the problem here. The IMethodMalloc that is being returned isn't giving a valid RVA, and that could lead to the bad image format or invalid program.

The way we typically get around this is to lie IMetaDataEmit::DefineMethod, and give it an RVA for a random method in the module. Then, when we get the instrumentation callback, we just generate the method body and replace the method. This is fine, because the original RVA will never be used to actually execute any code. The RVA of the method will be replaced after the JIT with the newly allocated memory.

It may be necessary to rethink this design, and actually give clients a real copy of IMethodMalloc when asked for since it can be a source of bugs. It won't paint anyone into any corners because it is still possible to not use IMethodMalloc before calling IMetaDataEmit::DefineMethod.

What you're saying is a bit disappointing to me, I spent time creating an instrumentation method because of the raw profiler hook. It's really interesting to be able to create application instrumentation that runs alongside an existing profiler, that is created by a third party company and that you have no control over. From what you're saying, this doesn't look realistic, a profiler that works correctly independently can become buggy when run under the raw profiler. If you don't control the code base of that profiler, then it's going to be a real headache.

For reasons beyond my control, this isn't particularly an issue for me anymore, but I'm trying to be a good citizen and report this issue for people that might be in the same position in the future :)

So, I've checked the Datadog code base that I've been testing against they do cal IMethodMalloc::Alloc and appear to use it correctly conjunction with the ICorProfilerInfo::SetILFunctionBody. When they call DefineMethod they appear to pass 0 as there RVA:

  hr = metadata_emit->DefineMethod(new_type_def,
                              "__DDVoidMethodCall__"_W.c_str(),
                              mdStatic,
                              initialize_signature,
                              sizeof(initialize_signature),
                              0,
                              0,
                              ret_method_token);

This seems to work just fine when the profiler used independently, but I guess because of the issue you describe above it does not work when used via the raw profiler hook.

I guess this can help us dig into this specific issue, and as Datadog profiler is open source they might accept a PR to help make them more compatible with CLRIE.

This indicates to me that we should be returning a real instance of IMethodMalloc, if we can. That should solve the issue with datadog. I'm a little surprised that DefineMethod is accepting a null rva, and not causing problems, though.

For what it is worth, we really want to make sure that we can be compatible with as many profilers as possible. So, if this is causing trouble with Datadog, it is definitely something we want to address.

Maybe RVA being zero is a red herring, I wasn't sure what RVA was, so I checked the specs:

The location of the method body in the current module, as an address relative to the start of the module file in which it is located (a Relative Virtual Address, or RVA). Or, alternatively, the RVA is encoded as 0 and other metadata is used to tell the infrastructure where the method implementation will be found, including:
o A method implementation to be located by implementation -specific means described outside this Standard.
o Forwarding calls through an imported global static method.

I guess that means the IMethodMalloc::Alloc wrapper is mostly like to be causing the issue. It's difficult to see how I could provide a repo, since DD don't distribute their profiler as nuget package, making adding dependency tricky. I'm happy to test locally if you provide a branch with a fix, not sure I have enough knowledge of CLRIE to make the suggested fix myself.

@wiktork @jander-msft @WilliamXieMSFT I would like at least one other person to chime in on this because I remember having a discussion about IMethodMalloc a couple of years ago, and there were varying opinions.

@robertpi if I pushed a branch with the fix in it, would you be able to build and test that locally, before we merged it?

Sure, no problem at all.

cc @davmason

Some initial notes from our discussions so far:

  • The real IMethodAlloc just performs a simple allocation, so the issue is likely not in the RVA requirements but rather the fact that we deallocate this memory. It's possible that datadog is tracking their rva's and using them later.
  • The biggest issue I can think of when using the real IMethodAlloc is that we will leak this memory in cases where both a raw profiler and an instrumentation method need to alter the IL.

@robertpi As we work through this issue, is the root of your specific problem that you are attempting to generate a new function during a class or module load callback, and the SetILFunctionBody call is failing?

For that specific problem, we might want to actually add a new API for generating new methods. However, I still suspect that IMethodMalloc is an issue.

@delmyers @wiktork the RVA issue shouldn't be causing issues. The runtime profiler tests call IMetadataEmit::DefineMethod with 0 for RVA and they work consistently, and I've seen multiple other profilers (CLRProfiler, DataDog, etc) use a dummy RVA. Either one should work. The only thing I can remember off the top of my head the RVA is used for by the runtime is to look up IL in the assembly, but methods you are adding should always have IL defined with SetILFunctionBody so we should never go down that path. Inlining has a quirk that it will ignore IL set by SetILFunctionBody, so it is possible if you define a method with a 0 RVA and then let it be inlined somewhere you could see issues. I don't necessarily think that's what actually is happening here, just thinking through all the possibilities.

In reasonably recent runtimes on desktop and core IMethodMalloc will just new up some memory and return it, you are free to new it up yourself if you want.

There is a gotcha between ICorProfilerInfo::SetILFunctionBody and ICorProfilerFunctionControl::SetILFunctionBody though. The ReJIT ICorProfilerFunctionControl one will allocate its own buffer and copy your IL, so you can delete the IL buffer afterwards. The old style ICorProfilerInfo one will not and store the pointer passed in. If you are freeing the IL buffer after calling ICorProfilerInfo::SetILFunctionBody that could be the issue.

Thanks for the clarifications David. The changes we are looking at right now:

@delmyers To clarify the issue I'm having:

  • I have created a Instrumentation Method based on the "Nagler", which rewrites certain IL methods to insert instrumentation calls. This appears to work well independently.
  • I have tested my Instrumentation Method with Datadog profiler attached on the raw profiler hook. The application fails to load, with the error mentioned above.
  • From the logs it appears to the Datadog profiler that is crashing the application, but I have confirmed that Datadog profiler works correctly with test application when attached directly (as one would except)

From my interpretation of the logs generated by CLRIE, the calls to SetILFunctionBody from Datadog profiler doesn't return an error code, but the body doesn't get set either. This branch contains some attempts to fix this, but they don't solve the issue: robertpi@22b96b5

My test application crashes before my Instrumentation Method attempts any rewrites, so don't know if that would work in conjunction with Datadog profiler.

One more thought, although it's tricky for me to give you my the code / scripts for how I reproduce this issue, I think it should be simple to create your own:

  • Install Datadog for .NET
  • Create standard ASP.NET Core website
  • Attach CLRIE as a profiler with DD profiler attached to raw hook

I don't think an Instrumentation method would be needed.

@robertpi I was able to repro the issue and I have a tentative fix. Would you be able to try it out? I've tried it with the data dog tracer, but I'm not sure it covers all the scenarios. The fix is at https://github.com/wiktork/CLRInstrumentationEngine/tree/dev/wiktork/customBuffer

Hi @wiktork , I think what you've done helps, but my test web server still crashes, before it starts. However, I do see it now loads some of the DD manged dlls, so it's getting considerable further than it did without this change. I've had a look though the app logs and there's not clear reason why it crashes, I just see DD was in the middle of a rewrite. I don't think my instrumentation method preformed any rewrites, as I don't see any logs from it. There are the logs:

dotnet-tracer-loader-dotnet-19300.log
dotnet-tracer-managed-dotnet-20210224.log
dotnet-tracer-native.log
instrumentation-method-logs.zip

The instrumentation method logs are prefixed "[IM:B98771FE-2159-4143-A96C-95715667AA0D]"

I'm going to strip back what I'm do to try and create a minimal reproduction of this issue, then push that to a new repo.

@robertpi Please let us know if you are still seeing the failure with your instrumentation method + data dog + my current fixes. Thanks!

any update?

I placed the DataDog .NET tracer 1.28.8 in the raw profiler hook on .NET framework and observed a bug similar to what the OP wrote at the top (the RVA issue). The CLRIE version was 1.0.42.
Then I built from source code branch dev/wiktork/customBuffer2 in repo https://github.com/wiktork/CLRInstrumentationEngine
.
The resulting MicrosoftInstrumentationEngine_x64.dll did not cause any problems in my tests.
Datadog events continued to flow correctly to the remote collector for graphing etc, and our onboarded profiler worked correctly too.
Therefore I am requesting @wiktork to apply the fixes from branch customBuffer2 to the main CLRIE repo and cut a tag.
This is a big deal and sooner would be better, thanks.

@sanikolov Appreciate the test confirmation. Given that this change is a bit large and impactful, we'll have to take some time to review it fully. Could you provide us your timeline/when you would need this change? We are planning to guard the changes behind an environment flag to minimize introducing issues and help expedite the release.

see email I just sent.