dotnet/diagnostics

[Test] Microsoft.Diagnostics.Monitoring.EventPipe.UnitTests.EventCounterTriggerWithEventPipePipelineTest fails intermittently

mikem8361 opened this issue · 2 comments

Re-enable this test after it is fixed.

Failure:

Microsoft.Diagnostics.Monitoring.EventPipe.UnitTests.EventCounterTriggerTests.EventCounterTriggerWithEventPipePipelineTest(config: projectk.sdk.prebuilt.6.0.30) 
      Duration: 2 min

      Message: 
System.Threading.Tasks.TaskCanceledException : A task was canceled.

      Stack Trace: 
<<ExecutePipelineWithTracee>b__0>d.MoveNext() line 86
--- End of stack trace from previous location ---
TestRunnerUtilities.ExecuteCollection(Task collectionTask, TestRunner testRunner, CancellationToken token, Func`2 waitForPipeline) line 47
PipelineTestUtilities.ExecutePipelineWithTracee[TPipeline](TPipeline pipeline, Func`3 startPipelineAsync, TestRunner testRunner, CancellationToken token, TaskCompletionSource`1 waitTaskSource) line 93
PipelineTestUtilities.ExecutePipelineWithTracee[T](EventSourcePipeline`1 pipeline, TestRunner testRunner, TaskCompletionSource`1 waitTaskSource) line 45
EventCounterTriggerTests.EventCounterTriggerWithEventPipePipelineTest(TestConfiguration config) line 359
EventCounterTriggerTests.EventCounterTriggerWithEventPipePipelineTest(TestConfiguration config) line 364
EventCounterTriggerTests.EventCounterTriggerWithEventPipePipelineTest(TestConfiguration config) line 365
--- End of stack trace from previous location ---

      Standard Output: 
[7/10/2024 4:25:49 PM] Test runner created
Running Process: C:\ssd\diagnostics\.dotnet-test\dotnet.exe --fx-version 6.0.30 C:\ssd\diagnostics\artifacts\bin\EventPipeTracee\Debug\net6.0\EventPipeTracee.dll 3d3f0f7b-dfab-4f64-91f4-62cc00ab3a2d TriggerRemoteTest SpinWait10
Working Directory: 
Additional Environment Variables: DOTNET_MULTILEVEL_LOOKUP=0, DOTNET_ROOT=C:\ssd\diagnostics\.dotnet-test
{
    [00:00.005] 5584 Successfully started process
    [00:00.005] 5584 WaitForTracee
    00:00.088: 5584 EventPipeTracee: DiagMetrics False
    00:00.088: 5584 EventPipeTracee: start process
    00:00.088: 5584 EventPipeTracee: connecting to pipe
    [00:00.093] 5584 WaitForTracee: DONE
    00:00.090: 5584 EventPipeTracee: connected to pipe
    [00:00.099] 5584 WakeupTracee
    00:00.138: 5584 EventPipeTracee: 7/10/2024 11:25:50 PM Awaiting start
    [00:00.144] 5584 WakeupTracee DONE
    [00:00.144] 5584 WaitForSignal
    00:00.139: 5584 7/10/2024 11:25:50 PM Starting test body '42'
    00:00.147: 5584 EventPipeTracee: signal end of test data
    [00:00.151] 5584 WaitForSignal DONE 31
    00:00.489: 5584 Spin waiting...
    00:00.824: 5584 Spin waiting...
    00:01.157: 5584 Spin waiting...
    00:01.495: 5584 Spin waiting...
    00:01.833: 5584 Spin waiting...
    00:02.173: 5584 Spin waiting...
    00:02.571: 5584 Spin waiting...
    00:03.103: 5584 Spin waiting...
    00:03.562: 5584 Spin waiting...
    00:03.948: 5584 Spin waiting...
    00:04.350: 5584 Spin waiting...
    00:04.759: 5584 Spin waiting...
    00:05.119: 5584 Spin waiting...
    00:05.479: 5584 Spin waiting...
    00:05.823: 5584 Spin waiting...
    00:06.165: 5584 Spin waiting...
    00:06.501: 5584 Spin waiting...
    00:06.836: 5584 Spin waiting...
    00:07.176: 5584 Spin waiting...
    00:07.510: 5584 Spin waiting...
    00:07.848: 5584 Spin waiting...
    00:08.182: 5584 Spin waiting...
    00:08.515: 5584 Spin waiting...
    00:08.848: 5584 Spin waiting...
    00:09.181: 5584 Spin waiting...
    00:09.515: 5584 Spin waiting...
    00:09.848: 5584 Spin waiting...
    00:10.147: 5584 7/10/2024 11:26:00 PM Awaiting end
}
Process 5584 exited -1 (02:00.034 elapsed)

Killing process 5584: 02:00.039 - Process timed out
    [02:00.107] 5584 Did not receive completion signal before cancellation.
    [02:00.107] 5584 WakeupTracee
    [02:00.108] 5584 WakeupTracee DONE
    [02:00.108] 5584 Disposing
    [02:00.108] 5584 WakeupTracee
    [02:00.108] 5584 WakeupTracee DONE
    [02:00.108] 5584 WaitForExitAsync
    [02:00.108] 5584 WaitForExitAsync DONE

This test timeout reproduces fairly consistently (maybe only a flakey success once in a while).


Attempts to find the culprit change

Since this test was noted to have started failing while #4743 was bumping the runtime version, I first started to use a locally built runtime based on the commit hash 8fac5af2b11dc98fa0504f6fd06df790164ec958 that the PR bumped from following the instructions at https://github.com/dotnet/diagnostics/blob/main/documentation/privatebuildtesting.md. As the timeout still reproduced, I additionally built the libs subset and copied the entire runtime\artifacts\bin\testhost\net9.0-windows-Release-x64\shared\Microsoft.NETCore.App\9.0.0\* as well, but the timeout still reproduced.

Moreover, I rebuilt the diagnostics repo after reverting the runtime bump pr #4743, and still the timeout reproduced.

As such, its unclear how far back the timeout began, or when the culprit change really occurred to start causing the test to timeout.


Timeout investigation

Debugging the test timeout itself, it seems to consistently hit an exception at


where the token is cancelled and gets caught at
catch (Exception ex) when (TryFailCompletionSourcesReturnFalse(ex))
. It's not clear to me exactly why the token is being cancelled, but seeing as how the MicrosoftDiagnosticsTracingTraceEventVersion version hasn't changed in a year, it doesn't seem related to that package.

@noahfalk @davmason Do y'all have any suspicions as to how the token might be getting cancelled?
I believe these were the async callstacks from DiagnosticsEventPipeProcessor.Process().
diagnostics_tests_cancelled

I am not super familiar with this part of the code, I don't have an idea for what is causing the token to be cancelled