dotnet/diagnostics

Timeout while executing StartEventPipeSession

AT-WH opened this issue · 3 comments

Description

In my code I follow the instructions from the docs and create a session using this snippet:

    var providers = new List<EventPipeProvider>()
    {
        new EventPipeProvider(
            "System.Runtime",
            EventLevel.Verbose,
            (long)ClrTraceEventParser.Keywords.None,
            new Dictionary<string, string>() {
                { "EventCounterIntervalSec", "1" }
            }
        )
    };
    var client = new DiagnosticsClient(processId);
    using(var session = client.StartEventPipeSession(providers))
    {
      // ... more code
    }

In most of the cases this works, but sometimes it fails to connect to a process running on my machine (note: I verified that processId is always correct). The error message is:

EXCEPTION: System.TimeoutException: The operation has timed out.
   at System.IO.Pipes.NamedPipeClientStream.ConnectInternal(Int32 timeout, CancellationToken cancellationToken, Int32 startTime)
   at Microsoft.Diagnostics.NETCore.Client.IpcEndpointHelper.Connect(IpcEndpointConfig config, TimeSpan timeout)
   at Microsoft.Diagnostics.NETCore.Client.PidIpcEndpoint.Connect(TimeSpan timeout)
   at Microsoft.Diagnostics.NETCore.Client.IpcClient.SendMessageGetContinuation(IpcEndpoint endpoint, IpcMessage message)
   at Microsoft.Diagnostics.NETCore.Client.EventPipeSession.Start(IpcEndpoint endpoint, EventPipeSessionConfiguration config)
   at Microsoft.Diagnostics.NETCore.Client.DiagnosticsClient.StartEventPipeSession(IEnumerable`1 providers, Boolean requestRundown, Int32 circularBufferMB)

This seems to be a random issue, so I can't provide a code to reliably reproduce it. I noticed that it occurs while my CPU is heavily loaded. In my environment I try spawning ~20 .NET processes that bring CPU utilization to 90-99%.

Then sometimes I see that a call to StartEventPipeSession freezes and times out after a while.

I've also tried calling the function once again after the timeout, but the result is the same.

It looks like as if the process I want to connect to failed to initialize the Event Counters for whatever reason. Is it possible?

Configuration

> dotnet --info

.NET SDK:
 Version:           8.0.101
 Commit:            6eceda187b
 Workload version:  8.0.100-manifests.9297b15b

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.19045
 OS Platform: Windows
 RID:         win-x64
 Base Path:   C:\Program Files\dotnet\sdk\8.0.101\

.NET workloads installed:
 Workload version: 8.0.100-manifests.9297b15b
There are no installed workloads to display.

Host:
  Version:      8.0.1
  Architecture: x64
  Commit:       bf5e279d92

.NET SDKs installed:
  6.0.201 [C:\Program Files\dotnet\sdk]
  7.0.101 [C:\Program Files\dotnet\sdk]
  7.0.404 [C:\Program Files\dotnet\sdk]
  8.0.100 [C:\Program Files\dotnet\sdk]
  8.0.101 [C:\Program Files\dotnet\sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.App 5.0.17 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 6.0.3 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 6.0.26 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 7.0.1 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 7.0.14 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 7.0.15 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 8.0.0 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 8.0.1 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 5.0.17 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 6.0.3 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 6.0.26 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 7.0.1 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 7.0.14 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 7.0.15 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 8.0.0 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 8.0.1 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.WindowsDesktop.App 5.0.17 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 6.0.3 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 6.0.26 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 7.0.1 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 7.0.14 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 7.0.15 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 8.0.0 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 8.0.1 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]

Other architectures found:
  x86   [C:\Program Files (x86)\dotnet]
    registered at [HKLM\SOFTWARE\dotnet\Setup\InstalledVersions\x86\InstallLocation]

Environment variables:
  Not set

global.json file:
  Not found

Learn more:
  https://aka.ms/dotnet/info

Download .NET:
  https://aka.ms/dotnet/download

Regression?

I saw the issue from time to time in the past, but now it became more problematic, as my IT system grew and I want to keep an eye on more processes.

Other information

My processes are launched using an .exe file that has the same name for all of the processes (i.e. each process uses a separate copy of the same binaries). Previously I used Performance Counters to track the stats of my processes and I remember that using the same name messes up with Performance Counters. Is it possible that it's causing problems here as well?

@AT-WH thanks for reporting this issue. I believe there are two possible explanations for the issue:

  1. The target process is either hung or stopped (e.g. a debugger is attached in break state), causing the Named Pipe to be non-responsive.
  2. The machine has significant load and so the server does not start start the IPC server in time before the diagnostics client times out.

Do the target processes have the DOTNET_DiagnosticPorts environment variable set? If yes please share the config string.

Also collecting a dump of the target process at the point where we are timing out might shed light on why the diagnostics ipc thread is not responding. This would tell us if something is preventing the diagnostics IPC server from processing messages:
https://github.com/dotnet/runtime/blob/c659c6e511da37385fd0e56941fd84c34ac5171d/src/native/eventpipe/ds-server.c#L115-L116

One possible workaround might be to try modifying the default connect timeout (currently 30 seconds) though I am not confident that the problem could be eliminated using this solution (just the frequency reduced):

internal static readonly TimeSpan ConnectTimeout = TimeSpan.FromSeconds(30);

Hi @tommcdon
Thanks for responding.

I confirm that I don't use DOTNET_DiagnosticPorts.
My machine is under very serious load and it's not uncommon that CPU utilization shows 100%, or something very close to this value.

I wonder if this is possible for a .NET process to 'decide' to skip initialization of the Event Counters in a situation of a very heavy load?

For now I switched to StartEventPipeSessionAsync and use a CancellationToken, which expires after 5 seconds. This won't solve the problem, but at least my system works in a normal way and the only drawback is that it fails to collect stats from the process.

Regarding this part

Also collecting a dump of the target process at the point where we are timing out might shed light on why the diagnostics ipc thread is not responding. This would tell us if something is preventing the diagnostics IPC server from processing messages.

Do you have a quick tutorial on how to do this? Typically I deal with 'a bit' higher level programming than this ;).