`DiscordGatewayClient` allocation and method improvements

Description

As it stands right now, the gateway client is not only the single biggest, single-purpose class* in the library, but it is also the top allocator as shown by a profiler. It's unlikely that these allocations pose a memory leak, but it is partially unnecessary when it's possible to reduce these allocations.

Allocations were initially my main goal when it came to sprucing up the client, as described below, but as I continued to look for room for improvement I realized just how much of a mess the client really is (and, I don't fault Jax here, as the client is somewhat non-trivial in its operation).

However, there is a lot of code that could be simplified by being broken out, as I'll describe later in this issue.

* This is excluding ____API classes, which I would personally consider "multi-purpose".

Why This is Needed

As described above, the gateway client is responsible for the top five allocations across the entire span of the application lifetime, even in comparison to EF Core and the available Redis caching, which marshals objects when deserializing.

The screenshot below is the measurement over a six-hour period, and while the overall allocations are rather trivial for most, and for the most part will be GC'd, these can still be potentially reduced by trivial means.

Alternatives Considered

Short of rewriting the entire client (e.g. #176), there aren't many alternatives to be had here.

Additional Details

Here is the meat of the document, and I'll point out everything that I think can be reasonably fixed.

Passing around `heartbeat_interval`

Call this a nitpick, but I feel like this issue is just large enough to be somewhat of an issue; not necessarily for maintainability or functionality, but for ease of writing code. There's also potential defensive copying at play here, though that's a bit of a micro-optimization.

Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs

Line 585 in 7935da1

var heartbeatInterval = hello.Data.HeartbeatInterval;

`GatewaySenderAsync` issues

This is a bundle of issues, but I'll put it under one main section

Heartbeating should be its own loop

As shown below, the gateway prioritizes heartbeats over user-submitted payloads which is fine, but I think this can be slightly improved by adjusting how payload rate-limiting is accounted for.

Instead of having heartbeating and user payloads both be rate-limiting, we could instead reserve a single slot (heartbeating is ~45s, and it's 120 events/min, so there's only ever 1 heartbeat per ratelimit window).

That single slot ensures that heartbeats are always sent (and even if they aren't, this is technically OK because Discord doesn't immediately disconnect you from the socket). The only issue would be that the client automatically disconnects if the expected OP11 (HEARTBEAT_ACK) isn't received in time, but otherwise, this is a fine solution.

As for prioritization, a heartbeatLock field could be introduced; it'd be something along the lines of a SemaphoreSlim or even a ReadWriteLockSlim, and sending payloads waits on hearbeating to complete for consistency.

I'm pretty sure @carlst99 (Falcon) addressed this issue in the common gateway PR, but I'm not entirely certain.

Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs

Lines 924 to 963 in 7935da1

    
           var lastSentHeartbeatTime = ReadTimeAtomic(ref _lastSentHeartbeat); 
        
           var now = DateTime.UtcNow; 
        
           var safetyMargin = _gatewayOptions.GetTrueHeartbeatSafetyMargin(heartbeatInterval); 
        
           var needsHeartbeat = lastSentHeartbeatTime is null || 
        
                                now - lastSentHeartbeatTime >= heartbeatInterval - safetyMargin; 
        
           Result sendResult; 
        
           // Heartbeats are prioritized over user-submitted payloads, and are sent without using the rate limit 
        
           // policy to avoid timing issues. 
        
           if (needsHeartbeat) 
        
           { 
        
               var lastReceivedEventTime = ReadTimeAtomic(ref _lastReceivedEvent); 
        
               var lastHeartbeatAckTime = ReadTimeAtomic(ref _lastReceivedHeartbeatAck); 
        
               var isConnectionSilent = lastHeartbeatAckTime < lastSentHeartbeatTime && 
        
                                        now - lastReceivedEventTime >= heartbeatInterval - safetyMargin; 
        
               if (isConnectionSilent) 
        
               { 
        
                   return new GatewayError 
        
                   ( 
        
                       "The server did not respond in time with a heartbeat acknowledgement or an incoming event.", 
        
                       true, 
        
                       false 
        
                   ); 
        
               } 
        
               // 32-bit reads are atomic, so this is fine 
        
               var lastSequenceNumber = _lastSequenceNumber; 
        
               var heartbeatPayload = new Payload<IHeartbeat>(new Heartbeat 
        
               ( 
        
                   lastSequenceNumber == 0 ? null : lastSequenceNumber 
        
               )); 
        
               sendResult = await _transportService.SendPayloadAsync(heartbeatPayload, disconnectRequested); 
        
               WriteTimeAtomic(ref _lastSentHeartbeat, DateTime.UtcNow);

Sending payloads shouldn't rely on sleep time

In addendum to the above, this code only exists because of the above and can be safely fixed with a concrete delay. For reasons I'll get into below, using Task.Delay isn't the best choice for this either.

Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs

Lines 967 to 972 in 7935da1

    
           if (!_payloadsToSend.TryPeek(out var userPayload)) 
        
           { 
        
               // Sleep for a little bit 
        
               await Task.Delay(CalculateAllowedSleepTime(heartbeatInterval), disconnectRequested); 
        
               continue; 
        
           }

Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs

Lines 990 to 1001 in 7935da1

    
           var allowedSleepTime = CalculateAllowedSleepTime(heartbeatInterval); 
        
           if (rae.RetryAfter >= allowedSleepTime) 
        
           { 
        
               // Won't have time to send this until we have to heartbeat again... give up for now 
        
               await Task.Delay(allowedSleepTime, disconnectRequested); 
        
               sendResult = Result.FromSuccess(); 
        
               break; 
        
           } 
        
           await Task.Delay(rae.RetryAfter, disconnectRequested); 
        
           continue;

Edit: November 29th, 2022:

Another solution would be omitting the delay entirely, and looking into a different solution such as a Channel<T> for receiving payloads. One of Falcon's improvements upon the gateway client in his aforementioned PR was using a channel for this purpose, however.

https://github.com/carlst99/Remora.Discord/blob/798fa73a5c663c747b2fcaa30471b166f8a4757e/Backend/Remora.Discord.Gateway/BaseGatewayClient.cs#L535-L541

If heartbeating was seperated entirely, then instead of using the synchronous, and quick-returning TryRead, we could use WaitToReadAsync.

From there, we'd make a for loop that looks something along the lines of:

var canRead = await _payloadQueue.WaitForReadAsync(ct);

if (!canRead)
{
  return;
}

// The ordering of the loop doesn't matter here; I just think a reverse loop makes more sense for a queue
for (int i = _maxAllotableSlots; i < 0; i--)
{
  if (!_payloadQueue.TryRead(out var payload)
  {
    break;
  }
  
  _transportService.SendPayloadAsync(payload, ct);
}

Stop using `Task.Delay` (Please)

This isn't a blanket statement; it's a suggestion to use the alternative, given it's appropriate for the context.

With Remora targeting .NET 6/7 now, we're given access to the new APIs introduced in those versions, which allow for simplifying code and reducing allocations with just a few lines of code being changed.

PeriodicTimer is a new timer class introduced in .NET 6, and is an alternative to previous timer APIs which were unideal for one reason or another. PeriodicTimer's WaitForNextTickAsync method returns a ValueTask<bool>, which is already rather ideal.

Its return is indicative of whether it will continue to tick, so much like a ValueTask<Result>, gateway termination can be gracefully handled. Unlike Task.Delay, cancelling the token passed to WaitForNextTickAsync simply returns ValueTask<bool>(false) while the former throws an exception which must be caught (and is typically measurably slower, though async tends to smooth these nuances).

This was what I initially set out to do, because one of, if not the top allocation throughout the lifetime of the gateway is DelayPromise and TimerQueueTimer (Task.Delay allocates every time you call it!). For this reason, it's very unideal to use Task.Delay in a tight/semi-tight loop, especially when its delay is static, as it'll cause lots of avoidable heap allocations.

There's two notable spots where this is an issue.

Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs

Line 970 in 7935da1

    
           await Task.Delay(CalculateAllowedSleepTime(heartbeatInterval), disconnectRequested);

Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs

Line 632 in 7935da1

await Task.Delay(TimeSpan.FromMilliseconds(100), stopRequested);

Inconsistencies on which token is passed

This edges the line between nitpick and major issue, but it is potential for some absolute headaches for anyone who has to refactor/modify the gateway.

As it stands right now, two tokens are held. stopRequested and _disconnectRequestedSource; the latter of which is actually a CancellationTokenSource, but that's somewhat moot.

The issue is that these two are seemingly arbitrarily passed to methods, the latter of which is also cancelled for various reasons, however it's strewn throughout the entire class. The latter of the two is held as a field, and seems to be intended to be passed to external classes (e.g. the payload transport), but then sometimes it's not, and is for some reason passed as a parameter to methods within the gateway, when those methods can access the same field.

For example:

Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs

Line 587 in a974038

    
           _sendTask = GatewaySenderAsync(heartbeatInterval, _disconnectRequestedSource.Token);

And then just three lines down:

Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs

Line 590 in a974038

var connectResult = await AttemptConnectionAsync(stopRequested);

This also technically has the effect of potentially holding the callee captive; while a lot of methods throw when their cancellation token is cancelled, which will bubble up to the callee (RunAsync), if they don't, the method doesn't return until all the work has been completed.

I suggest using a linked cancellation token and consolidating on the usage of the field vs the parameter.

Nitpickings

`recommanded_shards` is not just a recommendation

My issue with this section, the client treats recommended_shards > set_shards as a warning when the proper log level for this situation is LogLevel.Error (and potentially even returning a preemptive gateway error). The reason being is that if the bot is in > 2500 * shards guilds, connecting to the gateway will ALWAYS return a 4011 (SHARDING_REQUIRED) error. See discord's docs

Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs

Lines 462 to 473 in 7935da1

    
           if (endpointInformation.Shards.IsDefined(out var shards)) 
        
           { 
        
               if (shards > 1 && _gatewayOptions.ShardIdentification?.ShardCount != shards) 
        
               { 
        
                   _log.LogInformation 
        
                   ( 
        
                       "Discord suggests {Shards} shards for this bot, but your sharding configuration does " + 
        
                       "not match this. Consider switching to a sharded topology of at least that many shards", 
        
                       shards 
        
                   ); 
        
               } 
        
           }

Re-calculating heartbeat windows

This is a small issue; very small even, but it's something I picked up on regardless. Despite this only being done once per (re-)connection, I find it strange that this is recalculated. Then again shifting this up to the constructor could be considered bad practice, even if it's technically static-ish data.

Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs

Lines 894 to 898 in 7935da1

    
           // Figure out how many slots we need to reserve for heartbeats 
        
           var rateLimitWindow = TimeSpan.FromSeconds(60); 
        
           var heartbeatsPerRateLimitWindow = (int)Math.Ceiling(rateLimitWindow / heartbeatInterval); 
        
           int maxEventsPerRateLimitWindow;

`new`ing timeout policies all over the place

I'm not entirely familiar with Polly, so I may be wrong here, but I find it somewhat daft that there are three instances of a timeout being new'd up, all with the same arbitrary(?) time limit, only to be used once. Are Polly timeout policies not thread-safe? Or I suppose not intended to be reused, otherwise, these should be merged into a field, IMO.

Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs

Line 545 in 7935da1

    
           var timeoutPolicy = Policy.TimeoutAsync<Result<IPayload>>(TimeSpan.FromSeconds(5));

Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs

Line 718 in 7935da1

    
           var timeoutPolicy = Policy.TimeoutAsync<Result<IPayload>>(TimeSpan.FromSeconds(5));

Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs

Line 545 in 7935da1

    
           var timeoutPolicy = Policy.TimeoutAsync<Result<IPayload>>(TimeSpan.FromSeconds(5));

There should be a common `DisconnectAsync` method

As it stands right now, there's no semblance of a disconnection method that handles the termination of the gateway (graceful or otherwise). Instead, inconsistent logic is sprinkled across call sites that varies based on context and can cause bugs.

Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs

Lines 235 to 250 in 7935da1

    
           _disconnectRequestedSource.Cancel(); 
        
           // The results of the send and receive tasks are discarded here, because the iteration result will 
        
           // contain whichever of them failed if any of them did 
        
           _ = await _sendTask; 
        
           _ = await _receiveTask; 
        
           if (_transportService.IsConnected) 
        
           { 
        
               var disconnectResult = await _transportService.DisconnectAsync(!stopRequested.IsCancellationRequested, stopRequested); 
        
               if (!disconnectResult.IsSuccess) 
        
               { 
        
                   // Couldn't disconnect cleanly :( 
        
                   return disconnectResult; 
        
               } 
        
           }

Remora.Discord/Backend/Remora.Discord.Gateway/DiscordGatewayClient.cs

Lines 637 to 664 in 7935da1

    
           if (!stopRequested.IsCancellationRequested) 
        
           { 
        
               if (!_shouldReconnect) 
        
               { 
        
                   return Result.FromSuccess(); 
        
               } 
        
               _log.LogInformation("Reconnection requested by the gateway; terminating session..."); 
        
           } 
        
           // Terminate the send and receive tasks 
        
           _disconnectRequestedSource.Cancel(); 
        
           // The results of the send and receive tasks are discarded here, because we know that it's going to be a 
        
           // cancellation 
        
           _ = await _sendTask; 
        
           _ = await _receiveTask; 
        
           var disconnectResult = await _transportService.DisconnectAsync(!stopRequested.IsCancellationRequested, stopRequested); 
        
           if (!disconnectResult.IsSuccess) 
        
           { 
        
               return disconnectResult; 
        
           } 
        
           // Set up the state for the new connection 
        
           _disconnectRequestedSource.Dispose(); 
        
           _disconnectRequestedSource = new CancellationTokenSource(); 
        
           _connectionStatus = GatewayConnectionStatus.Disconnected;

A good writeup.

For the PeriodicTimer it would be important to provide a fallback for TFM netstandard2.1

A good writeup.

For the PeriodicTimer it would be important to provide a fallback for TFM netstandard2.1

Thanks! I was under the impression that Remora would explicitly drop netstandard support in favor of .NET 6/7, since .NET Core 3.1 becomes EOL this year. If not, this would definitely need to be taken into consideration.

Nice writeup!

I'm pretty sure @carlst99 addressed this issue in the common gateway PR, but I'm not entirely certain.

Yep, I added a 'first priority' send queue, which internal connection management packets were placed on. Anything on this queue was dispatched before any other enqueued packets and if I recall correctly, I ensured all priority packets were dispatched before allowing the send loop to check if it should sleep.

I was also toying around with a few allocation improvements in the transport service. I'd managed to cut back on send allocations significantly, not that those are particularly troublesome, and using the RecyclableMemoryStream for receiving data was looking promising.

A good writeup.
For the PeriodicTimer it would be important to provide a fallback for TFM netstandard2.1

Thanks! I was under the impression that Remora would explicitly drop netstandard support in favor of .NET 6/7, since .NET Core 3.1 becomes EOL this year. If not, this would definitely need to be taken into consideration.

NetStandard does not EOL with Core 3.1 cause it is still a thing with NET 7.0:
https://learn.microsoft.com/en-us/dotnet/standard/net-standard?tabs=net-standard-2-1#select-net-standard-version

But in the end it still depends if Remora wants to support all netstandard runtimes even if they are EOL or just those that are not EOL.

NetStandard does not EOL with Core 3.1 cause it is still a thing with NET 7.0:
Yes, netstandard will still live on, but I meant to imply that last I remember (I could be misremembering; I'll look for it on the Discord), Jax wanted to switch away from netstandard all together in favor of .NET 6/7, partially for the benefits of access to more modern APIs. Though, probably best to ask @Nihlus what his goal/decision is in that regard anyway.

I found it. initial question and his initial answer, though everything's kind of up in the air since there hasn't been much further discussion since.

I have no intention of dropping support for .NET Standard - it will remain a runtime-agnostic library target.

#if blocks it is, then.

	var lastSentHeartbeatTime = ReadTimeAtomic(ref _lastSentHeartbeat);

	var now = DateTime.UtcNow;
	var safetyMargin = _gatewayOptions.GetTrueHeartbeatSafetyMargin(heartbeatInterval);

	var needsHeartbeat = lastSentHeartbeatTime is null \|\|
	now - lastSentHeartbeatTime >= heartbeatInterval - safetyMargin;

	Result sendResult;

	// Heartbeats are prioritized over user-submitted payloads, and are sent without using the rate limit
	// policy to avoid timing issues.
	if (needsHeartbeat)
	{
	var lastReceivedEventTime = ReadTimeAtomic(ref _lastReceivedEvent);
	var lastHeartbeatAckTime = ReadTimeAtomic(ref _lastReceivedHeartbeatAck);

	var isConnectionSilent = lastHeartbeatAckTime < lastSentHeartbeatTime &&
	now - lastReceivedEventTime >= heartbeatInterval - safetyMargin;

	if (isConnectionSilent)
	{
	return new GatewayError
	(
	"The server did not respond in time with a heartbeat acknowledgement or an incoming event.",
	true,
	false
	);
	}

	// 32-bit reads are atomic, so this is fine
	var lastSequenceNumber = _lastSequenceNumber;

	var heartbeatPayload = new Payload<IHeartbeat>(new Heartbeat
	(
	lastSequenceNumber == 0 ? null : lastSequenceNumber
	));

	sendResult = await _transportService.SendPayloadAsync(heartbeatPayload, disconnectRequested);
	WriteTimeAtomic(ref _lastSentHeartbeat, DateTime.UtcNow);

	if (!_payloadsToSend.TryPeek(out var userPayload))
	{
	// Sleep for a little bit
	await Task.Delay(CalculateAllowedSleepTime(heartbeatInterval), disconnectRequested);
	continue;
	}

	var allowedSleepTime = CalculateAllowedSleepTime(heartbeatInterval);
	if (rae.RetryAfter >= allowedSleepTime)
	{
	// Won't have time to send this until we have to heartbeat again... give up for now
	await Task.Delay(allowedSleepTime, disconnectRequested);

	sendResult = Result.FromSuccess();
	break;
	}

	await Task.Delay(rae.RetryAfter, disconnectRequested);
	continue;

	if (endpointInformation.Shards.IsDefined(out var shards))
	{
	if (shards > 1 && _gatewayOptions.ShardIdentification?.ShardCount != shards)
	{
	_log.LogInformation
	(
	"Discord suggests {Shards} shards for this bot, but your sharding configuration does " +
	"not match this. Consider switching to a sharded topology of at least that many shards",
	shards
	);
	}
	}

	// Figure out how many slots we need to reserve for heartbeats
	var rateLimitWindow = TimeSpan.FromSeconds(60);
	var heartbeatsPerRateLimitWindow = (int)Math.Ceiling(rateLimitWindow / heartbeatInterval);

	int maxEventsPerRateLimitWindow;

	_disconnectRequestedSource.Cancel();

	// The results of the send and receive tasks are discarded here, because the iteration result will
	// contain whichever of them failed if any of them did
	_ = await _sendTask;
	_ = await _receiveTask;

	if (_transportService.IsConnected)
	{
	var disconnectResult = await _transportService.DisconnectAsync(!stopRequested.IsCancellationRequested, stopRequested);
	if (!disconnectResult.IsSuccess)
	{
	// Couldn't disconnect cleanly :(
	return disconnectResult;
	}
	}

	if (!stopRequested.IsCancellationRequested)
	{
	if (!_shouldReconnect)
	{
	return Result.FromSuccess();
	}

	_log.LogInformation("Reconnection requested by the gateway; terminating session...");
	}

	// Terminate the send and receive tasks
	_disconnectRequestedSource.Cancel();

	// The results of the send and receive tasks are discarded here, because we know that it's going to be a
	// cancellation
	_ = await _sendTask;
	_ = await _receiveTask;

	var disconnectResult = await _transportService.DisconnectAsync(!stopRequested.IsCancellationRequested, stopRequested);
	if (!disconnectResult.IsSuccess)
	{
	return disconnectResult;
	}

	// Set up the state for the new connection
	_disconnectRequestedSource.Dispose();
	_disconnectRequestedSource = new CancellationTokenSource();
	_connectionStatus = GatewayConnectionStatus.Disconnected;

Description

Why This is Needed

Alternatives Considered

Additional Details

Passing around heartbeat_interval

GatewaySenderAsync issues

This is a bundle of issues, but I'll put it under one main section

Heartbeating should be its own loop

Sending payloads shouldn't rely on sleep time

Stop using Task.Delay (Please)

Inconsistencies on which token is passed

Nitpickings

recommanded_shards is not just a recommendation

Re-calculating heartbeat windows

newing timeout policies all over the place

There should be a common DisconnectAsync method

Passing around `heartbeat_interval`

`GatewaySenderAsync` issues

Stop using `Task.Delay` (Please)

`recommanded_shards` is not just a recommendation

`new`ing timeout policies all over the place

There should be a common `DisconnectAsync` method