Remora/Remora.Discord

`DiscordGatewayClient` allocation and method improvements

VelvetToroyashi opened this issue · 8 comments

Description

As it stands right now, the gateway client is not only the single biggest, single-purpose class* in the library, but it is also the top allocator as shown by a profiler. It's unlikely that these allocations pose a memory leak, but it is partially unnecessary when it's possible to reduce these allocations.

Allocations were initially my main goal when it came to sprucing up the client, as described below, but as I continued to look for room for improvement I realized just how much of a mess the client really is (and, I don't fault Jax here, as the client is somewhat non-trivial in its operation).

However, there is a lot of code that could be simplified by being broken out, as I'll describe later in this issue.

* This is excluding ____API classes, which I would personally consider "multi-purpose".

Why This is Needed

As described above, the gateway client is responsible for the top five allocations across the entire span of the application lifetime, even in comparison to EF Core and the available Redis caching, which marshals objects when deserializing.

The screenshot below is the measurement over a six-hour period, and while the overall allocations are rather trivial for most, and for the most part will be GC'd, these can still be potentially reduced by trivial means.

image

Alternatives Considered

Short of rewriting the entire client (e.g. #176), there aren't many alternatives to be had here.

Additional Details

Here is the meat of the document, and I'll point out everything that I think can be reasonably fixed.

Passing around heartbeat_interval

Call this a nitpick, but I feel like this issue is just large enough to be somewhat of an issue; not necessarily for maintainability or functionality, but for ease of writing code. There's also potential defensive copying at play here, though that's a bit of a micro-optimization.

var heartbeatInterval = hello.Data.HeartbeatInterval;


GatewaySenderAsync issues

This is a bundle of issues, but I'll put it under one main section

Heartbeating should be its own loop

As shown below, the gateway prioritizes heartbeats over user-submitted payloads which is fine, but I think this can be slightly improved by adjusting how payload rate-limiting is accounted for.

Instead of having heartbeating and user payloads both be rate-limiting, we could instead reserve a single slot (heartbeating is ~45s, and it's 120 events/min, so there's only ever 1 heartbeat per ratelimit window).

That single slot ensures that heartbeats are always sent (and even if they aren't, this is technically OK because Discord doesn't immediately disconnect you from the socket). The only issue would be that the client automatically disconnects if the expected OP11 (HEARTBEAT_ACK) isn't received in time, but otherwise, this is a fine solution.

As for prioritization, a heartbeatLock field could be introduced; it'd be something along the lines of a SemaphoreSlim or even a ReadWriteLockSlim, and sending payloads waits on hearbeating to complete for consistency.

I'm pretty sure @carlst99 (Falcon) addressed this issue in the common gateway PR, but I'm not entirely certain.

var lastSentHeartbeatTime = ReadTimeAtomic(ref _lastSentHeartbeat);
var now = DateTime.UtcNow;
var safetyMargin = _gatewayOptions.GetTrueHeartbeatSafetyMargin(heartbeatInterval);
var needsHeartbeat = lastSentHeartbeatTime is null ||
now - lastSentHeartbeatTime >= heartbeatInterval - safetyMargin;
Result sendResult;
// Heartbeats are prioritized over user-submitted payloads, and are sent without using the rate limit
// policy to avoid timing issues.
if (needsHeartbeat)
{
var lastReceivedEventTime = ReadTimeAtomic(ref _lastReceivedEvent);
var lastHeartbeatAckTime = ReadTimeAtomic(ref _lastReceivedHeartbeatAck);
var isConnectionSilent = lastHeartbeatAckTime < lastSentHeartbeatTime &&
now - lastReceivedEventTime >= heartbeatInterval - safetyMargin;
if (isConnectionSilent)
{
return new GatewayError
(
"The server did not respond in time with a heartbeat acknowledgement or an incoming event.",
true,
false
);
}
// 32-bit reads are atomic, so this is fine
var lastSequenceNumber = _lastSequenceNumber;
var heartbeatPayload = new Payload<IHeartbeat>(new Heartbeat
(
lastSequenceNumber == 0 ? null : lastSequenceNumber
));
sendResult = await _transportService.SendPayloadAsync(heartbeatPayload, disconnectRequested);
WriteTimeAtomic(ref _lastSentHeartbeat, DateTime.UtcNow);

Sending payloads shouldn't rely on sleep time

In addendum to the above, this code only exists because of the above and can be safely fixed with a concrete delay. For reasons I'll get into below, using Task.Delay isn't the best choice for this either.

if (!_payloadsToSend.TryPeek(out var userPayload))
{
// Sleep for a little bit
await Task.Delay(CalculateAllowedSleepTime(heartbeatInterval), disconnectRequested);
continue;
}

var allowedSleepTime = CalculateAllowedSleepTime(heartbeatInterval);
if (rae.RetryAfter >= allowedSleepTime)
{
// Won't have time to send this until we have to heartbeat again... give up for now
await Task.Delay(allowedSleepTime, disconnectRequested);
sendResult = Result.FromSuccess();
break;
}
await Task.Delay(rae.RetryAfter, disconnectRequested);
continue;

Edit: November 29th, 2022:

Another solution would be omitting the delay entirely, and looking into a different solution such as a Channel<T> for receiving payloads. One of Falcon's improvements upon the gateway client in his aforementioned PR was using a channel for this purpose, however.

https://github.com/carlst99/Remora.Discord/blob/798fa73a5c663c747b2fcaa30471b166f8a4757e/Backend/Remora.Discord.Gateway/BaseGatewayClient.cs#L535-L541

If heartbeating was seperated entirely, then instead of using the synchronous, and quick-returning TryRead, we could use WaitToReadAsync.

From there, we'd make a for loop that looks something along the lines of:

var canRead = await _payloadQueue.WaitForReadAsync(ct);

if (!canRead)
{
  return;
}

// The ordering of the loop doesn't matter here; I just think a reverse loop makes more sense for a queue
for (int i = _maxAllotableSlots; i < 0; i--)
{
  if (!_payloadQueue.TryRead(out var payload)
  {
    break;
  }
  
  _transportService.SendPayloadAsync(payload, ct);
}

Stop using Task.Delay (Please)

This isn't a blanket statement; it's a suggestion to use the alternative, given it's appropriate for the context.

With Remora targeting .NET 6/7 now, we're given access to the new APIs introduced in those versions, which allow for simplifying code and reducing allocations with just a few lines of code being changed.

PeriodicTimer is a new timer class introduced in .NET 6, and is an alternative to previous timer APIs which were unideal for one reason or another. PeriodicTimer's WaitForNextTickAsync method returns a ValueTask<bool>, which is already rather ideal.

Its return is indicative of whether it will continue to tick, so much like a ValueTask<Result>, gateway termination can be gracefully handled. Unlike Task.Delay, cancelling the token passed to WaitForNextTickAsync simply returns ValueTask<bool>(false) while the former throws an exception which must be caught (and is typically measurably slower, though async tends to smooth these nuances).

This was what I initially set out to do, because one of, if not the top allocation throughout the lifetime of the gateway is DelayPromise and TimerQueueTimer (Task.Delay allocates every time you call it!). For this reason, it's very unideal to use Task.Delay in a tight/semi-tight loop, especially when its delay is static, as it'll cause lots of avoidable heap allocations.

There's two notable spots where this is an issue.

await Task.Delay(CalculateAllowedSleepTime(heartbeatInterval), disconnectRequested);

await Task.Delay(TimeSpan.FromMilliseconds(100), stopRequested);


Inconsistencies on which token is passed

This edges the line between nitpick and major issue, but it is potential for some absolute headaches for anyone who has to refactor/modify the gateway.

As it stands right now, two tokens are held. stopRequested and _disconnectRequestedSource; the latter of which is actually a CancellationTokenSource, but that's somewhat moot.

The issue is that these two are seemingly arbitrarily passed to methods, the latter of which is also cancelled for various reasons, however it's strewn throughout the entire class. The latter of the two is held as a field, and seems to be intended to be passed to external classes (e.g. the payload transport), but then sometimes it's not, and is for some reason passed as a parameter to methods within the gateway, when those methods can access the same field.

For example:

_sendTask = GatewaySenderAsync(heartbeatInterval, _disconnectRequestedSource.Token);

And then just three lines down:

var connectResult = await AttemptConnectionAsync(stopRequested);

This also technically has the effect of potentially holding the callee captive; while a lot of methods throw when their cancellation token is cancelled, which will bubble up to the callee (RunAsync), if they don't, the method doesn't return until all the work has been completed.

I suggest using a linked cancellation token and consolidating on the usage of the field vs the parameter.


Nitpickings

recommanded_shards is not just a recommendation

My issue with this section, the client treats recommended_shards > set_shards as a warning when the proper log level for this situation is LogLevel.Error (and potentially even returning a preemptive gateway error). The reason being is that if the bot is in > 2500 * shards guilds, connecting to the gateway will ALWAYS return a 4011 (SHARDING_REQUIRED) error. See discord's docs

if (endpointInformation.Shards.IsDefined(out var shards))
{
if (shards > 1 && _gatewayOptions.ShardIdentification?.ShardCount != shards)
{
_log.LogInformation
(
"Discord suggests {Shards} shards for this bot, but your sharding configuration does " +
"not match this. Consider switching to a sharded topology of at least that many shards",
shards
);
}
}

Re-calculating heartbeat windows

This is a small issue; very small even, but it's something I picked up on regardless. Despite this only being done once per (re-)connection, I find it strange that this is recalculated. Then again shifting this up to the constructor could be considered bad practice, even if it's technically static-ish data.

// Figure out how many slots we need to reserve for heartbeats
var rateLimitWindow = TimeSpan.FromSeconds(60);
var heartbeatsPerRateLimitWindow = (int)Math.Ceiling(rateLimitWindow / heartbeatInterval);
int maxEventsPerRateLimitWindow;

newing timeout policies all over the place

I'm not entirely familiar with Polly, so I may be wrong here, but I find it somewhat daft that there are three instances of a timeout being new'd up, all with the same arbitrary(?) time limit, only to be used once. Are Polly timeout policies not thread-safe? Or I suppose not intended to be reused, otherwise, these should be merged into a field, IMO.

var timeoutPolicy = Policy.TimeoutAsync<Result<IPayload>>(TimeSpan.FromSeconds(5));

var timeoutPolicy = Policy.TimeoutAsync<Result<IPayload>>(TimeSpan.FromSeconds(5));

var timeoutPolicy = Policy.TimeoutAsync<Result<IPayload>>(TimeSpan.FromSeconds(5));

There should be a common DisconnectAsync method

As it stands right now, there's no semblance of a disconnection method that handles the termination of the gateway (graceful or otherwise). Instead, inconsistent logic is sprinkled across call sites that varies based on context and can cause bugs.

_disconnectRequestedSource.Cancel();
// The results of the send and receive tasks are discarded here, because the iteration result will
// contain whichever of them failed if any of them did
_ = await _sendTask;
_ = await _receiveTask;
if (_transportService.IsConnected)
{
var disconnectResult = await _transportService.DisconnectAsync(!stopRequested.IsCancellationRequested, stopRequested);
if (!disconnectResult.IsSuccess)
{
// Couldn't disconnect cleanly :(
return disconnectResult;
}
}

if (!stopRequested.IsCancellationRequested)
{
if (!_shouldReconnect)
{
return Result.FromSuccess();
}
_log.LogInformation("Reconnection requested by the gateway; terminating session...");
}
// Terminate the send and receive tasks
_disconnectRequestedSource.Cancel();
// The results of the send and receive tasks are discarded here, because we know that it's going to be a
// cancellation
_ = await _sendTask;
_ = await _receiveTask;
var disconnectResult = await _transportService.DisconnectAsync(!stopRequested.IsCancellationRequested, stopRequested);
if (!disconnectResult.IsSuccess)
{
return disconnectResult;
}
// Set up the state for the new connection
_disconnectRequestedSource.Dispose();
_disconnectRequestedSource = new CancellationTokenSource();
_connectionStatus = GatewayConnectionStatus.Disconnected;

A good writeup.

For the PeriodicTimer it would be important to provide a fallback for TFM netstandard2.1

A good writeup.

For the PeriodicTimer it would be important to provide a fallback for TFM netstandard2.1

Thanks! I was under the impression that Remora would explicitly drop netstandard support in favor of .NET 6/7, since .NET Core 3.1 becomes EOL this year. If not, this would definitely need to be taken into consideration.

Nice writeup!

I'm pretty sure @carlst99 addressed this issue in the common gateway PR, but I'm not entirely certain.

Yep, I added a 'first priority' send queue, which internal connection management packets were placed on. Anything on this queue was dispatched before any other enqueued packets and if I recall correctly, I ensured all priority packets were dispatched before allowing the send loop to check if it should sleep.

I was also toying around with a few allocation improvements in the transport service. I'd managed to cut back on send allocations significantly, not that those are particularly troublesome, and using the RecyclableMemoryStream for receiving data was looking promising.

A good writeup.
For the PeriodicTimer it would be important to provide a fallback for TFM netstandard2.1

Thanks! I was under the impression that Remora would explicitly drop netstandard support in favor of .NET 6/7, since .NET Core 3.1 becomes EOL this year. If not, this would definitely need to be taken into consideration.

NetStandard does not EOL with Core 3.1 cause it is still a thing with NET 7.0:
https://learn.microsoft.com/en-us/dotnet/standard/net-standard?tabs=net-standard-2-1#select-net-standard-version

But in the end it still depends if Remora wants to support all netstandard runtimes even if they are EOL or just those that are not EOL.

NetStandard does not EOL with Core 3.1 cause it is still a thing with NET 7.0:
Yes, netstandard will still live on, but I meant to imply that last I remember (I could be misremembering; I'll look for it on the Discord), Jax wanted to switch away from netstandard all together in favor of .NET 6/7, partially for the benefits of access to more modern APIs. Though, probably best to ask @Nihlus what his goal/decision is in that regard anyway.

I found it. initial question and his initial answer, though everything's kind of up in the air since there hasn't been much further discussion since.

I have no intention of dropping support for .NET Standard - it will remain a runtime-agnostic library target.

#if blocks it is, then.