uklans/cache-domains

Speed issues with Valorant patching

33Fraise33 opened this issue · 20 comments

Describe the issue you are having

There seems to be an issue with Valorant speeds on lancache. As referenced in issue #164, Valorant caching has been enabled for a while but does not appear to be working effectively.

Describe your setup?

Unbound DNS server towards lancache docker container

Are you running sniproxy

no

DNS Configuration

Visible here: https://github.com/33Fraise33/personal-ansible/tree/main/roles/unbound/tasks

I'm noticing the same issue with LoL. I'm assuming it's a Riot issue, as the launcher decides that there is no connection to the servers and ends the connection (error 004: patching failed). The launcher will then restart the download for a few seconds before erroring out again.

encg commented

Are there still speed issues with Valorant and LoL?

The patching system for Riot requests lots of small byte ranges. This means that your client will behave very strangely through the cache. You will see it do nothing for a while; during this time the cache is fetching all of the 1MB slices that those byte ranges relate to. Then you will see it spike to max speed as it hands back a chunk. In the worst case you have many small byte ranges, all in different slices. It's not something we think we can do anything about, but perhaps @v3n can shed some light?
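
As a concrete illustration of that worst case (the numbers here are invented for illustration only), a byte range barely a kilobyte long that happens to straddle a 1MB slice boundary forces the cache to fetch two full slices from upstream before it can reply:

# integer division maps byte offsets onto 1MiB (1048576-byte) slices
range_start=3145000; range_end=3146000; slice=1048576
echo "first slice: $((range_start / slice)), last slice: $((range_end / slice))"
# prints "first slice: 2, last slice: 3" - roughly 2MB fetched to serve about 1KB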

I've sorted the issues by disabling slicing. I don't recommend this, as performance is worse when caching other games, but as I run a very small LAN on a slow connection, having clients wait for Riot to patch while seeing 0.1kB/s for several minutes was just impossible: they would start messing with their DNS and restarting the client to try to solve the issue themselves.

Interesting, do you have an idea how it impacts other services like Windows updates?
Another option is to set up a different instance for riotgames only.

No obvious downside with Windows Update, but it's hard to benchmark. Battle.net is even worse on first download, but as there is a prefill tool for it, it's much less of an issue. Some Steam games also have this issue (I presume this is because they contain large files), but again the prefill tool negates most of the issue.

I think optimally two instances should be used to give the best performance, but as I'm only on gigabit LAN it works well enough without slicing.

@v3n, do you have any info on this? It has been open for a while. @Sidesharks' workaround does help a bit but is not recommended by the lancache docs.

Is there any update to this case? @v3n are you still able to provide us with insights on this?

Hello,

Is there any hope of having this patched by the dev team, or anything else we can do?

@jblazquez, sorry for the ping, but maybe there is something we can do about this :D!

Hi @IIPoliII, thanks for the ping.

So if I understand the issue correctly, the problem is that the small HTTP range requests that the Riot patcher makes do not work well with lancache?

Unfortunately, as explained in my original post when we switched to the new patcher, this is how downloads work now:

Another thing that's changing is the nature of our HTTP requests. We will make many more individual requests compared to before, and each request will be an HTTP range request, potentially with multiple byte ranges at a time. This will allow us to retrieve just the data that we need from the CDN. Our patcher expects CDNs to honor these range requests and respond with a 206 code, but it can gracefully handle 200 responses as well, albeit with reduced efficiency.

We rely on CDNs (and caches) being able to handle multipart HTTP range requests efficiently, because that is how we retrieve all of the chunks of data that we need (here is an article that I wrote around that time explaining a bit more how the patcher works).
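
For reference, a multi-range request of this kind is easy to reproduce with curl (the URL below is a placeholder and the byte ranges are arbitrary); a server that honors it replies 206 Partial Content with a multipart/byteranges body, while a server that ignores it falls back to 200 OK and the full file:

curl -s -D - -o /dev/null \
  -H "Range: bytes=0-1023,1048576-1049599" \
  https://cdn.example.com/channels/public/bundles/SOME.bundle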

I'm not familiar with how lancache slicing works exactly, but in theory that should be a good approach: retrieving 1MB ranges of data around the requested bytes, then caching them for future requests that hit those same ranges (and eventually caching the full file). I'm pretty sure that's how Akamai's CDN works, for example.

Like with any cache, the first requests for uncached objects will take longer as the cache needs to fill the data from the origin, but after that initial retrieval, other people should be experiencing fast speeds. Is that not the case?

Sorry I can't be of much help. I don't know much about the internals of this caching system.

@Sidesharks, you mentioned a prefill tool for Blizzard games. Can you give me some more details on that? How does the tool work?

Hi Javier, thanks for taking the time to respond on this thread. I've done a good bit of digging into this, and I can fill you in a bit more on why (at least from my understanding) it is happening.

Lancache is currently configured to use the 1MB (source) slice ranges that you suggested, and they work great for most CDNs that use range requests. Any range request will be returned as expected, with a 206 status code and the exact number of bytes requested.

Things don't work quite the same way with multipart range requests. Instead of Nginx properly returning a 206 for the multiple ranges, it returns the entire file with a 200 status code. It doesn't matter how small or large the ranges are; even a few bytes still get the whole file. Here is an example from Nginx's request logs, where you can see that 4 bytes are requested, but the client instead gets a 200 status with 7,281,439 bytes received:
[riot] 192.168.1.55 / - - - [13/Aug/2024:14:28:18 -0400] "GET /channels/public/bundles/95037FF3C4E972FC.bundle HTTP/1.1" 200 7281439 "-" "axios/1.7.2" "HIT" "lol.dyn.riotcdn.net" "bytes=1-2,6-7"

From reading through the Riot client logs, it looks like this causes the client to freak out over getting an unexpected 200 response, which makes it retry until it succeeds with a 206. Unfortunately for the client, it won't ever succeed, because it's never going to get anything other than a 200 back from Nginx. I can confirm this behavior, since I can see errors like this being written endlessly to the client logs:
000009.447| WARN| SDK: patch: Error making 3514938-byte 3-range (0-1200219,1207498-1295712,1300592-3527094) request with ID 40 / 8b2a89c3fe714303-EWR to server 192.168.1.223 from CDN cloudflare via POP EWR at priority 0 over connection 1 for URL http://lol.dyn.riotcdn.net/channels/public/bundles/EAD532B4C83C355A.bundle after 0.001 secs, 0 downloaded bytes, and 3 retries: Server didn't honor range request

The behavior here is most certainly the fault of Nginx, and it can be corrected by disabling slicing altogether. However, for some CDNs like Battle.net that would be extremely undesirable, as they pack their content into 256MB archives on their CDN, and some games like World of Warcraft use a large number of archives, 1,234 at the time of writing. Hitting all of them even for a single byte would use nearly 308GB of cache (1,234 × 256MB ≈ 308GB) just to be able to cache the actual download size of 97GB.

As for where the solution to this lies, I'm not really sure at the moment. However, I hope that with some more info we can work towards one.

@Sidesharks, you mentioned a prefill tool for Blizzard games. Can you give me some more details on that? How does the tool work?

The prefill tool for Blizzard games is battlenet-lancache-prefill, and there are Steam and Epic versions as well. I also have a Riot one that I've been working on, but that's private on GitHub at the moment since it's not complete. These tools are all functionally identical, and if you'd like to look at their documentation I'd recommend the Steam version, since it has the most up-to-date and detailed docs.

Battle.net has a similar issue to Riot, where the initial uncached download via the Battle.net client will be extremely slow, below 10mbit/s at best. Once everything has been cached there is no issue at all; all of the range requests made by the client come back as expected. I've never completely determined why the Battle.net client has this issue, but since there is no way to adjust how the client itself works with Lancache, I decided to take the approach of writing my own client.

BattlenetPrefill is simply a custom client that downloads the appropriate Battle.net manifests, builds out the list of requests that need to be made in order to download from their CDNs, and then downloads them in parallel as quickly as possible. Anything past downloading is skipped, so there is no validation, decompression, or writing to disk. Since all of those extraneous steps are skipped, BattlenetPrefill (as well as the other prefills) can pull from the CDNs faster than the actual client ever could. I've seen it tested as high as 5gbit/s over WAN.
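
As a minimal sketch of that idea (this is not the actual tool; urls.txt is a hypothetical file with one CDN URL per line), priming a cache really only needs parallel fetches whose bodies are thrown away:

# fetch up to 8 URLs at a time, discarding every response body
xargs -P 8 -n 1 curl -s -o /dev/null < urls.txt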

The intended workflow for BattlenetPrefill is to use it to prime your cache ahead of time; with all of the data cached, end users no longer suffer from the download-stalling issue in the actual Battle.net client when installing a game.

@tpill90, thanks for the detailed explanation!

Yes, NGINX does have an issue where it will not honor multipart range requests on cache miss, and that is precisely why we have retries when we receive a 200 OK response unexpectedly. We noticed this issue on Cloudflare during the rollout of the new patcher, and Cloudflare even prepared a patch to NGINX to solve this, but the NGINX developers did not accept the patch. See some of the discussion here and here.

The Riot patcher will retry multipart range requests up to 5 times, and if the server insists on returning a 200 OK even after the fifth retry, we will just accept the full response body and will just read the bytes that we need from it. This can be pretty inefficient, but at least we won't fail the download entirely.

So unfortunately the issue remains in NGINX, and efficient multipart range requests will work only on cache hit. I think prefilling is going to be the solution here.

We can't share our internal prefilling tools, but I think there are publicly available tools you can use quite easily. If you're not familiar with @moonshadow565's tools for processing Riot patcher manifests and bundles, you can find them here: https://github.com/moonshadow565/rman

The rman-bl tool can be used to list all of the bundles referenced by a release manifest. For example, if you download the current release manifest for VALORANT, you can dump the list of bundles like this:

$ rman-bl EB9EF8EA7C032A8B.manifest
/21FF2414C8DC57CD.bundle
/4700D1B0182D4506.bundle
/2DBDB3D4F48F2CCA.bundle
/27A59AACA33496C1.bundle
/822C36DCC6B442B8.bundle
/CB4BE1A8D7216DF0.bundle
/CE9DF8A8C5C66721.bundle
/FBAD531208E752A9.bundle
/EA8F270161EBDF21.bundle
...

And then you can tell your prefilling tool or script to download those bundles from https://valorant.dyn.riotcdn.net/channels/public/bundles/ to fill the cache.
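
A rough sketch of such a script (assuming DNS on the machine running it already resolves the riotcdn.net names to the lancache instance, and using plain http:// so the requests actually pass through the cache rather than going straight to the CDN over TLS) could look like this:

# fetch every bundle listed in the manifest, discarding the bodies; each fetch fills the cache
rman-bl EB9EF8EA7C032A8B.manifest | while read -r bundle; do
  curl -s -o /dev/null "http://valorant.dyn.riotcdn.net/channels/public/bundles${bundle}"
done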

Note that I can't vouch for moonshadow's tools, but I believe they work well, and may be useful for your use case.

Let me know if this helps.

Hey Javier,

Thanks for the comprehensive response. It's been incredibly useful and has confirmed what we had already expected.

The crux is that nginx's inability to deal with multi-range requests does appear to be what's triggering this behaviour, and it makes for a less-than-ideal experience for the end user. We always strive to make Lancache as transparent to the user as possible, so we're going to investigate rolling the patch you linked into the nginx build that Lancache uses by default. I believe this patch is the missing link we needed to support this behaviour, so thanks for pointing it out!

@tpill90 has already done some sterling work on our end, and from preliminary testing it appears that the patched nginx does successfully return a 206 for a multi-range request, and also appears to work properly with the slicing module that we use in Lancache. We have a test candidate LAN party this weekend where this can be tested at a slightly larger scale to see if there are any regressions with any other CDN provider.
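
For anyone wanting to sanity-check a patched build themselves, one quick test (using the host, bundle path, and byte ranges from the log earlier in this thread, and assuming the test machine's DNS points at the cache) is to confirm that a multi-range request now comes back as a 206 with a multipart/byteranges body rather than a 200 with the whole file:

curl -s -D - -o /dev/null \
  -H "Range: bytes=1-2,6-7" \
  http://lol.dyn.riotcdn.net/channels/public/bundles/95037FF3C4E972FC.bundle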

We'll report back and let you know how it goes.

I'm glad that very old patch still works! I'm curious to hear how the event this weekend goes. Please update this thread if you can :)

@jblazquez A quick follow-up question for you. Is there a max size for the bundles? I checked your blog post and couldn't see anywhere that you mentioned a size for the bundles, and from my testing it seems they don't go over 8MB. I just wanted to know if there are any edge cases I haven't seen so far. Thanks again!

Hi all,

We have seen similar issues and are looking for a solution.
Could we get a beta/pre-release version with the provided nginx patch to test as well?

The lancache we have services a large number of clients.

Thank you

@jblazquez A quick follow-up question for you. Is there a max size for the bundles?

Sorry, didn’t get notified about this reply.

Yes, the maximum size of bundles is 75MB, but they rarely go above 16MB or so. They will probably get larger on average in the near future, so don't hardcode an 8MB maximum, but 75MB is the max we support in code.