dotnet/aspnetcore

Remove stateful prerendering

rynowak opened this issue · 15 comments

Summary

Blazor has a feature that I'll refer to as stateful-prerendering where you render a group of components on the server, and then retain that group of components in memory until the client opens a SignalR connection (or it times out).

This currently has some really bad scalability problems, and we think it's not viable for use in production with the current implementation.

However, this is a nicer developer experience than some of the alternatives because a variety of different scenarios just work.

I'm sure that, given lots of time, we could provide a set of complicated knobs and dials that developers could use to try to make this feature viable for their use case and risk profile, but we're out of time to do so. The extra things that stateful-prerendering enables are not on the main path for server-side Blazor, so it should be cut entirely.

Proposed Template Changes

These changes should take place in Preview 8 so that we can gather feedback.

We'll use a different technique to wire up the entry point component in the template.

            app.UseEndpoints(endpoints =>
            {
                endpoints.MapBlazorHub<App>(selector: "app");
                endpoints.MapFallbackToPage("/_Host");
            });
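
For context, here's a rough sketch of the Startup this wiring would typically sit in (assuming the usual server-side Blazor setup; exact APIs may differ between previews):

    public class Startup
    {
        public void ConfigureServices(IServiceCollection services)
        {
            services.AddRazorPages();
            services.AddServerSideBlazor();
        }

        public void Configure(IApplicationBuilder app, IWebHostEnvironment env)
        {
            app.UseStaticFiles();
            app.UseRouting();

            app.UseEndpoints(endpoints =>
            {
                // The hub mapping itself names the root component and its DOM selector,
                // so the host page no longer needs a stateful prerendering call.
                endpoints.MapBlazorHub<App>(selector: "app");
                endpoints.MapFallbackToPage("/_Host");
            });
        }
    }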

We'll still use prerendering in the template, but we'll call out how to remove it.

    <app>
        @* Remove the following line of code to disable prerendering *@
        @(await Html.RenderStaticComponentAsync<App>())
    </app>

See the notes below about RenderStaticComponentAsync.

Proposed Product Changes

These changes can take place as late as Preview 9 - but earlier is always better.

We should remove the stateful-prerendering functionality from the server. We're not recommending it for use in any production scenario, so there's no reason to keep it.

The Microsoft.AspNetCore.Mvc.Components.Prerendering package can be removed.

The RenderStaticComponentAsync methods should be renamed to RenderComponentAsync.
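
Roughly, in the host page the rename would look like this (a sketch; the surrounding markup is the template shape shown above):

    <app>
        @* Before the rename *@
        @(await Html.RenderStaticComponentAsync<App>())

        @* After the rename *@
        @(await Html.RenderComponentAsync<App>())
    </app>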

Various implementation pieces can be cleaned up as desired. We can't predict right now whether we'll bring back the same feature in the same way, so the cleanest thing to do is to remove all of the unneeded functionality.

Note that as a prerequisite to making these changes, we also have to fix #11799, otherwise the template will be broken by default.

Can we put the supporting data that showcases our concerns in this issue for future reference, in case we want to bring the feature back at a later point?

Sure - @pranavkm

Will this affect the way we embed blazor components in MVC/Razor Pages?
I'm building an enterprise web portal with Razor Pages, and for the interactive parts I went with Blazor components embedded in them. Will I be in trouble with this change?

Regards

Will this affect the way we embed blazor components in MVC/Razor Pages?

Yes, it won't be possible to pass parameters into the top level component and have it retain state. So you can definitely host interactive widgets on pages - but it won't be straightforward to pass data in.

Will this affect the way we embed blazor components in MVC/Razor Pages?

Yes, it won't be possible to pass parameters into the top level component and have it retain state. So you can definitely host interactive widgets on pages - but it won't be straightforward to pass data in.

Oh ok, so if I understood correctly, from the docs:

https://docs.microsoft.com/en-us/aspnet/core/blazor/components?view=aspnetcore-3.0#integrate-components-into-razor-pages-and-mvc-apps

@(await Html.RenderComponentAsync<Counter>(new { IncrementAmount = 10 }))

So this won't be possible anymore? Will there be another way to pass that parameter to the component?

Also, is it fine to ask this kind of question here, or am I polluting the thread?
Regards

@(await Html.RenderComponentAsync<Counter>(new { IncrementAmount = 10 }))

Right. This is what I'm saying will be removed. We haven't created another way to pass parameters into components for 3.0.

The problem is that this relies on creating the Counter component while the request is happening, and keeping it around in memory until some time in the future when the browser connects back.

We've been doing measurements of this and it's going to be really hard to run in production without problems - if you have untrusted clients, they can force the server to do a lot of work or keep a lot of things in memory.
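
For illustration only, here's a rough sketch of the kind of workaround an app could use instead: have the component pull its initial state after the circuit connects, rather than receiving it from the host page. CounterSettingsService is a hypothetical app-defined service, not an API we ship, and the lifecycle method name follows the latest bits.

    @* Sketch of a hypothetical workaround: the component loads its own settings
       instead of receiving them as a parameter from the host page. *@
    @inject CounterSettingsService Settings

    <p>Current count: @currentCount</p>
    <button @onclick="IncrementCount">Click me</button>

    @code {
        int currentCount;
        int incrementAmount = 1;

        protected override async Task OnInitializedAsync()
        {
            // Load the value the host page used to pass in as a parameter.
            incrementAmount = await Settings.GetIncrementAmountAsync();
        }

        void IncrementCount() => currentCount += incrementAmount;
    }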

Also, is it fine to ask this kind of question here, or am I polluting the thread?

Perfectly fine. If you want to give us some examples of things you're trying to do we can try to provide more information or ideas.

Here are my notes from when I ran this on the Benchmark server:


  1. I was using our default template as the application + preview7 bits
  2. Clients attempt to DOS the server by trying to exhaust the disconnected circuit buffer. To do this, they do an HTTP GET on the home page, which triggers our pre-rendering
  3. I used the benchmarking infrastructure. It has workers that support performing a constant RPS over vanilla HTTP (a rough sketch of that kind of load is shown after this list)
  4. I used a browser to verify that the app was still functional while being DOSed. It’s not a very scientific way to test, but it was good enough for our needs.
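
For reference, this isn't the real Benchmarks worker, just a minimal sketch of the kind of constant-rate GET load involved (the URL and rate are placeholders):

    // Minimal sketch of a constant-rate GET load generator (not the real Benchmarks worker).
    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    class LoadSketch
    {
        static async Task Main()
        {
            var client = new HttpClient();
            var targetRps = 500;                    // placeholder requests-per-second target
            var url = "http://localhost:5000/";     // placeholder for the app's home page

            while (true)
            {
                for (var i = 0; i < targetRps; i++)
                {
                    // Each GET hits the host page and triggers prerendering; the response body is ignored.
                    _ = client.GetAsync(url);
                }
                await Task.Delay(TimeSpan.FromSeconds(1));
            }
        }
    }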

With the size limit that we have by default, the site's immediately unusable. Reconnects consistently fail and the logs fill with entries like this:

dbug: Microsoft.AspNetCore.Components.Server.Circuits.CircuitRegistry[111]
      Circuit with id CfDJ8F_JmI3ECwpDqUv08g88IdLWucBmdKhe4nUGPqlyXvASEQXyuTM_3KU27VBAdLIwiOGULRO-2asWRPzFvdQDGDU8D-JRj1i7ikj-BPbEXGHhTN0F4v4r2UGctcdkF5QHm0rWiZ4HIT5TMua-i1P7TzC4rZ3ssA9RhTkN4IVGkvm4 evicted due to Capacity.
dbug: Microsoft.AspNetCore.Components.Server.Circuits.CircuitHost[105]
      Closing circuit with id CfDJ8F_JmI3ECwpDqUv08g88IdKW7WSP3Vj9OyXXP0HJctvy2guLEn3bhnFBVIngaxnPtLHA4MKF8SjG1rLnNL7bUCDsP-eaeoVg-8tRrqoh4Ng1s8DD1O-rEhdvcXwCHUTMqI3mI_oyiBlWA5W0TQrB8ZUCEhcWn3yBaAklWdsab5zy.

Ryan suggested removing the size limit entirely (setting it to Int32.MaxValue), and that does make the application way more usable. I played around with different timeouts and RPS values to see what the app looks like.
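
For anyone who wants to reproduce this, the knob involved looks roughly like the following (option names are as of the current bits and may change; raising the limit like this is emphatically not a production recommendation):

    // Sketch: raising the disconnected-circuit limit to effectively "unlimited".
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddRazorPages();
        services.AddServerSideBlazor(options =>
        {
            // Removes the capacity-based evictions seen in the logs above,
            // at the cost of unbounded memory growth under load.
            options.DisconnectedCircuitMaxRetained = int.MaxValue;
            options.DisconnectedCircuitRetentionPeriod = TimeSpan.FromMinutes(3);
        });
    }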

Here’s one set of data points:

    Run         RequestsPerSecond   Max CPU (%)   WorkingSet (MB)   Avg. Latency (ms)
    500 / 30    852                 21            471               6.42
    1500 / 30   14,296              76            3,975             21.13
    3500 / 30   7,961               99            4,624             116.42

At higher RPS, the working set gets really large. To verify that this is primarily coming from stateful pre-rendering, I tried out what it looked like with static pre-rendering:

    Run         RequestsPerSecond   Max CPU (%)   WorkingSet (MB)   Avg. Latency (ms)
    500         902                 17            182               6.73
    1500        2,995               65            193               7.27
    3500        7,744               101           187               96.89

Effectively, for a small amount of work on the client's part (making an HTTP GET request), the server would allocate a ton of memory waiting for the client to reconnect. By spamming the server with GETs you could effectively DOS it, or require it to preserve a really large amount of state to guarantee that reconnects succeed.

A couple of thoughts for the future I guess.

  • Does the memory go back down once you stop sending data to the clients?
  • If we were to avoid having a limit on prerendered circuits (special-casing them by putting them on a separate queue and getting rid of them, let's say, 5-10 seconds after we start sending the response to the client - see the sketch after this list):
    • Does the memory growth slow down significantly?
    • Does the memory usage go back to normal after the 5 to 10 seconds? (Meaning we don't hold on to any references after that.)
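
Purely as a sketch of that idea (nothing like this exists today; all of the types here are hypothetical), the special queue could look something like this:

    // Hypothetical sketch of a time-bounded holding area for prerendered circuits:
    // entries are dropped ~10 seconds after they are added, regardless of capacity.
    using System;
    using System.Collections.Concurrent;
    using System.Threading;

    class PrerenderedCircuitQueue
    {
        private readonly ConcurrentDictionary<string, (object Circuit, DateTime Added)> _entries
            = new ConcurrentDictionary<string, (object Circuit, DateTime Added)>();
        private readonly TimeSpan _ttl = TimeSpan.FromSeconds(10);
        private readonly Timer _sweeper;

        public PrerenderedCircuitQueue()
        {
            // Periodically drop anything older than the TTL.
            _sweeper = new Timer(_ => Sweep(), null, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(1));
        }

        public void Add(string circuitId, object circuit)
            => _entries[circuitId] = (circuit, DateTime.UtcNow);

        public bool TryClaim(string circuitId, out object circuit)
        {
            // The browser connected in time: hand the circuit back and stop tracking it.
            if (_entries.TryRemove(circuitId, out var entry))
            {
                circuit = entry.Circuit;
                return true;
            }
            circuit = null;
            return false;
        }

        private void Sweep()
        {
            var cutoff = DateTime.UtcNow - _ttl;
            foreach (var pair in _entries)
            {
                if (pair.Value.Added < cutoff)
                {
                    _entries.TryRemove(pair.Key, out _);
                }
            }
        }
    }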

While the memory grows here by a lot, I think the scenario is similar to a client dispatching events at a high rate, in the sense that the memory grows and then goes back down after there is no more work happening.

One thing we could consider is to have the feature off by default and provide good documentation about its usage. There are environments, like intranet applications, where this feature is very valuable and this threat (if it is one) is not a problem.

While the memory grows here by a lot, I think the scenario is similar to a client dispatching events at a high rate, in the sense that the memory grows and then goes back down after there is no more work happening.

Interesting. That is what Streambase does: they kick out of the queue all the clients that are too slow to connect to the CEP stream and consume the events. Their "queue" is actually the LMAX Disruptor design pattern. Performance then depends on fitting the ring buffer in the L1 cache. If the ring buffer cannot fit in the L1 cache, it becomes a bit like having an empty beer fridge in the living room and having to go to the kitchen for refreshments during the Big Game.

Is LMAX Disruptor a usable design pattern for mainstream programmers, though?
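
Not the actual Disruptor, but the core idea is just a fixed-size ring: memory stays bounded, and slow consumers lose data instead of growing a queue. A single-threaded sketch, ignoring all of the lock-free coordination the real thing does:

    // Fixed-size ring buffer sketch: writes never block and never grow memory;
    // a reader that falls a full buffer-length behind has simply lost those items.
    class RingBuffer<T>
    {
        private readonly T[] _slots;
        private long _next;   // next sequence number to write

        public RingBuffer(int capacity) => _slots = new T[capacity];

        public long Publish(T item)
        {
            _slots[_next % _slots.Length] = item;   // old items are overwritten once the ring wraps
            return _next++;
        }

        public bool TryRead(long sequence, out T item)
        {
            // Either already overwritten (too old) or not yet published.
            if (sequence < _next - _slots.Length || sequence >= _next)
            {
                item = default(T);
                return false;
            }
            item = _slots[sequence % _slots.Length];
            return true;
        }
    }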

@pranavkm Just a question: The Blazor team is concerned that stateful prerendering will make it too easy to initiate a DOS attack just by issuing a bunch of GET requests. But even after removing stateful prerendering, doesn't that just mean an attacker now needs to issue the GET requests AND open the web socket connections (many thousands of times concurrently)?
Granted, the difference is that the attacker would actually need to hold those connections open. But does this really make a difference to someone wanting to attack a Blazor site? It sure was a nice feature, and it's sad to see it go. But we also understand the time constraints and want to see Blazor go live.

Granted, the difference is that the attacker would actually need to hold those connections open. But does this really make a difference to someone wanting to attack a Blazor site? It sure was a nice feature, and it's sad to see it go. But we also understand the time constraints and want to see Blazor go live.

We generally think about DOS attacks as cases where:

  1. A client can easily force the server to do a lot of work - a small request triggers a lot of processing - the details are in the scale factor between client effort and server effort OR
  2. A client can easily exhaust a fixed pool of resources and damage the availability of the server

The concerns about stateful prerendering are about the second case.

The DOS vector here is that an application using stateful prerendering relies on the same set of component instances being used:

  1. When the initial HTTP request happens
  2. When a websocket connection is opened

This means that if the same literal .NET objects are not still available between these two points, your application won't work correctly.

We currently have a feature to enable this, and it relies on holding a pool of "disconnected" circuits that are either waiting for their owner to connect or to reconnect. We need to keep some number of circuits in the pool (remember, every new user to your app has to do this dance), and we have to keep them for some reasonable amount of time for the browser to connect or reconnect.

The problem with this is that it's an easy resource for a client to exhaust.

  1. Client does a request
  2. Server does prerendering and creates a circuit
  3. Circuit is stored for the client to connect to later
    4a. Client never opens a web socket, eventually the circuit times out and is evicted due to capacity
    4b. Client opens a web socket, hopefully the circuit is still there!

If we use a small pool size (< 1000), it's trivial for a single client to exhaust. Your server doesn't fall over, but non-malicious users have no hope of getting a legit connection. If you don't limit the pool, or you use a large size, then a client can easily fill up your memory.
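
To make that concrete, here's a rough sketch of the kind of bounded pool involved (illustrative only; this is not the real CircuitRegistry, though the real one also caps retained circuits and expires them, as the "evicted due to Capacity" logs above show):

    // Rough sketch of a capacity-limited pool of disconnected circuits
    // (illustrative only; not the real CircuitRegistry implementation).
    using System;
    using Microsoft.Extensions.Caching.Memory;

    class DisconnectedCircuitPool
    {
        private readonly MemoryCache _cache = new MemoryCache(new MemoryCacheOptions
        {
            // With a small limit like this, a burst of cheap GETs fills the pool,
            // and circuits for legitimate users get rejected or evicted.
            SizeLimit = 100
        });

        public void Store(string circuitId, object circuit)
        {
            _cache.Set(circuitId, circuit, new MemoryCacheEntryOptions
            {
                Size = 1,
                // If the browser never connects back, the entry eventually expires.
                SlidingExpiration = TimeSpan.FromMinutes(3)
            });
        }

        public bool TryReconnect(string circuitId, out object circuit)
            => _cache.TryGetValue(circuitId, out circuit);
    }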

Compare this to what happens with non-stateful prerendering:

  1. Client does a request
  2. Server does prerendering and then cleans up all of the state
    3a. Client never opens a web socket, nothing bad happens because the state was already cleaned up
    3b. Client opens a web socket, can always connect because it's stateless

If we wanted to keep it, we could provide a set of knobs and dials to configure these sizes, but there doesn't seem to be a safe and universal default. In the end, it's hard to recommend using the feature unless all of your traffic is internal and trusted - and we try to avoid shipping features we don't recommend.

We already have some ideas about how to deliver the same functionality in the future, but we're not taking it on for this release. If you feel really strongly about having a way to initialize components with state during prerendering, please log an issue and show us a little bit of what you want to use it for. The number and kinds of requests we get will help us understand the demand.

@rynowak This made me think of something. Wouldn't long polling be in the same boat as this? Could I not send you a negotiate/connect/startcircuit triplet to achieve the same result?

The difference is that I send you 3 requests instead of 1, but you don't get to notice that I'm disconnected until either:

  1. I stop sending you ping requests, or
  2. a certain amount of time passes.

But for me as an attacker it should be relatively easy to swamp the server with connections and "keep them alive" using long polling.

Should we also consider disabling long polling?
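
If it ever came to that, the change would be roughly along these lines (a sketch; it assumes the MapBlazorHub overload that takes HttpConnectionDispatcherOptions, with HttpTransportType from Microsoft.AspNetCore.Http.Connections):

    app.UseEndpoints(endpoints =>
    {
        endpoints.MapBlazorHub(options =>
        {
            // Only offer WebSockets; long polling (and SSE) are never negotiated.
            options.Transports = HttpTransportType.WebSockets;
        });
        endpoints.MapFallbackToPage("/_Host");
    });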

Should we also consider disabling long polling?

In that case, please be merciful and leave an option to re-enable it at our own risk.
I'm probably not the only one working at a company with Windows Server 2008 R2, and that Windows version doesn't support WebSockets =(

@rynowak re these ones:

Right. This is what I'm saying will be removed. We haven't created another way to pass parameters into components for 3.0.

We already have some ideas about how to deliver the same functionality in the future, but we're not taking it on for this release. If you feel really strongly about having a way to initialize components with state during prerendering, please log an issue and show us a little bit of what you want to use it for. The number and kinds of requests we get will help us understand the demand.

There's a discussion of this going on here: #13721