apache/openwhisk-runtime-dotnet

Raw requests to skip JObject conversion

kamyker opened this issue · 12 comments

Another thing that could be added to speed up processing time for large requests.

Currently there is:

string body = await new StreamReader(httpContext.Request.Body).ReadToEndAsync();
JObject inputObject = string.IsNullOrEmpty(body) ? null : JObject.Parse(body);

It's fine for small requests, but for big ones it allocates a lot of memory and increases the time to complete any function. While that code itself could be improved by reading from the Stream directly with the current Json.NET (see "Optimize Memory Usage" at https://www.newtonsoft.com/json/help/html/Performance.htm), it would be great to skip this process completely: pass httpContext.Request (instead of a JObject) and let the user do whatever they want with it. In addition to that, instead of returning a JObject that's converted to a string

  await httpContext.Response.WriteResponse(200, output.ToString());
  (...)
  await response.WriteAsync(content);

the function could simply return a string (or a Stream, as in AWS Lambda, to be assigned to Response.Body).

So basically a high-performance mode; in the end the function definition would be:
public async Task<string> Main(HttpRequest request)
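
For illustration, a minimal sketch of what such a handler could look like, assuming the runtime hands over ASP.NET Core's HttpRequest and writes the returned string straight to the response (this is the proposed shape, not the current API):

using System.IO;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

public class MyAction
{
    // Proposed "high performance mode" handler: raw request in, response body out.
    public async Task<string> Main(HttpRequest request)
    {
        // Read the body once; no intermediate JObject is ever allocated.
        using (var reader = new StreamReader(request.Body))
        {
            string body = await reader.ReadToEndAsync();
            // ...run a custom (faster) deserializer over `body` here...
            return body; // echoed back as the HTTP response body
        }
    }
}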

Why is it important? I've just noticed that in Azure Functions it takes almost 2500 ms simply to receive 1 MB of JSON. No idea what their SDK is doing exactly, but that amount of overhead seems weird to me. They also use Json.NET.

With this feature I could use, for example, https://github.com/neuecc/Utf8Json to make part of my function in OpenWhisk run ~5 times faster and allocate ~20 times less memory :).
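
For a sense of what that would look like, a sketch of the Utf8Json path against a raw body stream (the Payload type is made up for the example):

using System.IO;
using Utf8Json;

// Hypothetical payload type for the example.
public class Payload
{
    public string Name { get; set; }
    public int[] Values { get; set; }
}

public static class FastJson
{
    // Deserialize straight from the request body stream: no string, no JObject.
    public static Payload Read(Stream body) => JsonSerializer.Deserialize<Payload>(body);

    // Serialize straight into the response body stream.
    public static void Write(Stream responseBody, Payload p) => JsonSerializer.Serialize(responseBody, p);
}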

Edit
Did a few tests:

Duration of an empty function receiving a 1 MB request:
Azure (C# - most recent production version): 2400ms
Google/Firebase (javascript): 90-170ms
IBM/OpenWhisk (C# .NET Core 2.2): ~100ms

Seeing this, I will definitely use IBM/OpenWhisk instead of Azure.

In general, the downside of this approach is that it breaks the OpenWhisk JSON-in/JSON-out abstraction. Worse, directly exposing the HTTP request object, while performant, breaks the "serverless" abstraction. It would be better to pass in a streamable buffer for I/O. We have moved toward something similar with a piped approach for function I/O.

You can certainly do this with a custom dotnet runtime. I think this would be worthwhile to also raise on the Apache OpenWhisk dev list.

I guess the full HttpRequest is not needed, just its body: HttpRequest.InputStream
public async Task<string> Main(Stream request)

Indeed, that would be cleaner. It would be nicer if the output was also a stream, making the signature symmetric.

Then it's the same definition as AWS has:
public async Task<Stream> Main(Stream stream)

That's probably even faster than a string, as the Stream can be assigned directly to HttpResponse.Body.
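
A sketch of that symmetric shape, assuming the runtime copies the returned stream into httpContext.Response.Body (hypothetical signature, mirroring AWS Lambda):

using System.IO;
using System.Threading.Tasks;

public class MyAction
{
    // Hypothetical symmetric handler: raw request body in, response body out.
    public async Task<Stream> Main(Stream input)
    {
        var output = new MemoryStream();
        await input.CopyToAsync(output); // ...replace with real processing...
        output.Position = 0;             // rewind so the runtime can copy it to the response
        return output;
    }
}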

I'll try to take a look at speeding up the JSON aspect of things. I have been thinking about adding some sugar to the codebase that would allow an object type to be defined (rather than JObject): we would take the incoming JSON and automatically deserialize it to that object, and the same on the output -- serialize the object to JSON and return it to the calling method. This would allow you to do something like async Task Handler(AddPerson person). Do you think that would be beneficial?
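
A rough sketch of what that sugar could look like with Json.NET, where AddPerson and the helper names are illustrative rather than a committed API:

using System.IO;
using Newtonsoft.Json;

// Illustrative user-defined input/output types.
public class AddPerson { public string Name { get; set; } public int Age { get; set; } }
public class AddPersonResult { public bool Ok { get; set; } }

public static class TypedIo
{
    private static readonly JsonSerializer Serializer = JsonSerializer.CreateDefault();

    // Runtime side: deserialize the request body straight into the handler's parameter type...
    public static AddPerson ReadInput(Stream body)
    {
        using (var sr = new StreamReader(body))
        using (var jr = new JsonTextReader(sr))
            return Serializer.Deserialize<AddPerson>(jr);
    }

    // ...and serialize the handler's return value straight into the response body.
    public static void WriteOutput(Stream responseBody, AddPersonResult result)
    {
        using (var sw = new StreamWriter(responseBody))
        using (var jw = new JsonTextWriter(sw))
            Serializer.Serialize(jw, result);
    }
}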

I'll have to look at the Stream side of things a bit more -- I don't think we would want the full raw request stream going to the client. I don't think AWS' implementation for supporting Stream provides the full raw request, but rather a Stream based on specific events (S3, etc.) and specific to the objects in those requests.

Not sure exactly how AWS does it (from what I read, they expose a Stream of their APIGatewayProxyRequest.cs). I tried Azure and they expose the full HttpRequest (docs).

I dislike automatic object conversion, as it will use a predefined serializer; I'd rather have the raw data and do custom (and faster) deserialization. Using other providers that do this, I also found it annoying and confusing when exceptions happened outside my code, and sometimes I was unable to debug without deploying the whole runtime solution locally.

I don't think we would want the full raw request stream going to the client.
That's exactly what I'd like, as it's the simplest and fastest(?) solution: the raw body of the request stream.

How important are these lines?

string body = await new StreamReader(httpContext.Request.Body).ReadToEndAsync();
JObject inputObject = string.IsNullOrEmpty(body) ? null : JObject.Parse(body);
JObject valObject = null;
if (inputObject != null)
{
    valObject = inputObject["value"] as JObject;
    foreach (JToken token in inputObject.Children())
    {
        try
        {
            if (token.Path.Equals("value", StringComparison.InvariantCultureIgnoreCase))
                continue;
            string envKey = $"__OW_{token.Path.ToUpperInvariant()}";
            string envVal = token.First.ToString();
            Environment.SetEnvironmentVariable(envKey, envVal);
            //Console.WriteLine($"Set environment variable \"{envKey}\" to \"{envVal}\".");
        }
        catch (Exception)
        {
            await Console.Error.WriteLineAsync(
                $"Unable to set environment variable for the \"{token.Path}\" token.");
        }
    }
}

I thought that these

string body = await new StreamReader(httpContext.Request.Body).ReadToEndAsync(); 
JObject inputObject = string.IsNullOrEmpty(body) ? null : JObject.Parse(body); 

should be replaced with

JObject inputObject = await JObject.LoadAsync(new JsonTextReader(new StreamReader(httpContext.Request.Body)));

but after testing, it seems that JObject.LoadAsync is 2-3 times slower. JObject.Load is a few percent faster, but it is not async at all, so it's a worse solution.
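
For reference, a minimal Stopwatch check of the kind used here, assuming a local payload.json test file (single runs are noisy, hence the BenchmarkDotNet numbers further down):

using System;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class LoadVsLoadAsync
{
    public static async Task Main()
    {
        byte[] payload = File.ReadAllBytes("payload.json"); // assumed ~1 MB test file

        var sw = Stopwatch.StartNew();
        using (var r = new JsonTextReader(new StreamReader(new MemoryStream(payload))))
            JObject.Load(r);
        Console.WriteLine($"JObject.Load:      {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        using (var r = new JsonTextReader(new StreamReader(new MemoryStream(payload))))
            await JObject.LoadAsync(r);
        Console.WriteLine($"JObject.LoadAsync: {sw.ElapsedMilliseconds} ms");
    }
}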

Did a few more tests, and deserializing into a defined object is faster than JObject, so @shawnallen85's idea will be faster than the current approach. The code below is 2-4 times faster:

// Type functionInputType defined somewhere (e.g. reflected from the user's Main signature);
// `stream` is the request body, `body` receives the deserialized input.
string currentProperty = null;
object body = null;

using (StreamReader sr = new StreamReader(stream))
using (JsonReader r = new JsonTextReader(sr))
{
    while (r.Read())
    {
        // Only look at top-level property names.
        if (r.Value != null && r.Depth == 1 && r.TokenType == JsonToken.PropertyName)
        {
            currentProperty = r.Value.ToString();
            if (currentProperty == "value")
            {
                // Deserialize the "value" property directly into the function's input type
                // via JsonSerializer.Deserialize<T>(JsonReader), closed over functionInputType.
                r.Read();
                body = typeof(JsonSerializer).GetMethods().FirstOrDefault(FindDeserializerMethod)
                    ?.MakeGenericMethod(functionInputType).Invoke(new JsonSerializer(), new object[] { r });

                bool FindDeserializerMethod(MethodInfo m)
                {
                    var parameters = m.GetParameters();
                    return m.Name.Equals("Deserialize", StringComparison.OrdinalIgnoreCase)
                            && m.IsGenericMethod
                            && parameters.Length == 1
                            && parameters[0].ParameterType == typeof(JsonReader);
                }
            }
            else
            {
                // Other top-level scalars become __OW_* environment variables.
                r.Read();
                if (r.TokenType == JsonToken.String) // other types here also? numbers?
                {
                    string envKey = $"__OW_{currentProperty.ToUpperInvariant()}";
                    string envVal = r.Value != null ? r.Value.ToString() : "";
                    Environment.SetEnvironmentVariable(envKey, envVal);
                }
                else
                {
                    r.Skip();
                }
            }
        }
    }
}

What's nice is that JsonSerializer.Deserialize<T>() also works with JObject, so it's compatible with old functions.
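
For instance, the backward-compatible path could look like this (the helper is illustrative):

using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

// Backward compatibility: the same generic call can still produce a JObject,
// so existing JObject-based functions keep working.
public static class Compat
{
    public static JObject ReadAsJObject(Stream body)
    {
        using (var sr = new StreamReader(body))
        using (var jr = new JsonTextReader(sr))
            return new JsonSerializer().Deserialize<JObject>(jr);
    }
}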

@kamyker What method(s) did you use for testing the timing(s) and performance of your suggestions? I would like to reproduce them on my side. Thanks!!

@shawnallen85 here are some of the files I used; SpeedTest1.cs has all the functions.
It wasn't a good idea to upload a 1 MB JSON file to a gist, so the zip download may be better:
gist or zip

Tried BenchmarkDotNet instead of Stopwatch. The current implementation is the slowest, and what's more important, it will also slow down any JSON usage, since the user has to either navigate the slow JObject or convert the JObject to a custom object (a double conversion):

|                   Method |      Mean |     Error |    StdDev | Ratio |     Gen 0 |     Gen 1 |    Gen 2 | Allocated |
|------------------------- |----------:|----------:|----------:|------:|----------:|----------:|---------:|----------:|
|          TestJObjectLoad |  9.676 ms | 0.0522 ms | 0.0489 ms |  1.00 | 1375.0000 |  781.2500 | 390.6250 |   9.65 MB |
|         TestJObjectParse | 13.806 ms | 0.0793 ms | 0.0741 ms |  1.43 | 1687.5000 | 1203.1250 | 546.8750 |  11.54 MB |
|     TestDictionaryObject | 10.356 ms | 0.0482 ms | 0.0427 ms |  1.07 | 1656.2500 |  984.3750 | 656.2500 |  10.78 MB |
| TestManualToCustomObject |  6.309 ms | 0.0411 ms | 0.0384 ms |  0.65 |  726.5625 |  687.5000 | 500.0000 |   3.72 MB |

~500 KB JSON on an i9-9900K. 6 ms faster is not much, but 7-8 MB less memory looks good.
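
For reproducibility, a minimal BenchmarkDotNet harness of the shape that produces tables like the one above; the method bodies and payload path are assumptions based on the discussion, not the exact gist code:

using System.IO;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

[MemoryDiagnoser] // produces the Gen 0/1/2 and Allocated columns
public class JsonBenchmarks
{
    private byte[] _payload;

    [GlobalSetup]
    public void Setup() => _payload = File.ReadAllBytes("payload.json"); // assumed ~500 KB test file

    [Benchmark(Baseline = true)]
    public JObject TestJObjectLoad()
    {
        using (var r = new JsonTextReader(new StreamReader(new MemoryStream(_payload))))
            return JObject.Load(r);
    }

    [Benchmark]
    public JObject TestJObjectParse()
    {
        using (var sr = new StreamReader(new MemoryStream(_payload)))
            return JObject.Parse(sr.ReadToEnd());
    }
}

public static class Program
{
    public static void Main(string[] args) => BenchmarkRunner.Run<JsonBenchmarks>();
}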

Also tried Utf8Json:

|                   Method |     Mean |     Error |    StdDev |    Gen 0 |    Gen 1 |    Gen 2 | Allocated |
|------------------------- |---------:|----------:|----------:|---------:|---------:|---------:|----------:|
| TestManualToCustomObject | 6.318 ms | 0.0305 ms | 0.0270 ms | 726.5625 | 718.7500 | 500.0000 |   3.72 MB |
|                 TestUTF8 | 2.579 ms | 0.0230 ms | 0.0216 ms | 710.9375 | 707.0313 | 500.0000 |   4.26 MB |

In general, the downside of this approach is that it breaks the OpenWhisk JSON-in/JSON-out abstraction.

I've noticed that in the IBM console there's "Raw HTTP handling - When enabled your Action receives requests in plain text instead of a JSON body".

I guess it doesn't do anything in the .NET runtime currently, right? That's kind of what I'd like to see implemented.

https://github.com/apache/openwhisk/blob/master/docs/webactions.md#raw-http-handling
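
Per those docs, a raw web action receives the unparsed body and query string as __ow_body and __ow_query fields of its argument object, so the function can run its own parser over the body, although the runtime still parses the outer JSON envelope into a JObject first, which is the very overhead discussed above. A sketch of how a .NET action could consume it:

using Newtonsoft.Json.Linq;

public class RawAction
{
    // With "Raw HTTP handling" enabled, the body arrives unparsed in __ow_body
    // (base64-encoded for binary content) and the query string in __ow_query.
    public JObject Main(JObject args)
    {
        string rawBody = (string)args["__ow_body"];  // unparsed request body
        string query   = (string)args["__ow_query"]; // unparsed query string
        // ...apply a custom (faster) deserializer to rawBody here...
        return new JObject { ["bodyLength"] = rawBody?.Length ?? 0 };
    }
}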

Closing, as I've implemented this in my fork.