fnproject/fn

JSON protocol improvements

Closed this issue · 13 comments

Currently a JSON event looks as follows:

{
"call_id":"01C3E63BQH47WG600000000000"
,"content_type":"application/json"
,"body":"\"Foo\""
,"protocol":{"type":"sync"
,"request_url":"http://localhost:8080/r/foo/testfn"
,"headers":{"Accept-Encoding":["gzip"],"Content-Type":["application/json"],"Fn_call_id":["01C3E63BQH47WG600000000000"],"Fn_deadline":["2018-01-09T19:25:07.489Z"],"Fn_method":["POST"],"Fn_request_url":["http://localhost:8080/r/foo/testfn"],"User-Agent":["Go-http-client/1.1"]}
}
}

(re-formatted for readability:)

{
  "call_id": "01C3E63BQH47WG600000000000",
  "content_type": "application/json",
  "body": "\"Foo\"",
  "protocol": {
    "type": "sync",
    "request_url": "http://localhost:8080/r/foo/testfn",
    "headers": {
      "Accept-Encoding": ["gzip"],
      "Content-Type": ["application/json"],
      "Fn_call_id": ["01C3E63BQH47WG600000000000"],
      "Fn_deadline": ["2018-01-09T19:25:07.489Z"],
      "Fn_method": ["POST"],
      "Fn_request_url": ["http://localhost:8080/r/foo/testfn"],
      "User-Agent": ["Go-http-client/1.1"]
    }
  }
}

I think :

  • $.protocol.type should be 'http' (and should maybe be renamed): it should indicate the sort of metadata you are getting in the protocol document; it's unrelated to the fn event type (sync/async).
    I think the assumption is that 'protocol' is metadata specific to the type of event that was received (e.g. an HTTP call), and that this may change/extend in future (e.g. an MQ event, a Kafka event, etc.)
  • There should be no "Fn_" headers in the HTTP headers: these should be a faithful representation of the original input headers. The current behaviour actually makes it harder to identify the correct header set, as I have to strip off the fn_-prefixed headers.
  • Relevant "Fn_" headers should be moved up to the top-level message or into the protocol doc.

e.g.

{
  "call_id": "01C3E63BQH47WG600000000000",
  "content_type": "application/json",
  "body": "....",
  "type": "sync",
  "deadline": "2018-01-09T19:25:07.489Z",
  "protocol": {
    "type": "http",
    "request_url": "http://localhost:8080/r/foo/testfn",
    "method": "GET",
    "headers": {
      "Accept-Encoding": ["gzip"],
      "Content-Type": ["application/json"],
      "User-Agent": ["Go-http-client/1.1"]
    }
  }
}

It would be nice while doing this to move the place where we modify the headers on the request (where the Fn_ headers are added) into the http protocol handler, and not earlier. That way everyone will have the unmodified original request to work with, and the Fn_ headers won't end up in the other protocols.

Thinking about it (and changing my mind from before, sorry @denismakogon), I think changing the JSON writer to use standard Go JSON marshalling for the JSON output (or at least guaranteeing that each message is on exactly one line) would be better/easier inside the FDK.

That would rely on upstream to guarantee that messages don't exceed memory, but I think that's implied for async anyway.

Some languages (well, node.js anyway) don't have a native streaming JSON parser, so this lets an FDK buffer the line before parsing it in one shot.

Python doesn't have a built-in non-blocking JSON parser either, so this is not a problem at all.

What if we just include the whole HTTP request into the payload? Just out of curiosity.

> What if we just include the whole HTTP request into the payload? Just out of curiosity.

Do you mean the Raw HTTP Frame as a JSON string?

If so, dunno, but the current encapsulation seems pretty "as expected", so I wouldn't change it.

Assuming I've understood: isn't one of the reasons for having the JSON format to avoid the complications of reading HTTP frames? Wouldn't that put us back where we started?

hhexo commented

Question: what are the exact semantics of the "deadline" (which is stored as execDeadline in the call struct)?

Is it supposed to be "timeout seconds from the moment the call is created" or "timeout seconds from the moment the call is started"? Because at the moment it is neither of those... it's "timeout seconds from the moment when some code in GetCall runs".

If the semantics were defined in one of the two ways above, the deadline could be derived from existing data and would not need an execDeadline field in the call struct at all. This would simplify things a lot.

> "timeout seconds from the moment the call is started"? Because at the moment it is neither of those... it's "timeout seconds from the moment when some code in GetCall runs".

it is intended to be [for sync] starting the clock as soon as we receive a request to run a function. for async, when whoever picks it up goes to run it. explicitly, it is before 'start', since start takes time. and for async, time since created is just not-quite-right, but for sync this is accurate.

wrt where it comes from, it's easily taken from the context, as well (ctx.Deadline()) and provided through CallInfo.

it's unfortunate that switching protocols will change headers' location, but I think I've beaten the dead horse to death again...

hhexo commented

Hm... but the deadline is only in the context because it was put there by taking it from execDeadline which is the field I'd like to refactor out.

If there is a significant instant of time ("when whoever picks it up goes to run it") then I think it should be reflected in the model. For example, models.Call could have a ProcessedAt field in addition to CreatedAt and friends. This would not only make life easier with deadline but it could give us some interesting data: ProcessedAt - CreatedAt gives us the time the async call waited in the queue.

What do you think?

> Hm... but the deadline is only in the context because it was put there by taking it from execDeadline which is the field I'd like to refactor out.

I see, it's added to the headers in a separate location, so there are two things, I guess. We do still want to impose execDeadline as the context deadline itself, so it may move, but it still needs to be in the same context that it is now so that it encloses submit. My understanding was that we just want to pluck out the place where it's set in the headers and expose execDeadline through CallInfo, since polluting the header bucket in the picturesque JSON format is unsightly.

> ProcessedAt

we have call.started_at and call.completed_at, which encompass this. we do want to obey the timeout on the call for anything required to build the call, but the deadline does not really change due to this. as it is, queued_time = started_at - created_at, so I think this is conceptually the same thing.

to be clear, if we simply pass started_at to the function so it can compute its own timeout as started_at.Add(call_timeout), then this will be different from the actual context timeout that we are imposing on the call.

Does the function need that? Or could it just know that it has 60 seconds to do its thing?

it needs a timestamp of when the party is over, because computing it locally will be different from the timeout we are imposing on it from fn's side of things (from an implementation standpoint, this is really straightforward and should not be a concern).

hhexo commented

> impose execDeadline in the context deadline itself, so it may move but still needs to be in the same context that it is now to enclose submit.

> to be clear, if we simply pass started_at to the function to compute its own timeout based on started_at.Add(call_timeout) then this will be different than the actual context timeout that we are imposing on the call.

IMHO the solution is then to deliberately set the context timeout we impose to StartedAt + Timeout, removing the discrepancy.

That would involve setting the context deadline in the private submit() just after call.Start(), and it makes sense to me because it feels like we shouldn't be setting a deadline if we haven't even started the call.
Or am I missing something? Is the time taken by Fn's operations (mostly getting a slot, but also the database write to mark the call as started) supposed to be counted against the function deadline? Maybe that's why we get all those 503s: we take longer than the function timeout to find a slot.

(this is a bit deceptive for a developer: if I set my function timeout to 1 second, I mean that I want my code to run for at most 1 second; I don't mean that the time taken by Fn to start up my container, which I have no control over, plus running my code is 1 second)

hhexo commented

Since I'm veering off the original purpose of this ticket, I have opened #723 to discuss the deadline semantics.