nervous-systems/cljs-lambda

Async functions become unreponsive if Lambda global timeout occurs

joekiller opened this issue ยท 35 comments

When invoking an async promise if Lambda kills the function prior to the async promise being realized on subsequent calls the function is hung and doesn't enter the handler until either a new Lambda zip is uploaded or the Lambda function times out on AWS's infrastructure internally.

Using https://github.com/Type-Zero/promise-timeout-hang you can create two functions promise-hang and promise-timeout. When promise-timeout is invoked, the function will print Starting function and then throw a timeout error. Subsequent invocations behave the same.

Invoking promise-hang will print Staring function and then timeout via the Lambda timeout. Subsequent executions will not run the handler and you will not see Starting function. Basically at this point the function is useless.

I'm making this issue here because this is the top level where we encountered it but reading around the internet like http://theburningmonk.com/2016/05/aws-lambda-constant-timeout-when-using-bluebird-promise/ it appears that the problem lies in how bluebird is handling async promises.

Any insight is welcome. I'm not quite up to debugging this all the way down.

moea commented

@joekiller Thanks for reporting this - I haven't run into it, as the functions I have in production are all using the legacy core.async behaviour. The obvious naive, miserable, expedient hack would be (-> ctx cljs-lambda.context/time-remaining (* MAGIC-FACTOR)), and using that as the timeout for the top-level promise in user code.

Obviously that sucks, and given that the stakes are so high, you'd want to have a really conservative MAGIC-FACTOR, like .8 or something. Also, putting it in cljs-lambda would be no fun, b/c as per the link you gave, it will alter the user-facing behaviour of timeouts.

I'm with you that trying to get to the bottom of this some other way seems like the right approach, and I'm motivated to have Bluebird promises work w/ cljs-lambda.

The difficulty in debugging it (locally, ideally) is that I'm not 100% clear on how the Lambda harness terminates an invocation - which is something I'll look into. Given that, it'd probably be trivial to reproduce in a short Javascript file, to take Lambda out of the equation. I'll post any updates here.

moea commented

N.B. from looking through /var/runtime/node_modules/awslambda/index.js (via docker run -t -i --entrypoint sh lambci/lambda), it looks like a native library, liblambdaruntime.so (linked w/ libpthread FWIW) is used to manage the details of invocations, so it's likely they're doing something cute on the edges of the Node process. That's going to make it a little harder to reproduce, but there's only so many different things they could be doing, I guess.

Doh, obviously from you second remark you are using docker-lamda

moea commented

Annoyingly, I can't reproduce this in Javascript, using the code from that blog post. I tried old versions of bluebird, incl. the one used by promesa, as well as invoking via API Gateway, and reducing the memory allocation to the minimum - it logs hello every time.

moea commented

After some headache, I can get it to happen in JS if I set the function timeout to 1 second and include a module-level statement to generate a million random bytes and encode them to hex. Without that statement (or, presumably, something taking an equivalent amount of time, like, say, loading some compiled Clojurescript) I can't trigger it.

Doing it with Clojurescript, I can get it to fail with this:

(def ^:export promise-hang
  (js* "function(input, ctx) { console.log('Starting'); require('bluebird').delay(5000).then(function() {ctx.succeed(nil, \"Hello\") }); }"))

(that's w/ :optimizations :none, with no dependencies except for clojure & clojurescript, w/ bluebird in the npm deps - i.e. it's not an interaction w/ the cljs-lambda runtime library, or promesa, it seems to be a combination of bluebird and the amount of time it takes to load the handler module).

Interesting.

I'm attempting to make a pure JS version like in the article as well. Depending on where this goes we might raise an issue with AWS for assistance but ideally we can solve this without their involvement.

moea commented

In case this is helpful:

handler.js

const P = require('bluebird');

for (var i = 0; i < 100; i++) {
  require('crypto').randomBytes(10000).toString('hex');
}

module.exports.hello = function(event, context, cb) {
  console.log('hello');
  P.delay(3000).then(() => context.succeed()).catch(err => context.fail(err));
}

serverless.yml

service: hello

provider:
  name: aws
  runtime: nodejs4.3

package:
  include:
     - node_modules/**
functions:
  hello:
    handler: handler.hello
    timeout: 1
    memory: 128
    events:
      - http:
          path: ok
          method: get

package.json

{
  "dependencies": {
    "bluebird": "^3.4.7"
  }
}
moea commented

(The above exhibits the behaviour w/ serverless invoke -f hello -l - doesn't need to be used via API Gateway.

In terms of fixing it, I've tried a few things:

  • Promise.setScheduler(f => setTimeout(f, 0))
  • const Promise = require('bluebird').getNewLibraryCopy() (at module level, as well as at the start of the handler)
  • process.on('unhandledRejection', function() {})

This is totally an AWS issue, but yeah, they would take forever.

moea commented

Thanks. Note that in the example I gave above, it should be memorySize, I think, not memory - it will always work w/ Serverless' default memory size of 1024.

moea commented

I'm out of time for today, a couple of notes:

I don't think this is related to Bluebird - I think that, bizarrely, toplevel code just takes longer to run the second time around - so given an example with either a very small timeout or significant module-level work, the invocation is terminated before the function is called.

Take this guy, with no promises:

const init = Date.now();

for (var i = 0; i < 200; i++) {
  require('crypto').randomBytes(10000).toString('hex');
}

module.exports.hello = function(event, context, cb) {
  console.log(Date.now() - init);
  setTimeout(
    function() {
      context.succeed('Done');
    }, 3000);
}

deploying and then running 3 times, I get, 61, 959, and then nothing/timeout.

If I move the first two chunks inside the handler, I get values from 500-timeout (averaging to about the same as the other guy), with no definite jump in the magnitude between the first invocation and the rest. In both cases, it doesn't wedge until you redeploy - if the duration of the work is very close to the timeout value, it'll work sometimes if you repeatedly invoke.

I'm not sure why the elapsed time is 15x smaller for the first invocation only when the work is happening at the toplevel.

moea commented

In terms of why promise-hang behaves differently from promise-timeout - I don't yet have an answer, but so far all of the minimal examples I've tried have had errors reducible to the issue I described above.

The init = Date.now() is outside the handler context, so that is resolved at the initial container start up and the context carries over on subsequent executions (http://blog.rowanudell.com/database-connections-in-lambda/) so that is why the numbers get bigger when it runs again.

I'm still digging into why a subsequent execution doesn't work but it seems like it has something to do with container re-use and thawing.

I side note, I think the library needs to be updated to use the new callback function for lambda conclusion. This may actually fix lingering promises problem. Because using context method calls freezes the execution immediately while using the callback function the docs indicate that other methods are allowed to complete. (Punchline in bold):

When the context methods are called, AWS Lambda freezes the Node process immediately, without waiting for the event loop associated with the process to empty. The process state and any events in the event loop are frozen. When the function is invoked again, if AWS Lambda re-uses the frozen process, the function execution continues with its same global state (for example, events that remained in the event loop will begin to get processed). However, when you use callback, AWS Lambda continues the Lambda function execution until the event loop is empty. After all events in the event loop are processed, AWS Lambda then freezes the Node process, including any state variables in the Lambda function.

I found this after reading up on container reuse in Lambda. I wanted to look more into how our context callbacks were working. I googled lambda context.fail and one of the first hits I found was Using the Earlier Node.js Runtime v0.10.42 - AWS Lambda which mentions in the transitioning to new nodejs runtime that the latest lambda wants us to use the newer callback method instead of using context object calls.

moea commented

@joekiller I'd have noticed if that was what was happening - in none of my invocations did init have the same value as a previous call:

const init = Date.now()
...
module.exports.handler = function(...) {
console.log([init, Date.now() - init]);
...
} 
2017-02-14 18:43:03.583 (+00:00)        6ac274d6-f2e5-11e6-908c-25be2e820445    [1487097783535, 48]
2017-02-14 18:43:06.964 (+00:00)        6c7f4000-f2e5-11e6-a731-ab7d1ae05307    [1487097786362, 602]
2017-02-14 18:43:09.502 (+00:00)        6e12fee6-f2e5-11e6-b6f9-69ccac0aab6d    [1487097788983, 519]
2017-02-14 18:43:11.782 (+00:00)        6f730478-f2e5-11e6-a9bf-11057c99bb17    [1487097791283, 499]
2017-02-14 18:43:14.220 (+00:00)        70d2bb45-f2e5-11e6-8ea9-c1393fa9b2b8    [1487097793584, 636]

If you want to check whether using the callback affects your function, something like

(defn ^:export promise-hang [x y cb]
  (prn "Starting function")
  (-> (p/delay 5000)
       (p/then (fn [] (cb nil "Finished function")))))

I'm not sure that it would - in your example none of the context methods are called, because the function is terminated.

I think you are correct that the example will not use the callback either. The delay will never return if the timeout always kills the process first.

I made the example the way I did because it appeared to make the same behavior each time. In our real code though we expect nearly all code to run faster than the timeout but believe that the legacy way of calling context methods causes lingering promises to go unresolved which makes the code unsreponsive after the initial execution.

moea commented

I can publish a snapshot tomorrow which uses the callbacks, if you want to test with that, but regarding the above, a pure Javascript example which captures the issue you're talking about would be helpful to me.

moea commented

@joekiller another example, in the context of your project:

(defn ^:export promise-hang [input ctx cb]
  (prn "Starting function")
  (-> (js/Promise.
       (fn [resolve]
         (js/setTimeout #(resolve) 3000)))
      (.then (fn [] (cb nil "OK")))))

That's not using bluebird, but rather the native node promise, and none of the cljs-lambda wrappers are in play - this function will be directly invoked by Lambda, with the only other work being the loading of Clojurescript & libraries - and it still hangs with nothing printed - 3 seconds is apparently not long enough to load all of those namespaces after the first invocation.

How about a snapshot that leaves all context or callback executions to the client instead. I think the root problem is having the context calls always happening when the lamda handler finishes. Dropping out of the function without a signal is valid. It does trigger a error on Amazon's side but that is expected.

If the context.[success, done, error] is called then code execution is frozen but outstanding promises may be realized on subsequent executions on the same container. Further because of the single state machine underneath if multiple promises or callbacks stack up their variables to call with are overridden by the last executions context.

Even if the code was changed to resolve the promises prior to exit, ie using the callback, with the default !stop invoking callback or context methods, the interface is taking away that 4th exit option of dropping out of the function with no calls.

Forgive me if I'm misunderstanding the code.

PS the following is the repo I was testing for JavaScript. I think the linked commit best matches what we are trying to test. Also note that it appears that const init = Date.Now() can actually change as it appears that const gets compiled into a mutable reference resolved in the handler. I wrote a file to disk to test this. Also sometimes the const does resolve in the global content sometimes when I tested. I can't really explain it other than some strange state side effect.

https://github.com/Type-Zero/promise-timeout-hang-js/blob/ff27a2305b3bf4c474e4ee9ccd4f72fc3e1a43cf/index.js

moea commented

@joekiller You can do that already, per my most recent example above - you just define a regular clojurescript function w/ :export metadata, and have an entry for it in the cljs-lambda -> functions section of your project file. You get the arguments directly from Lambda, which treats it as a bare Javascript function. If you want the arguments in exactly the same shape as you get them from cljs-lambda, then you can do:

;; this won't pass you the cb argument, but it's also not doing anything complex at all
(def ^:export handler
  (cljs-lambda.util/wrap-lambda-fn (fn [ctx event] ...)))

Neither of those approaches touches the context object for you - deflambda and async-lambda-fn, the thing it uses, are specifically for the case where context management is being delegated to the library.

Okay, so currently we have something like:

(defn handle-this [{:keys [event context] :as request}]...)

(def app
  (-> handle-this
      wrap-that))

(def ^:export myfunction
  (make-lambda app))

Are you saying to instead do the following to prevent touching of the context object?

(def ^:export myfunction
  ((cljs-lambda.util/wrap-lambda-fn app))

If that is true, then that'll probably work for us and I'd be in favor of updating those context calls to callbacks.

moea commented

@joekiller, yeah, basically:

;; wrap-lambda-fn takes a function, returns a function, so you don't need to invoke its
;; return value, only to bind it to a name:
(def ^:export myfunction (cljs-lambda.util/wrap-lambda-fn some-fn-of-two-args))

On the other thing, looking at the AWS code, there is a difference in how the context/callback is handled:

function invoke(handler, event, baseContext, finish) {
    let waitToFinish = true;
    let consumed = false;
    const callback = (err, data) => {
        if (consumed) {
            return;
        }
        consumed = true;
        finish(err, data, waitToFinish);
    };

    const context = Object.assign({
        set callbackWaitsForEmptyEventLoop(value) {
            waitToFinish = value;
        },
        get callbackWaitsForEmptyEventLoop() {
            return waitToFinish;
        },
        done : function(err, data) {
            waitToFinish = false;
            callback(err, data);
        },
        succeed : function(data) {
            checkExpectedArgRange('succeed', arguments, 0, 1);
            context.done(null, data);
        },
        fail : function(err) {
            checkExpectedArgRange('fail', arguments, 0, 1);
            context.done(isUndefinedOrNull(err) ? '__emptyFailParamBackCompat' : err, null);
        }
    }, baseContext);

    handler(event, context, callback);
}

i.e. if you happen to have work in the event loop, by default, the callback will wait for it to complete, or timeout. Calling context.done etc. will not - that is the only difference, but it may be germane to what you're trying to do.

My offer about changing cljs-lambda to implicitly use the callback parameter still stands (and giving it as a third parameter to wrap-lambda-fn inputs which support that arity)- that's about all I can do.

I forgot the actual code that made the lambda.on my.mobile so please excuse the crude screenshot
screenshot_20170214-200524 2

lambda is cljs-lambda.util

I'll try this tomorrow.

Cool. Try updating the code as I think it'll fix the black hole anyway but I think we'll invoke as you suggested after testing.

moea commented

@joekiller If by black hole, you mean any of the (apparently) sticky timeouts demonstrated in the examples posted here or in the linked blog post, I personally don't think so - and I'm easily able to recreate them using the snapshot, or just using Javascript code which invokes the callback. It may fix some other issue you're seeing which isn't captured in the examples, though.

io.nervous/cljs-lambda 0.3.5-SNAPSHOT (i.e. runtime library)

If you are managing completion yourself (by exporting the result of wrap-lambda-fn), there's a :handler-callback key in the context record:

((:handler-callback ctx) (js/Error. "Oops"))
((:handler-callback ctx) nil {:x 1 :y 2}))

If using deflambda or async-lambda-fn, the callback will be used to signal completion. ctx/done! etc. will obviously continue to do the same thing as their js counterparts, if invoked explicitly.

Okay we tested using the callback, the lightweight wrapper and eventually pulled out cljs-lambda all together and it still happened. Eventually we pulled bluebird out and everything seemed to work again. So we have filled petkaantonov/bluebird#1341

moea commented

OK - I hope you get some resolution. Note that while I can reproduce the issue when that example function's deployed with a memory size of 128mb and timeout of 1 second, I can also reproduce it if I change the first line to:

const ignore  = require("bluebird").noConflict();

So that the handler body is using the native promise. If you pull out bluebird, but you don't replace its import with something which takes a similar amount of time to load, then all variables are not being controlled for.

Hi @moea, so after much head scratching, everything works fine after bumping the ram on the lambda function. We are going to dig into it a little more and ask AWS why resumption of a previously frozen function that executed would then cease to work.

Hi i'm closing this. I can finally reproduce the problem in pure JS. Check it out here: https://github.com/Type-Zero/promise-timeout-hang-js

The short is it seems if your code outside of the handler consumes more than ~16% of your memory, if NodeJS suffers a timeout, the function is hosed.

The crypto code outside the handler proved with some tweaking to consistently kill the function.

Oh my god. Just in case anyone else lands here and makes my mistake: make sure you haven't accidentally deployed AWS.config.update(...) using a localhost endpoint instead of an actual AWS region-specific endpoint. That can present with similar symptoms as above (say, if the only async function you're using is accessing a DynamoDB document client) but for utterly different reasons.

@joekiller Despite my stupidity, I've managed to learn a lot from this discussion. Thanks for posting it.

FYI AWS finally fixed the memory problem.