luvit/luv

Unable to `cancel` a `luv_work_ctx_t` handle

miversen33 opened this issue · 16 comments

RE: #629

If you create new work via the uv.new_work function and then queue it with the queue_work function, how do you go about "killing" said work?

Some example code to show what I mean

local luv = require("luv")
local work_fn = function()
    local count = 0
    while count > 0 do
        count = count + 1
    end
    return
end
local after_work_fn = function()
    print("Work Complete!")
end
local work = luv.new_work(work_fn, after_work_fn)
work:queue()

As we see in the above code, work_fn will never "exit" and thus run infinitely. I don't see anything in the docs about how to stop "rogue" work processes. Of course, one isn't going to intentionally create code that will run forever, but it is certainly possible that code could end up doing so. For my use case, I am looking more from the angle of "code is running too long and user wants it to stop". I don't see how I would go about doing that.

Am I missing something?

Little bit of diving into the libuv docs, I dug up this tidbit

Initializes a work request which will run the given work_cb in a thread from the threadpool. Once work_cb is completed, after_work_cb will be called on the loop thread.

    This request can be cancelled with uv_cancel.

However, when modifying the above code to try and use uv.cancel, I get the following error

Error executing luv callback:
/tmp/test.lua:35: bad argument #1 to 'cancel' (uv_req expected, got userdata)
stack traceback:
        [C]: in function 'cancel'
        /tmp/test.lua:35: in function </tmp/test.lua:34>

Note, updated code

local luv = require("luv")
local work_fn = function()
    local count = 0
    while count > 0 do
        count = count + 1
    end
    return
end
local after_work_fn = function()
    print("Work Complete!")
end
local work = luv.new_work(work_fn, after_work_fn)
work:queue()
local timer = luv.new_timer()
timer:start(100, 0, function()
    luv.cancel(work)
end)

Digging deeper still, the cancel doc actually says

Cancel a pending request. Fails if the request is executing or has finished executing.

    Returns 0 on success, or an error code < 0 on failure.

    Only cancellation of uv_fs_t, uv_getaddrinfo_t, uv_getnameinfo_t, uv_random_t and uv_work_t requests is currently supported.

    Cancelled requests have their callbacks invoked some time in the future. It’s not safe to free the memory associated with the request until the callback is called.

    Here is how cancellation is reported to the callback:

        A uv_fs_t request has its req->result field set to UV_ECANCELED.

        A uv_work_t, uv_getaddrinfo_t, uv_getnameinfo_t or uv_random_t request has its callback invoked with status == UV_ECANCELED.

If I am reading that correctly, does that mean that you can't cancel a work_ctx_t handle?
Noting specifically that the new_work calls out that you do not get a work_t handle.

uv.new_work(work_callback, after_work_callback)

Parameters:

    work_callback: function
        ...: threadargs passed to/from uv.queue_work(work_ctx, ...)
    after_work_callback: function
        ...: threadargs returned from work_callback

Creates and initializes a new luv_work_ctx_t (not **uv_work_t**). Returns the Lua userdata wrapping it.

I'm not too familiar with the uv_work_t bindings, but from your comments it definitely seems like this could be an oversight in our bindings. My initial reaction is:

  • luv_cancel should test for luv_work_ctx and call something like luv_work_cancel instead (implemented in work.c)
  • luv_work_cancel should be bound as cancel in the luv_work_ctx metatable (like queue is here)

My C skills are quite dull, but I could see if I can get that implemented. It will likely take me alot longer than someone with more intimate knowledge with the luv codebase though. I do agree with your reaction. I spent a bit of time digging through the source to see if there was a way to convince lua to cancel the luv_work_ctx, but I didn't see too much that I was familiar with.

That said, this is something that will prevent my work going forward so I am willing to put some (hopefully guided) effort into rectifying this, and maybe sharpening my C skills in the process lol.

EDIT: Ignore this, see comment below

In looking at it a bit more, I think this might be a more involved change than I was initially thinking it would be. luv_work_ctx_t doesn't store enough information to retrieve any associated uv_work_t, so we won't have anything to pass the uv_cancel. That is, in this example:

local work = luv.new_work(work_fn, after_work_fn)
work:queue()
-- the 'work' userdata doesn't have any reference to the uv_work_t created during 'queue()',
-- so there's no way to implement 'cancel()' which needs a uv_work_t
work:cancel()

I think a complete solution will involve returning a luv_work_t from luv_new_work and likely folding luv_work_ctx_t into it as necessary.

A quicker fix might be to add a uv_work_t* field to luv_work_ctx_t, initialize it to NULL, and then populate it in luv_queue_work (and make sure we can't run into any use-after-free scenarios).

Actually, I think the right solution might be to make queue return something that can be canceled, as one return of new_work can be queued multiple times, so it seems like the returns of the queue calls are the things that you'd actually want to be cancelable.

In that case, the fix would be to make luv_queue_work return either a luv_req_t (if possible, this would get cancel support for free) or a luv_work_t (if necessary, would need to add a bit of special handling for this).

From the Lua side of things, usage would look something like:

local work = luv.new_work(work_fn, after_work_fn)
local foo = work:queue()
foo:cancel()

I took a bit of a deeper peek under the hood last night after I saw your comments, and while I agree with the latter opinion (of queue returning a cancelable object by default), I am unfortunately a bit over my head on trying to provide any sort of code change to resolve this I fear :(

I can read the code generally fine but I am failing to grasp how the lua pieces of this interact with the c code under the hood (due to my not knowing how lua is implemented within C in general). I am going to step back on this and follow along the chain, but I don't believe I'll have any useful code I can provide here

No worries, I've started attempting an implementation in my fork here: https://github.com/squeek502/luv/tree/cancelable-work

When testing things I realized that the example in the first comment in this thread won't work even if cancel is fixed for uv_work_t. From the threads guide:

Only tasks that are yet to be started can be cancelled. If a task has already started executing, or it has finished executing, uv_cancel() will fail.

Further, uv_cancel can fail even if it's called before the loop is ever run:

local luv = require("luv")
local work_fn = function(i)
  print(i)
  return i
end
local after_work_fn = function(...)
  print("after_work_fn", ...)
end
local work = luv.new_work(work_fn, after_work_fn)

-- queue 16 instances
local reqs = {}
for i=1,16 do
  table.insert(reqs, work:queue(i))
end

-- cancel them all
for i, queued in ipairs(reqs) do
  print(i, queued:cancel())
end

-- now run the loop
luv.run()

Will output something like (the added comments are annotations to make it more understandable):

-- the first four queued can't be canceled
1	nil	EBUSY: resource busy or locked	EBUSY
2	nil	EBUSY: resource busy or locked	EBUSY
3	nil	EBUSY: resource busy or locked	EBUSY
4	nil	EBUSY: resource busy or locked	EBUSY
-- the rest are successfully canceled
5	0
6	0
7	0
8	0
9	0
10	0
11	0
12	0
13	0
14	0
15	0
16	0
after_work_fn -- these are all the canceled `uv_work_t`
after_work_fn
after_work_fn
after_work_fn
after_work_fn
after_work_fn
after_work_fn
after_work_fn
after_work_fn
after_work_fn
after_work_fn
after_work_fn -- end successfully canceled `uv_work_t`
2 -- this is work_fn being called for the 2nd queued work
after_work_fn	2
3 -- this is work_fn being called for the 3rd queued work
1 -- this is work_fn being called for the 1st queued work
after_work_fn	3
after_work_fn	1
4 -- this is work_fn being called for the 4th queued work
after_work_fn	4

So, your use case unfortunately doesn't seem to fit this API.

Well thats no good :( What is the recommended way to handle something like my initial use case (work that needs to be stopped "mid flight")?

Libuv is an event loop, while it's running your code it has no control. cancel was never intended to be able to stop any non-IO request mid execution (Libuv handles all IO itself).

Libuv is an event loop, while it's running your code it has no control

That doesn't make sense. Libuv is an event loop, any subprocess it starts should be terminable (even if it is forcible).

That said it does appear that cancel is not quite what I wanted here

the thread doc states the following

uv_cancel() is useful to cleanup pending tasks if the user requests termination. For example, a music player may queue up 
multiple directories to be scanned for audio files. If the user terminates the program, it should quit quickly and not wait until all 
pending requests are run.

So that brings me back to my initial request then. How do I kill a luv_work_ctx_t handle?

Libuv is an event loop, any subprocess it starts should be terminable

Libuv is not preemptive, libuv is never in control when user code is running.

The paragraph immediately preceding that documentation notes the limitation:

Only tasks that are yet to be started can be cancelled. If a task has already started executing, or it has finished executing, uv_cancel() will fail.

uv_cancel() is useful to cleanup pending tasks if the user requests termination.

You can cancel work requests that have not yet begun, but once the request has started it must run to completion. Libuv runs work requests by calling the work function, at which point the only way it regains control is when that function returns.

You can cancel work requests that have not yet begun, but once the request has started it must run to completion. Libuv runs work requests by calling the work function, at which point the only way it regains control is when that function returns.

To be clear then, this means there is no way to kill an in flight luv_work_ctx_t? It seems a bit odd to me that a framework that provides an event loop doesn't have some sort of protection against runaway work being done in its loop.

I am not completely set on using the work context for this, it just seemed to be the best fit for what I needed. I believe luv also offers true OS threads though that seemed much more expensive for what I am looking for (my framework I am building cannot make guarantees about the work being done and needs a way to kill (at a user's request) active running work).

I want to be clear, I am not against using an alternative. I am asking what that alternative may be if using luv_work_ctx_t is not the answer. Its clear that cancel wont work. Cool, so what should I look into?

It seems a bit odd to me that a framework that provides an event loop doesn't have some sort of protection against runaway work being done in its loop.

It's not all that uncommon, stopping an in-flight work request requires preempting, which is quite difficult to do (nearly impossible at Libuv's level).

As far as I'm aware Libuv doesn't provide any way to stop, kill, or otherwise halt a spawned thread; much less do the same for an individual work request (which will be running on one of Libuv's threadpool threads).

If you cannot guarantee that the work will ever finish, you'll either need to delegate it to its own thread (at which point the os scheduler will handle it behaving poorly) or instrument it with your own code that will preempt it if it runs for too long or something similar. For the latter, you can use debug.sethook.

If you cannot guarantee that the work will ever finish, you'll either need to delegate it to its own thread (at which point the os scheduler will handle it behaving poorly) or instrument it with your own code that will preempt it if it runs for too long or something similar. For the latter, you can use debug.sethook.

Well that's not the answer I was hoping for lol. I appreciate the direction. It appears that the request to get cancel implemented is still needed as it doesn't exist in the luv implementation of libuv, however it also sounds like getting that implemented won't address my specific issue. Should I close this thread out or does the luv team want this to stay open to track the luv_work_ctx_t cancel work?

Leave it open, we definitely should make our cancel bindings work with uv_work_t.