general-CbIC/poolex

Wrong error when a genserver timeout

Closed this issue · 15 comments

Hi!

When the function you use in Poolex.run raise a timeout, it returns :all_workers_are_busy. I did a test alexcastano@c84a84b

If I have time, I will finish the PR with the solution. However, I think the timeout option can be misleading. I expected to be the timeout for the whole process, but it is only used to get the worker. The documentation is correct, but you have to check the documentation: https://hexdocs.pm/poolex/Poolex.html#t:run_option/0. Maybe we can find a better name to avoid confusion, like :checkout_timeout?

Ecto used to have :pool_timeout, but it was deprecated. In Ecto :timeout includes the queue time and the query time: https://hexdocs.pm/ecto/changelog.html#backwards-incompatible-changes-1 . This was my expectation for poolex too :)

Thanks

The bug is more complex than I thought at first. If the following line throws a :timeout:

{:ok, pid} = GenServer.call(pool_id, :get_idle_worker, timeout)

The caller process won't die because we have the catch. If the caller process doesn't die later (maybe it is a long live process), caller process will finally receive the message from the pool but won't handle it. This means the pool thinks the worker is busy because the caller process is still alive, but the caller process is not using it. I hope I have made myself clear.

I checked poolboy and it handles this situation. Instead of using a direct Genserver.call, it uses the checkout function. In this function, poolboy sends a cancel message if it times out. I think this is the right solution:

https://github.com/devinus/poolboy/blob/master/src/poolboy.erl#L60-L67

Hi, Alex!

I will check. Thanks for the issue 🙏🏻

Thanks again for describing the problem! I fully agree that there needs to be more information on the call site.

Help, please, to clarify the idea with the common :timeout. Do I understand correctly that we are talking about a timeout determining the total time to complete the task (time to get the worker from the queue + time to execute the function)?

If I understand correctly, the idea sounds sensible. If we treat the pool in the same way as we treat the database, and it is not important for us to define timeouts for different stages, but we need to limit the time of tasks, then the total timeout is what we need :)

I also agree about the second bug, but fixing it after (or together with) changing the timeout logic might be more correct.

Sorry, I mixed up many issues in this ticket.

I looked at the code, and now I have more insights. Let me explain in part:

  1. The current :timeout param was confusing us because we expected a different behaviour, more similar to the Ecto or GenServer. However, you are right. Because the function is executing in the caller process, implementing :timeout could be difficult and misleading. In my opinion, we should just change :timeout option to :checkout_timeout to avoid confusion.
  2. I would implement the checkin and checkout functions in a similar way to the :poolboy does. I would also make them public, it makes sense to me, but it is up to you. That would fix both bugs at once.
  3. I'm unsure if the function run/1 should capture errors from the given function. After all, it will run in the same caller process. It should be up to the user to decide if they want to capture or not the errors.
  4. Similarly, we should standardize the replies in the run/1 function. It should return always {:ok, result} or {:error, type_of_error}. When we return {:ok, _}, the given function could be executed successfully. When we return {:error, _}, it could not be executed for any reason. These reasons are:
    4a. {:error, :checkout_timeout}.
    4b. {:error, {:runtime_error, error}} if you still want to return the runtime errors. I wouldn't do it.

If you don't have time to fix it, I will try to make a PR because we need this fixed too :)

In my opinion, we should just change :timeout option to :checkout_timeout to avoid confusion.

I agree with this idea. Separate checkout timeouts and runtime timeouts sound right :)
In the future, I would also consider the general timeout. This may be more difficult to implement, but on the other hand, it can greatly simplify understanding of the use.

I would also make them public, it makes sense to me, but it is up to you.

To be honest, I don't know how it would be better. I felt there is no real reason for using such interfaces and reducing it to a single and understandable universal run is better. Please tell me what approaches you see in using checkin and checkout when working with a pool.

if you still want to return the runtime errors. I wouldn't do it.

Could you tell me what you propose to return if a runtime error happens? If there are two interfaces, run and run!, we must somehow handle the error in the case of run.

If you don't have time to fix it, I will try to make a PR because we need this fixed too :)

I suggest implementing this within 1 or 2 weeks. But if you need faster, then I would be glad to see a PR from your side 🙏🏻

In my opinion, we should just change :timeout option to :checkout_timeout to avoid confusion.

I agree with this idea. Separate checkout timeouts and runtime timeouts sound right :) In the future, I would also consider the general timeout. This may be more difficult to implement, but on the other hand, it can greatly simplify understanding of the use.

I don't see a straightforward implementation of the general :timeout. So, if it is not a mandatory feature I would prefer not to overcomplicate the code.

I would also make them public, it makes sense to me, but it is up to you.

To be honest, I don't know how it would be better. I felt there is no real reason for using such interfaces and reducing it to a single and understandable universal run is better. Please tell me what approaches you see in using checkin and checkout when working with a pool.

I won't use it for my current case, but I can imagine a situation where a GenServer wants to have more control over the resource. It is riskier, but it is up to the user to use.

if you still want to return the runtime errors. I wouldn't do it.

Could you tell me what you propose to return if a runtime error happens? If there are two interfaces, run and run!, we must somehow handle the error in the case of run.

I understand returning errors when the code is executed in another process, and I would like to know what happened. However, the given function runs precisely in the same process. If the user really has to handle runtime errors, it is possible to do it in the function:

Poolex.run(:my_pool, fn server ->
  try do
    whatever(server)
  rescue 
    ... -> {:runtime_error, :whatever_failed}
  end
end)

Or even outside of the function:

try do
  Poolex.run(:my_pool, &whatever/1)
catch
  _ -> :whatever
end

This approach (user handles the runtime errors) for this library makes more sense to me, but it is not a strong opinion.

I checked the code. This also means we should constantly monitor the caller, not only when waiting for a worker, to release the worker in case of a crash. Similar thing for checkin & checkout functions.

I won't use it for my current case, but I can imagine a situation where a GenServer wants to have more control over the resource. It is riskier, but it is up to the user to use.

It may be worth leaving these functions private. And make them public only when we are sure it is needed.

I understand returning errors when the code is executed in another process, and I would like to know what happened. However, the given function runs precisely in the same process. If the user really has to handle runtime errors, it is possible to do it in the function...

If we define a function without a bang, it must comply with the agreement and not raise runtime errors. But you're right; this is not the necessary complication, and control should be transferred to the call site. It's currently challenging to comprehend the correct way of doing it.

Hi, @alexcastano!

This also means we should constantly monitor the caller, not only when waiting for a worker, to release the worker in case of a crash.

I want to inform you that this trouble was fixed at #56.

Could you please check the latest release? What do you think, the main problem was solved? Do we still need checkin and checkout functions?

Oh, I see. I have rechecked your test cases, and the problem with timeout still needs to be solved.

Hi! Sorry for the delay!

Could you please check the latest release? What do you think, the main problem was solved? Do we still need checkin and checkout functions?

You need the functions if you want to use the library in a Genserver. Suppose your Genserver and the resource from the pool have to send several messages between them because it uses an async interface. Your only option is to allow the Genserver to use checkin and checkout to handle the resource.

I tried to implement it, but it was a significant change because now you must also monitor the clients. I didn't have enough time, so I moved back to :poolboy for the moment. This is my branch: develop...alexcastano:poolex:cancel_checkout

IMO, the safer step would be to copy how :poolboy works. It has some minor bugs like "async shutdown", but it's battle-tested.

I'm going to check your changes now and I will tell you what I think. Please take it as just another opinion. I intend to help you create a stable library that many Elixir users can use in the future.