elixir-lang/elixir

Track Task callers in process dictionary

josevalim opened this issue ยท 17 comments

Elixir tasks makes it really straight-forward for developers to execute chunks of code concurrently. This is usually done with Task.async/1:

Task.async(fn ->
  ...
end) |> Task.await()

However, in practice, developers should prefer to use Task.Supervisor.async, as the supervision tree gives more visibility over spawned tasks:

Task.Supervisor.async(MySup, fn ->
  ...
end) |> Task.await()

Unfortunately, moving to Task.Supervisor has one big downside: we lose the relationship between the process who started the task and the task itself. That's because in the first example, the parent of the task is the caller process, but in the second example, the parent is the supervisor.

This means instrumentation/monitoring/logging tools and even testing tools like Mox and the Ecto's SQL Sandbox are incapable of detecting the relationship between them.

I propose that we add a new field to the Task process dictionary, called $callers, that tracks exactly the caller of the task. This will allow tools to properly build the relationship between processes and allow testing tools that rely on ownership mechanisms to propagate those effectively without requiring tests to run synchronously.

@fishcakez I would love your opinion on this one.

Wouldn't there be only a single caller for any process? In that case $caller might be a better dictionary key. Apologies if I'm missing something or if this is bike-shedding territory. :)

I can add our use case in logging that I feel can be improved.. In order for this to work for us we would need to implement Process.get(process, key, default). The use case is following - we would like all tasks spawned from a single HTTP request to include request_id in their logs. For doing so we capture metadata (via Logger.metadata/0) and set it inside task (via Logger.metadata(metadata). We might cheat and take $callers from spawned task to figure out what caller request_id is, but currently, that's not possible.

Other way how we use process dictionary is for concurrent testing with mocs. Our HTTP clients read endpoint URL which is replaced with bypass endpoints in some tests, doing so we don't rely on global application state (eg. Application environmnet) and can run those tests concurrently.

Both of them can be solved if we would start propagating some process dictionary keys automatically to spawned processes, but I'm not sure this is a good idea because of overhead and somewhat ambiguous behavior (you still can spawn process by calling Erlang function).

@lackac excellent question. We may have a process that starts a task that starts another task. We would like to track this all the way, hence $callers. The dollar sign is to mirror the current $ancestors. Note that we cannot use $caller to get the direct caller and access the direct caller to get its direct caller because in some cases, such as in Task.start and in Task.async_nolink, the direct caller may be dead, which means this information would be lost.

I have created this example to illustrate what would (I think) be the content of the $callers key (hope it helps):

self() 
#=(1)> #PID<0.2.0>

Process.get(:"$ancestors")  
#=(2)> [#PID<0.1.0>]

Process.get(:"$callers")  
#=(3)> [#PID<0.1.0>]

Task.async(fn ->
  self() 
  #=(4)> #PID<0.3.0>

  Process.get(:"$ancestors")  
  #=(5)> [#PID<0.2.0>, PID<0.1.0>]

  Process.get(:"$callers")  
  #=(6)> [#PID<0.2.0>, PID<0.1.0>]

  Task.Supervisor.async(MySup, fn ->
    self() 
    #=(7)> #PID<0.4.0>

    Process.get(:"$callers")  
    #=(8)> [#PID<0.3.0>, #PID<0.2.0>, PID<0.1.0>]

    Task.async(fn ->
      self() 
      #=(9)> #PID<0.5.0>

      Process.get(:"$callers")  
      #=(10)> [#PID<0.4.0>, #PID<0.3.0>, #PID<0.2.0>, PID<0.1.0>]

      Task.Supervisor.async(MySup, fn ->
        self() 
        #=(11)> #PID<0.6.0>

        Process.get(:"$callers")  
        #=(12)> [#PID<0.5.0>, #PID<0.4.0>, #PID<0.3.0>, #PID<0.2.0>, PID<0.1.0>]
      end)
    end)
  end)
end)

I have some doubts regarding (3), because I'm not sure if the $callers key will be populated at that level. Maybe the first $callers with results would be (6)?

(3) should work but at the top level one likely won't UNLESS it was started as Task.

I like this idea. I think the question is - should this integrate with logger metadata?

@michalmuskala for now I would say no, or at least discuss it elsewhere, but we at least know this makes it possible.

Is it possible to have a memory leak with $callers? It has a subtle difference with $ancestors because if an ancestor exits then the descendants must exit according to OTP. However these are tasks so Task.async would leak $ancestors in the same case right now, as a Task is not aware of its parent / doesn't have a receive loop.

Note that $ atoms in OTP use the $ sign to indicate that the atom itself and any associated term is for internal use only.

@fishcakez very good point. I see two options:

  1. Go ahead anyway since this issue exists for ancestors too

  2. Only track $caller and make sure we reset ancestors to the most immediate parent process unless running inside a supervisor

I would note that proc_lib has the same memory leak, so probably not an issue?

Only track $caller and make sure we reset ancestors to the most immediate parent process unless running inside a supervisor

I think this approach won't work as well because we need to traverse up the callers stack to the first pid matching. It's possible that first or last caller may not be the caller we want to associate with.

For example if an async task performs the ownership checkout in Ecto's SQL sandbox, we don't want to get the sandbox for its parent.

Go ahead anyway since this issue exists for ancestors too

So lets go for this one.

Is someone working on this? I would like to take a jab at it if you can point me in the right direction!

@sheharyarn no, please go ahead!

@josevalim Great! Any pointers/guidance on how and where I should get started?

@sheharyarn Without having looked at the code, I believe the way this would work is that you'd:

  1. find the place in the code where Task.async and Task.Supervisor.Async get to the point of spawning a new process.
  2. Right before that happens, grab the calling process's $callers variable from its process dictionary using Process.get
  3. Prepend self() to the list
  4. In the new spawned Task process, use Process.set to set the $callers variable in the child process.

I am working on this!

Closing in favor of #7995.