amphp/parallel

Dealing with zombie processes

mlasri-web2 opened this issue · 10 comments

Hi,
I have an API that runs multiple Parallel tasks. After upgrading to the latest version, we noticed in production and preprod that we are left with a huge number of zombie processes: each request leaves 8 to 10 zombies (equal to the number of workers), and we reached 5k zombies in less than 1 hour.

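use Amp\Parallel\Worker\Execution;
use Amp\TimeoutCancellation;
use function Amp\Future\awaitAll;
use function Amp\Parallel\Worker\submit;

// Submit one execution per task, each with its own timeout.
// $evaluation is the Task being submitted (its construction is not shown in this post).
$executions = [];
foreach ($tasks as $key => $task) {
    $executions[$key] = submit(
        $evaluation,
        new TimeoutCancellation($timeout, sprintf("Engine timeout reached! %s", $timeout))
    );
}

// awaitAll() returns [errors, results], both keyed like $executions.
[$errors, $result] = awaitAll(array_map(
    fn (Execution $exe) => $exe->getFuture(),
    $executions
));

foreach ($errors as $key => $error) {
    // sprintf() returns the string; printf() would echo it and pass its length to log().
    $this->log('warning', sprintf(
        "First pool error with key: %s and message %s \n",
        $key,
        $error->getMessage()
    ));
}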

Note: the pcntl extension is enabled.

Update 1:

  • I have tried reducing the maximum number of workers, but it did not help; I still get zombies:
    $this->worker = new ContextWorkerPool(5);

@kelunik, can you please look into this problem with me?

Thanks in advance.

Hi @mlasri-web2!

I thought the zombie process issue was fixed in amphp/process; see PosixHandle::waitPid(), which should be invoked by the event-loop callback that awaits the process's exit code pipe. Would you be able to check whether there's any reason the waitPid function is not being invoked?
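
For readers unfamiliar with the term: a zombie is a child process that has exited but whose parent has not yet collected its exit status. The sketch below is a generic illustration using PHP's pcntl functions, not the amphp/process implementation (the internals of PosixHandle are not shown in this thread). It assumes the CLI SAPI with the pcntl extension loaded.

```php
<?php
// Generic illustration only; NOT how amphp/process is implemented.
// Assumes PHP CLI with the pcntl extension.

$pid = pcntl_fork();
if ($pid === -1) {
    throw new RuntimeException('fork failed');
}

if ($pid === 0) {
    // Child: terminate immediately with a known exit status.
    exit(7);
}

// Parent: give the child time to exit. Until the parent waits on it,
// the exited child stays in the process table as a zombie ("Z" in ps).
usleep(100_000);

// Waiting on the child collects its status and removes the zombie entry.
$reaped = pcntl_waitpid($pid, $status);
$exitCode = pcntl_wifexited($status) ? pcntl_wexitstatus($status) : null;

printf("reaped pid %d, exit code %s\n", $reaped, var_export($exitCode, true));
```

If the wait call is never made for a given child, that child's entry remains until the parent itself exits, which would match the accumulation described above.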

Hi @trowski, I am reopening this issue because, as a temporary workaround for these zombies, we have had to keep deleting the .sock files left in /tmp/ using a job.

The current situation is that we use the Process in two ways, CLI and HTTP. The CLI closes the .sock files properly, but the same service in a Symfony controller (POST HTTP endpoint) leaves one .sock file per request: srwxrwxrwx 1 www-data www-data 0 Apr 5 00:53 amp-parallel-ipc-87cff776753e279b1e56.sock.

Note: PosixHandle::asyncWaitPid is invoked

I am still trying to figure out the cause and a solution.

The temporary .sock file should be removed by LocalIpcHub::unlink(), which should be invoked in the destructor. Can you have a look if this function is being invoked, and if not, why that might be happening? We are suppressing errors there, you could try removing that suppression to see if there is an unexpected error (though if the file was created, I'd expect removing it to succeed as well).
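
To check whether a destructor-based cleanup is reached at all, a minimal stand-in can help. The sketch below does not use LocalIpcHub; TempIpcFile is a hypothetical class invented for illustration, and a plain file stands in for the socket. It logs when the destructor runs and reports a failed removal instead of suppressing it. It assumes PHP 8.1 or later.

```php
<?php
// Hypothetical stand-in for an object that owns a temporary socket path.
// This is NOT the LocalIpcHub code; names and behavior are assumptions.
final class TempIpcFile
{
    public readonly string $path;

    public function __construct()
    {
        $this->path = sys_get_temp_dir() . '/example-ipc-' . bin2hex(random_bytes(10)) . '.sock';
        touch($this->path); // a plain file stands in for the real socket
    }

    public function unlink(): void
    {
        if (!file_exists($this->path)) {
            return;
        }
        // No "@" suppression here, so a failure also surfaces as a PHP warning.
        if (!unlink($this->path)) {
            error_log("could not remove {$this->path}");
        }
    }

    public function __destruct()
    {
        // Logging here shows whether the destructor is reached at all.
        error_log("destructor reached for {$this->path}");
        $this->unlink();
    }
}

$hub = new TempIpcFile();
$path = $hub->path;
$existedBefore = file_exists($path);

unset($hub); // last reference dropped, so the destructor runs here

$existsAfter = file_exists($path);
var_dump($existedBefore, $existsAfter);
```

If the log line never appears in the real application, something is still holding a reference to the object or the process is ending without a normal shutdown; if it appears but the file stays, the removal itself is failing.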

@trowski thank you for pointing me to LocalIpcHub::unlink(). I found out that amphp/parallel was at version v2.2.2, which does not include the destructor.

The upgrade works perfectly!
Thank you very much.

@medy36 Thanks for confirming my recent changes fixed the issue!

Hi @trowski, the issue of .sock files left in /tmp/ still persists when executing in CLI, with the same code.

Hey @medy36! So the fix worked, then stopped working? Nothing has changed that I'm aware of. Would you be able to give me a bit more context and some code to reproduce the issue?

The fix does work! When executing the code in an HTTP request, the fix works.

But when executing the same code in CLI (I am running a job), the .sock files are left behind.

@trowski should I open a new issue regarding the execution in CLI mode?

@trowski in CLI mode, __destruct() is not even reached!

So no unlink is performed.

I am struggling at this point; if you have any guidance to offer, that would be very helpful.
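
This thread does not establish why the destructor is skipped in CLI. One general PHP behavior worth ruling out: a process killed by the default action of a signal such as SIGTERM or SIGINT ends without running destructors, whereas exit() does run them. The sketch below is only an assumption about a possible cause, not a confirmed diagnosis. It uses a hypothetical Cleanup class and plain pcntl/posix calls rather than anything from amphp, and assumes PHP 8.0 or later on the CLI with both extensions loaded.

```php
<?php
// Hypothetical guard object: creates a file and removes it in its destructor.
final class Cleanup
{
    public function __construct(private string $path)
    {
        touch($path);
    }

    public function __destruct()
    {
        @unlink($this->path);
    }
}

$path = sys_get_temp_dir() . '/example-cleanup-' . bin2hex(random_bytes(8)) . '.sock';

$pid = pcntl_fork();
if ($pid === -1) {
    throw new RuntimeException('fork failed');
}

if ($pid === 0) {
    // Child: turn SIGTERM into a normal exit so shutdown (and destructors) run.
    pcntl_async_signals(true);
    pcntl_signal(SIGTERM, static function (): void {
        exit(0);
    });

    $guard = new Cleanup($path);

    // Simulate a long-running job, bounded so the child never lingers.
    for ($i = 0; $i < 100; $i++) {
        usleep(100_000);
    }
    exit(1);
}

// Parent: wait until the child has created its file, then terminate it.
for ($i = 0; $i < 500 && !file_exists($path); $i++) {
    usleep(10_000);
    clearstatcache();
}
posix_kill($pid, SIGTERM);
pcntl_waitpid($pid, $status);

clearstatcache();
$leftBehind = file_exists($path);
var_dump($leftBehind);
```

Removing the pcntl_signal() call in the child should leave the file behind, which is a quick way to see whether the job runner stopping the process with a signal could explain the leftover .sock files.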