elgentos/masquerade

Launch subprocesses to speed things up

Closed this issue · 10 comments

Instead of running everything in a single PHP process, it'd be nice if the application ran a subprocess for each table, or something like that.

What would you propose we use for threading? pthreads is only safe in PHP 7.2+, which would restrict usage to 7.2 and later, and that isn't very widely adopted yet.

Another, hackier way would be to let masquerade call itself through Symfony Process plus one of the parallel process packages.

While pthreads is a really nice extension, I think it's better to go for Symfony Process. I've seen several applications where the runtime calls itself to launch subprocesses.

Aside from the PHP 7.2+ requirement, php-pthreads hasn't been adopted by most distros. I think the extension needs a bit more time to get adopted into the PHP ecosystem.

I've done some preliminary work in the 24-subprocesses branch. See commit 09ce2ac. I start a parallel process per group here, not per table.

It works, but it immediately outputs the progress bar for every running process (not to mention the logo and the 'Done anonymizing' message for every process as well), making a mess:

 _  _. _ _.    _ .__. _| _  
| | |(_|_>(_||_|(/_|(_|(_|(/_ 
            |
                   by elgentos
                        v0.1.1
                              
._ _  _. _ _.    _ .__. _| _  
| | |(_|_>(_||_|(/_|(_|(_|(/_ 
            |
                   by elgentos
                        v0.1.1

Updating admin_user
     0/26750 [>---------------------------]   0%                              
._ _  _. _ _.    _ .__. _| _  
| | |(_|_>(_||_|(/_|(_|(_|(/_ 
            |
                   by elgentos
                        v0.1.1

Updating sales_creditmemo

Updating email_contact

Updating newsletter_subscriber
  0/14 [>---------------------------]   0%   0/280 [>---------------------------]   0%     0/11616 [>---------------------------]   0%     0/11670 [>---------------------------]   0%                              
._ _  _. _ _.    _ .__. _| _  
| | |(_|_>(_||_|(/_|(_|(_|(/_ 
            |
                   by elgentos
                        v0.1.1
                              
._ _  _. _ _.    _ .__. _| _  
| | |(_|_>(_||_|(/_|(_|(_|(/_ 
            |
                   by elgentos
                        v0.1.1
                              
._ _  _. _ _.    _ .__. _| _  
| | |(_|_>(_||_|(/_|(_|(_|(/_ 
            |
                   by elgentos
                        v0.1.1

Updating sales_invoice

Updating review_detail
    0 [>---------------------------]
Done anonymizing
     0/32941 [>---------------------------]   0%                              
._ _  _. _ _.    _ .__. _| _  
| | |(_|_>(_||_|(/_|(_|(_|(/_ 
            |
                   by elgentos
                        v0.1.1

Updating sales_order
     0/38379 [>---------------------------]   0%
Updating quote
     0/78297 [>---------------------------]   0%

Done anonymizing
 14/14 [============================] 100%^

Cool stuff! It might be better to work with return values in the subprocesses, read them in the master process and create progress bars based on that.

Not sure if that's possible, this is new stuff for me haha.
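A rough shell sketch of that idea, assuming each subprocess reports progress as plain "<table> <done> <total>" lines and only the parent prints anything. The worker function, table names and row counts are made up for illustration; this is not how masquerade or the 24-subprocesses branch does it.

```shell
#!/usr/bin/env bash
# Sketch only: workers report progress as lines, the parent is the sole printer.

# Hypothetical worker: pretends to process rows in batches and emits
# one "<table> <done> <total>" line per batch instead of drawing a bar.
worker() {
    local table=$1 total=$2 batch=$3 done_rows=0
    while [ "$done_rows" -lt "$total" ]; do
        done_rows=$(( done_rows + batch ))
        if [ "$done_rows" -gt "$total" ]; then
            done_rows=$total
        fi
        echo "$table $done_rows $total"
    done
}

# Parent: start the workers in parallel, funnel their lines through one pipe,
# and render every update from a single reader.
report=$(
    {
        worker admin_user 14 7 &
        worker sales_order 30 10 &
        wait
    } | while read -r table done_rows total; do
        printf '%-12s %3d%%\n' "$table" $(( done_rows * 100 / total ))
    done
)
echo "$report"
```

Because only one reader writes to the terminal, the logo and the per-process bars can no longer interleave. The order of the lines still depends on scheduling, so a real parent would keep the latest value per table and redraw from that.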

https://github.com/krakjoe/parallel

This is really nice, but unfortunately it's a PECL extension, which limits usage. So a no-go.

FWIW, I wrote a little bash script to make it possible for now:

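#!/usr/bin/env bash
# Start one detached screen session per masquerade group so the groups run in parallel.
DATABASE="your_db_name"
MASQUERADE_PLATFORM="magento2"
# Take the third '|'-separated field of the `groups` table output, skip the header row, and de-duplicate.
MASQUERADE_GROUPS=($(bin/masquerade groups --platform "${MASQUERADE_PLATFORM}" | grep '|' | grep -v 'Group' | awk -F'|' '{print $3}' | sort -u))

for group in "${MASQUERADE_GROUPS[@]}"
do
    echo "Starting process for group $group"
    # -d -m starts the session detached; -S names it so it can be reattached with `screen -r`.
    screen -d -m -S "anonymize_${group}" bin/masquerade run --platform "${MASQUERADE_PLATFORM}" --database "${DATABASE}" --group "${group}"
done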
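The script above starts every group at once and returns immediately. If you'd rather cap the number of simultaneous runs and block until everything is finished, something along these lines might work. It is only a sketch: run_group is a stand-in for the real bin/masquerade run call, and the group names and MAX_JOBS value are placeholders.

```shell
#!/usr/bin/env bash
# Sketch only: run groups in batches of MAX_JOBS using plain background jobs.

MAX_JOBS=2    # assumed limit; pick whatever the database server can handle

# Stand-in worker. In the real script this would be something like:
#   bin/masquerade run --platform "$MASQUERADE_PLATFORM" --database "$DATABASE" --group "$1"
run_group() {
    echo "anonymized group $1"
}

log=$(
    running=0
    # Placeholder names; the real script would loop over "${MASQUERADE_GROUPS[@]}".
    for group in customer order quote invoice; do
        run_group "$group" &
        running=$(( running + 1 ))
        if [ "$running" -ge "$MAX_JOBS" ]; then
            wait          # crude batching: let the current batch finish first
            running=0
        fi
    done
    wait                  # make sure a final partial batch has finished too
)
echo "$log"
```

Batching like this is simple but not optimal: a slow group holds up the start of the next batch. It does avoid the screen dependency and makes the script exit only when all groups are done.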

I created a subprocess per group/table combination:

[image]

@tdgroot could you test it?

You can clone the https://github.com/elgentos/masquerade/tree/24-subprocesses branch and run bin/masquerade to test it.

This is a nicer package with some more options: https://github.com/graze/parallel-process. Using this, we could do something like:

<?php

namespace Elgentos\Masquerade\Commands;

use Graze\ParallelProcess\Event\RunEvent;
use Graze\ParallelProcess\PriorityPool;
use Graze\ParallelProcess\RunInterface;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Input\InputOption;
use Symfony\Component\Console\Output\OutputInterface;
use Symfony\Component\Process\Process;

class ExampleCommand extends Command
{
    /**
     * @var OutputInterface
     */
    private OutputInterface $output;

    protected function configure()
    {
        $this
            //
            ->addOption('subprocess', 's', InputOption::VALUE_OPTIONAL, 'Whether the command is run as a subprocess', false);
    }

    protected function execute(InputInterface $input, OutputInterface $output)
    {
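        $this->output = $output;

        // Child mode: pretend to do some work, then report the result as JSON on stdout.
        if ($input->getOption('subprocess')) {
            sleep(2);
            $output->write(json_encode(['date' => date('d-m-Y H:i:s')]));
            return 0;
        }

        // Parent mode: queue 20 child processes, running at most 5 at a time.
        $pool = new PriorityPool();
        $pool->setMaxSimultaneous(5);
        for ($i = 0; $i < $pool->getMaxSimultaneous() * 4; $i++) {
            $pool->add(new Process(['php', 'application.php', '--subprocess=1']));
        }

        // Attach a listener to every run so the parent prints each child's result.
        array_map([$this, 'addCallback'], $pool->getAll());

        $pool->run();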

        return 0;
    }

    public function addCallback(RunInterface $run)
    {
        $run->addListener(
            RunEvent::SUCCESSFUL,
            function (RunEvent $event) {
                $data = json_decode($event->getRun()->getLastMessage(), true);
                $this->output->writeln('The date is ' . $data['date']);
            }
        );
    }
}

The main problem this issue tried to solve was speed, and that was fixed in version 0.3.0. So closing this issue.