wizacode/php-etl

Add Validator Step

Closed this issue · 9 comments

qferr commented

Hi. Have you planned to implement a validation step? This could be very useful in ensuring that the extracted rows are valid. Thank you.

Hello, Do you think of something like this ? 9d4b897

qferr commented

Hi.

First of all, thank for you quick response!

I was thinking of a new Validator namespace with a set of simple validators. Validation should be different from transformation.

Example:

$trimTransformer = new Wizaplace\Etl\Transformers\Trim();
$notEmptyValidator = new Wizaplace\Etl\Validators\NotEmpty();
$callbackValidator = new Wizaplace\Etl\Validators\Callback();

$etl
    // ...
    ->transform($trimTransformer)
    ->valid($notEmptyValidator, ['columns' => 'id'])
    ->valid($callbackValidator, ['callback' => function (Row $row) {
        if ($row->get('id') <= 0) {
            $row->discard();
        }
    } ])
    // ...
    run();

What do you think about that?

First thanks for your suggestions :)

I am not sure how a valid step is different from a transform one.
I mean, what could you achieve with valid that you could not with transform.

$trimTransformer = new Wizaplace\Etl\Transformers\Trim();
$notEmptyValidator = new Wizaplace\Etl\Validators\NotEmpty();
$callbackValidator = new Wizaplace\Etl\Validators\Callback();

$etl
    // ...
    ->transform($trimTransformer)
    ->transform($notEmptyValidator, ['columns' => 'id'])
    ->transform($callbackValidator, ['callback' => function (Row $row) {
        if ($row->get('id') <= 0) {
            $row->discard();
        }
    } ])
    // ...
    run();

I have nothing really against this in absolute terms, but as far as i understood, valid is only a plain alias of transform, maybe you can elaborate more about your needs.

On another hand, i think your idea of a NotEmpty validator, (DiscardEmpty transformer ?) could be a nice addition.

@ecourtial i think you closed this issue a little too soon, i think @qferr was meaning about something different from my RowCallback idea, what is your opinion about this new step idea ?

qferr commented

@Nicolas-Masson-Wizaplace Thank you. I agree with you, let's create the DiscardEmpty transformer. I wanted a validator step with a different behavior from a transformer. The goal is not to transform data like a transformer but to validate it. Actually, I don't know where this should be done in the ETL pipeline.

Also, it would be nice to be able to stop the ETL process as soon as an error occurs rather than discarding the row. It would avoid running the system for nothing.

@qferr

I wanted a validator step with a different behavior from a transformer. The goal is not to transform data like a transformer but to validate it. Actually, I don't know where this should be done in the ETL pipeline.

In an ETL paradigm (see https://en.wikipedia.org/wiki/Extract,_transform,_load), TRANSFORM mean something more generic than mutate data in rows. Validating, normalizing, discarding can be part of this TRANSFORM process.

Also, it would be nice to be able to stop the ETL process as soon as an error occurs rather than discarding the row. It would avoid running the system for nothing.

Keep in mind than ETL is a pipeline (Generator driven in PHP), rows are LOADED as soon they are EXTRACTED (and eventually TRANSFORMED). If you abort an ETL process part of the data are already LOADED.

It looks like you want to do some preprocessing through a first ETL (ET) pipeline to validate the whole EXTRACTED data, then if all is OK, running again the data set through another ETL (EL) pipeline.

qferr commented

Thanks for the details. I appreciate.

Keep in mind than ETL is a pipeline (Generator driven in PHP), rows are LOADED as soon they are EXTRACTED (and eventually TRANSFORMED). If you abort an ETL process part of the data are already LOADED.

What about a process rollback? We could add the rollback() method in the Step interface. This way the loaders would be able to remove the added lines. Using a MySQL transaction is a good example for the Insert loader:

class Insert extends Loader
{
    public function initialize()
    {
        $this->db->pdo($this->connection)->beginTransaction();
    }

    public function rollback()
    {
        $this->db->pdo($this->connection)->rollback();
    }

    public function finalize()
    {
        $this->db->pdo($this->connection)->commit();
    }
}

What do you think?

It looks like you want to do some preprocessing through a first ETL (ET) pipeline to validate the whole EXTRACTED data, then if all is OK, running again the data set through another ETL (EL) pipeline.

Yes I understand. By the way, a best practice is to use staging tables. I could use ETL to load the extracted data into the staging table because in my case I also need to do validation on a dataset, not just a row. Unfortunately, this ETL can't do that because it uses the line-by-line strategy. I could use this ETL to do the first step, loading extracted data in a staging table.

https://www.timmitchell.net/post/2017/06/14/etl-staging-tables/

If you have any idea of a better implementation feel free to make a PR we are all open to contributions.