Add Validator Step
Closed this issue · 9 comments
Hi. Have you planned to implement a validation step? This could be very useful in ensuring that the extracted rows are valid. Thank you.
Hello, Do you think of something like this ? 9d4b897
Hi.
First of all, thank for you quick response!
I was thinking of a new Validator namespace with a set of simple validators. Validation should be different from transformation.
Example:
$trimTransformer = new Wizaplace\Etl\Transformers\Trim();
$notEmptyValidator = new Wizaplace\Etl\Validators\NotEmpty();
$callbackValidator = new Wizaplace\Etl\Validators\Callback();
$etl
// ...
->transform($trimTransformer)
->valid($notEmptyValidator, ['columns' => 'id'])
->valid($callbackValidator, ['callback' => function (Row $row) {
if ($row->get('id') <= 0) {
$row->discard();
}
} ])
// ...
run();
What do you think about that?
First thanks for your suggestions :)
I am not sure how a valid
step is different from a transform
one.
I mean, what could you achieve with valid
that you could not with transform
.
$trimTransformer = new Wizaplace\Etl\Transformers\Trim();
$notEmptyValidator = new Wizaplace\Etl\Validators\NotEmpty();
$callbackValidator = new Wizaplace\Etl\Validators\Callback();
$etl
// ...
->transform($trimTransformer)
->transform($notEmptyValidator, ['columns' => 'id'])
->transform($callbackValidator, ['callback' => function (Row $row) {
if ($row->get('id') <= 0) {
$row->discard();
}
} ])
// ...
run();
I have nothing really against this in absolute terms, but as far as i understood, valid
is only a plain alias of transform
, maybe you can elaborate more about your needs.
On another hand, i think your idea of a NotEmpty
validator, (DiscardEmpty
transformer ?) could be a nice addition.
@ecourtial i think you closed this issue a little too soon, i think @qferr was meaning about something different from my RowCallback
idea, what is your opinion about this new step idea ?
@Nicolas-Masson-Wizaplace Thank you. I agree with you, let's create the DiscardEmpty
transformer. I wanted a validator step with a different behavior from a transformer. The goal is not to transform data like a transformer but to validate it. Actually, I don't know where this should be done in the ETL pipeline.
Also, it would be nice to be able to stop the ETL process as soon as an error occurs rather than discarding the row. It would avoid running the system for nothing.
I wanted a validator step with a different behavior from a transformer. The goal is not to transform data like a transformer but to validate it. Actually, I don't know where this should be done in the ETL pipeline.
In an ETL paradigm (see https://en.wikipedia.org/wiki/Extract,_transform,_load), TRANSFORM mean something more generic than mutate data in rows. Validating, normalizing, discarding can be part of this TRANSFORM process.
Also, it would be nice to be able to stop the ETL process as soon as an error occurs rather than discarding the row. It would avoid running the system for nothing.
Keep in mind than ETL is a pipeline (Generator driven in PHP), rows are LOADED as soon they are EXTRACTED (and eventually TRANSFORMED). If you abort an ETL process part of the data are already LOADED.
It looks like you want to do some preprocessing through a first ETL (ET) pipeline to validate the whole EXTRACTED data, then if all is OK, running again the data set through another ETL (EL) pipeline.
Thanks for the details. I appreciate.
Keep in mind than ETL is a pipeline (Generator driven in PHP), rows are LOADED as soon they are EXTRACTED (and eventually TRANSFORMED). If you abort an ETL process part of the data are already LOADED.
What about a process rollback? We could add the rollback()
method in the Step
interface. This way the loaders would be able to remove the added lines. Using a MySQL transaction is a good example for the Insert
loader:
class Insert extends Loader
{
public function initialize()
{
$this->db->pdo($this->connection)->beginTransaction();
}
public function rollback()
{
$this->db->pdo($this->connection)->rollback();
}
public function finalize()
{
$this->db->pdo($this->connection)->commit();
}
}
What do you think?
It looks like you want to do some preprocessing through a first ETL (ET) pipeline to validate the whole EXTRACTED data, then if all is OK, running again the data set through another ETL (EL) pipeline.
Yes I understand. By the way, a best practice is to use staging tables. I could use ETL to load the extracted data into the staging table because in my case I also need to do validation on a dataset, not just a row. Unfortunately, this ETL can't do that because it uses the line-by-line strategy. I could use this ETL to do the first step, loading extracted data in a staging table.
https://www.timmitchell.net/post/2017/06/14/etl-staging-tables/
Here a suggestion :)
https://github.com/wizaplace/php-etl/pull/90/files
If you have any idea of a better implementation feel free to make a PR we are all open to contributions.