GoogleCloudPlatform/cloud-data-quality

Allow multiple row_filters and params to row_filters

bogo96 opened this issue · 4 comments

Is there a reason that multiple row_filters, and params for row_filters (as rule_ids already allow), are not supported?
While using clouddq, I have needed both in some situations.

If params were allowed, row_filters could be written generically: one row_filter could be reused across many entities.
Sometimes we also want to filter by a column other than the one used in the rule_binding.
I suggest a change like the one below, so that the email row_filter can be used with any entity and any given column.

row_filters:
  NONE:
    filter_sql_expr: |-
      True

  DATA_TYPE_EMAIL:
    params:
      - column
    filter_sql_expr: |-
      $column = 'email'

rule_bindings:
  T2_DQ_1_EMAIL:
    entity_id: TEST_TABLE
    column_id: VALUE
    row_filter_ids:
      - DATA_TYPE_EMAIL:
          column: contact_type
    rule_ids:
      - NOT_NULL_SIMPLE
      - REGEX_VALID_EMAIL
      - CUSTOM_SQL_LENGTH:
          upper_bound: 30
      - NOT_BLANK
    metadata:
      brand: one

Is there any way to do this currently? If not, may I add it?

Hey @bogo96, thanks for opening this issue. I agree that we should 1) allow multiple row_filters in a rule_binding and 2) allow parametrization of row_filters.

This would be a fairly involved PR, but I'm happy to support you in this if you are up for it!

You need to update the DqRowFilter class to support taking parameters and add a method that resolves these parameters into a valid SQL string. See below for example code where we've added parametrization to a CUSTOM_SQL_EXPR rule_type:
https://github.com/GoogleCloudPlatform/cloud-data-quality/blob/main/clouddq/classes/dq_rule_binding.py#L246
https://github.com/GoogleCloudPlatform/cloud-data-quality/pull/124/files#diff-53a624ae4d4f1930d8966d1ddd72980a9e347567cd8b9f05ca3018eb739ebae8
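
As a rough illustration of the shape this could take, here is a minimal sketch. It is not the actual DqRowFilter implementation: the class `ParametrizedRowFilter`, the method `resolve_sql_expr`, and the helper `combine_filters` are hypothetical names. It assumes the `$name` placeholder syntax from the proposal above, assumes that arguments are column identifiers, and assumes that multiple row filters in a rule binding would be combined with AND (that part of the interface is not decided here).

```python
import re
from dataclasses import dataclass, field
from string import Template
from typing import Dict, List

# Assumption: arguments are plain column identifiers, so anything else is
# rejected to avoid injecting arbitrary SQL into the filter expression.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


@dataclass
class ParametrizedRowFilter:
    """Hypothetical stand-in for illustration; not the real clouddq class."""

    row_filter_id: str
    filter_sql_expr: str
    params: List[str] = field(default_factory=list)

    def resolve_sql_expr(self, arguments: Dict[str, str]) -> str:
        """Substitute the declared params into filter_sql_expr."""
        missing = set(self.params) - set(arguments)
        if missing:
            raise ValueError(
                f"{self.row_filter_id}: missing params {sorted(missing)}"
            )
        unexpected = set(arguments) - set(self.params)
        if unexpected:
            raise ValueError(
                f"{self.row_filter_id}: unexpected params {sorted(unexpected)}"
            )
        for name, value in arguments.items():
            if not IDENTIFIER.fullmatch(value):
                raise ValueError(
                    f"{self.row_filter_id}: invalid value for '{name}'"
                )
        # string.Template uses the same $name placeholders as the proposal.
        return Template(self.filter_sql_expr).substitute(arguments)


def combine_filters(resolved_exprs: List[str]) -> str:
    """Join several resolved filters with AND (an assumed semantics)."""
    if not resolved_exprs:
        return "True"
    return " AND ".join(f"({expr})" for expr in resolved_exprs)


email_filter = ParametrizedRowFilter(
    row_filter_id="DATA_TYPE_EMAIL",
    filter_sql_expr="$column = 'email'",
    params=["column"],
)
none_filter = ParametrizedRowFilter(row_filter_id="NONE", filter_sql_expr="True")

resolved = [
    email_filter.resolve_sql_expr({"column": "contact_type"}),
    none_filter.resolve_sql_expr({}),
]
print(combine_filters(resolved))
```

Validating arguments before substitution keeps a rule binding from smuggling arbitrary SQL into the filter; a real implementation might want a different policy if params are meant to carry literals rather than column names.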

Hi @thinhha @bogo96, is this issue in progress or blocked? I definitely think that a row_filter with params would be much more efficient than the current definition, where we need to define a distinct row_filter for each column_name that we want to use across all tables, or else redefine entities.

Hi @yankisimo, thanks for checking in. We've started adding test cases to @bogo96's original PR in
https://github.com/GoogleCloudPlatform/cloud-data-quality/pull/205/files but this work was blocked on a few pending decisions on the API interface for defining multiple row filters in a rule binding, as well as a few higher-priority bugfixes.

Apologies for the delay. We will update the ticket once the feature is closer to completion.

Hi @thinhha, sure, and thanks for the update. If there's any way I can help or contribute, please let me know. On our end, we are thrilled about the significant contributions of this library and its usage in Dataplex. We have a few production projects relying on these new features and we are eagerly awaiting their availability.