GoogleCloudPlatform/cloud-data-quality

column_id is NULL for CUSTOM_SQL_STATEMENT rules in summary table

Closed this issue · 5 comments

Screenshot 2022-01-02 at 17 30 07

For rules of type 'CUSTOM_SQL_STATEMENT', the corresponding column_id in the summary table is null.

Is this a desired behaviour ?

Hi there, thanks for trying CloudDQ and leaving feedback.

TL;DR it is intended.
Custom SQL statement rules can technically validate multiple columns (as is the case with the duplicate check right) or any column not referenced in the rule binding, hence why column id is set to null.

Let us know if this doesn't work for you and if so, the use-case you're trying to achieve.

Hi @thinhha, thanks for the quick clarification.

The use-case we want to archive is to have more generic custom tests, especially with [multiple] input parameters while still be able to see the corresponding column in the summary results.

One possible test for checking if values are between are specific range is e.g:

  CUSTOM_SQL_LENGTH_BETWEEN:  
      rule_type: CUSTOM_SQL_STATEMENT 
      dimension: correctness  
      params:  
        custom_sql_arguments:  
          - min  
          - max
        custom_sql_statement: |-  
          SELECT $column  
          FROM data  
          WHERE LENGTH($column) >= $min AND LENGTH($column) <= $max

This is currently not possible with the CUSTOM_SQL_EXPR rule type, as it does not allow for any custom_sql_arguments.
However, with the CUSTOM_SQL_EXPR the corresponding column is referenced in the summary result.

Would one possible solution be, to add arguments to the CUSTOM_SQL_EXPR rule type as well ?

Thanks for the detailed use-case description!

Yes you're right. It makes sense to add parameterization to custom SQL expression rules.

I'll prioritize this FR in the short-term and will update this ticket once it is released.

@tooobsias FYI this change has been implemented here: #124

This will be released in version 0.5.1 which is planned for next week.