cal-itp/data-infra

Bug: Numeric Column Name Breaking Airtable Quality Pipeline

Closed this issue · 1 comments

Describe the bug
As described here in the Cal-ITP Slack, the external table creation process for the Airtable-sourced transit_data_quality_issues__services model has been erroring for a couple days, which is causing a couple additional downstream errors. This is because of an unhandled column in the source Airtable whose name begins with a number, unsupported by BQ column naming conventions.

To Reproduce
Any attempt to create an external table using a column whose name begins with a numeric character will fail in the same way that the external_airtable_transit_data_quality_issues_services task is currently failing in Airflow - example logs here.

Expected behavior
Evan has indicated that the column in question need not be present on the BQ version of this table, so it can either be handled via a rename or ignored entirely. Once that's done, downstream modeling impacts on the dbt_run_and_upload_artifacts DAG and dbt tests should passively resolve.

Additional context
Airtable column names that begin with numeric characters are not new to this ecosystem, and we've handled them gracefully before. Refer to treatment of e.g. external_airtable.california_transit__services for how to navigate such column names in a fashion that's canonically aligned with the rest of our data infra.

This task will be completed in concert with Vivek.

From what I can gather, currently external_airtable_transit_data_quality_issues_services does not have a custom schema specified. Thus all elements in the jsonl are used from the original table. When bq tries to use the field 2021_ntd_id__from_provider_ it fails to make the column and the table.

For external_airtable.california_transit__services there is a custom schema where 2021_ntd_id__from_provider_ is omitted, thus the table builds correctly.

One solution is to make a custom schema for external_airtable_transit_data_quality_issues_services.

Alternatively, we can modify

. Right now the behavior doesn't handle numbers as the first character. We can modify this to insert an _ if a numeral is detected at the first character. We should do some logging to see how many different fields are manipulated so there isn't any unintended consequences.