debussy-labs/debussy_concert

Wrong data types for empty parquet load into BigQuery

Closed this issue · 1 comments

When loading an empty parquet file into BigQuery (through the job API) into a table that uses the STRING type for all fields, some fields get inferred as numeric types (for example INTEGER), resulting in an error such as the following:

BigQuery error in load operation: Error processing job '[project]:[job]': Provided Schema does not match Table [project]:[dataset].[table]. Field [attribut] has changed type from STRING to FLOAT

Related to this Stack Overflow question.

Environment details

  • Environment: Google Cloud Composer 1.17.9
  • Airflow version: 2.1.4

Steps to reproduce

  1. Create an empty parquet file
  2. Create a BigQuery table with the same schema as that file, but with all fields as STRING
  3. Create a data pipeline using the DataIngestionBase Composition, to read the file and load it into the BigQuery table

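After running some tests with the bq CLI, we found that the cause was the missing schema in the BigQuery load job: without an explicit schema, the field types are taken from the parquet file instead of from the table definition.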
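
For illustration, here is a minimal sketch of a load job that carries an explicit schema, written as the request body for the BigQuery REST `jobs.insert` method. The helper name `build_load_job_body`, the project, dataset, table, bucket URI, and column names are all hypothetical; this is not the code from the hotfix, only an example of where the schema sits in the job configuration.

```python
import json


def build_load_job_body(project, dataset, table, source_uri, field_names):
    """Build a jobs.insert request body that declares every column as STRING.

    Hypothetical helper for illustration; not part of debussy_concert.
    """
    schema_fields = [
        {"name": name, "type": "STRING", "mode": "NULLABLE"}
        for name in field_names
    ]
    return {
        "configuration": {
            "load": {
                "sourceUris": [source_uri],
                "sourceFormat": "PARQUET",
                "destinationTable": {
                    "projectId": project,
                    "datasetId": dataset,
                    "tableId": table,
                },
                # Explicit schema, so the load job does not rely on the
                # types found in the (possibly empty) parquet file.
                "schema": {"fields": schema_fields},
                "writeDisposition": "WRITE_APPEND",
            }
        }
    }


# Placeholder values, assumed for the example only.
body = build_load_job_body(
    project="my-project",
    dataset="my_dataset",
    table="my_table",
    source_uri="gs://my-bucket/empty.parquet",
    field_names=["id", "price"],
)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the JSON body of a `jobs.insert` call; the same schema could also be supplied through a client library or the bq CLI.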

@NiltonDuarte has already issued a hotfix, and the final solution will be implemented on the same branch, fix_bq_load_with_schema.

In short, we'll provide the table schema to the load job. This requires a new field in the table definition yaml files, called is_metadata, which marks the metadata fields so that they can be ignored.
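
As a rough sketch of how the is_metadata flag could be used, the function below builds the load job schema from a table definition and skips the fields flagged as metadata. Only the is_metadata key comes from this issue; the layout of the table definition (a `fields` list with `name` and `data_type` keys), the function name, and the sample columns are assumptions made for the example, and the final implementation on the branch may differ. The table definition is shown as an already parsed dictionary to keep the example free of a yaml dependency.

```python
def schema_for_load_job(table_definition):
    """Return BigQuery schema fields, skipping fields flagged is_metadata.

    Hypothetical helper; the real table definition layout may differ.
    """
    schema_fields = []
    for field in table_definition["fields"]:
        # Fields without the flag are treated as regular data fields.
        if field.get("is_metadata", False):
            continue
        schema_fields.append(
            {"name": field["name"], "type": field["data_type"], "mode": "NULLABLE"}
        )
    return schema_fields


# Stand-in for the parsed content of a table definition yaml file (assumed layout).
table_definition = {
    "fields": [
        {"name": "id", "data_type": "STRING"},
        {"name": "amount", "data_type": "STRING"},
        {"name": "_load_timestamp", "data_type": "TIMESTAMP", "is_metadata": True},
    ]
}

load_schema = schema_for_load_job(table_definition)
print([field["name"] for field in load_schema])
```

Defaulting the flag to false keeps existing table definitions valid: only the metadata fields need the new key.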