bxparks/bigquery-schema-generator

[BUG] - empty nested records with keep_nulls=False produce a record with no fields

abroglesc opened this issue · 6 comments

Summary

When we remove nulls we are only removing inner nulls and not the case where a record has all fields removed during this process. This produces the following error when attempting to load into BigQuery with the generated schema:
Field outer_nested_record is type RECORD but has no schema.

Example input

test_data.json

{"test": "thing", "empty_record": {}, "outer_nested_record": {"inner_empty_record": {}}}

Command Ran

generate-schema --input_format json --quoted_values_are_strings < test_file.json

Current Output

[
  {
    "fields": [],
    "mode": "NULLABLE",
    "name": "outer_nested_record",
    "type": "RECORD"
  },
  {
    "mode": "NULLABLE",
    "name": "test",
    "type": "STRING"
  }
]

Expected Output

[
  {
    "mode": "NULLABLE",
    "name": "test",
    "type": "STRING"
  }
]

Some additional info, I believe this is caused by the outer_nested_record being marked as "status": "hard" since it contains an inner record

https://github.com/bxparks/bigquery-schema-generator/blob/develop/bigquery_schema_generator/generate_schema.py#L744

I think for this we basically need to check if every inner record has a "filled": false then if so, we can mark this outer as "status": "soft" and "filled": false

Created pull request #56 to remediate this. @bxparks please help take a look at that when you get a chance :)

At this time this doesn't actually load into BigQuery as the empty record exists in the JSON but not the schema. I will think about this a little as it pertains to my data and potentially close this issues/associated PR

I recall that the handling of null, empty record {}, and empty array [] (and nested combinations of those) in the bq load command seemed inconsistent to me, or at least the pattern was not obvious to me. I don't actually know what the proper handling of your example should be.

Closed #56 with

I have found that removing those in the schema generator will not allow bigquery to load the JSON data as there will be an empty record in the JSON but not the schema. Thus I believe this is working as intended and I can close this PR as not needed. If I want those removed I need to perform data cleaning prior to running the schema generator.

Closing this as well.

Instead of running a data cleansing preprocessor, you might be able to generate the schema using bigquery-schema-generator, without --keep_nulls, then import using bq load --ignore_unknown_values, which will ignore those null values. At least, that's my vague recollection from similar problems that I ran into a few years ago when I created this.