bxparks/bigquery-schema-generator

mode conflation in nullable nested, repeated records

daniel-bartley opened this issue · 2 comments

with data in a file:

{ "model": {"data": {"Inventory": {"Observations": [] }}}}
{ "model": {"data": {"Inventory": {"Observations": ["foo"] }}}}

If I manually upload the sample file to BigQuery, I get this schema:

{"name":"model","type":"RECORD","mode":"REPEATED","fields":[
    {"name":"data","type":"RECORD","mode":"REPEATED","fields":[
        {"name":"Inventory","type":"RECORD","mode":"REPEATED","fields":[
            {"name":"Observations","type":"STRING","mode":"NULLABLE"}
        ]}
    ]}
]}
from google.cloud import bigquery
from bigquery_schema_generator.generate_schema import SchemaGenerator

generator = SchemaGenerator(
    infer_mode=True,
    input_format="json",
    quoted_values_are_strings=True,
    preserve_input_sort_order=True,
    keep_nulls=True,
    debugging_map=True,
    sanitize_names=True,
)
with open(file) as f:
    schema_map, errors = generator.deduce_schema(f)
if errors:
    for error in errors:
        print("Problem on line %s: %s" % (error['line_number'], error['msg']))

specs = generator.flatten_schema(schema_map)
return [
    bigquery.SchemaField(
        name=spec["name"], field_type=spec["type"], mode=spec["mode"]
    )
    for spec in specs
]
            

But I get this error from the library:

Ignoring non-RECORD field with mismatched mode: 
old=(hard,model.data.Inventory.Observations,REPEATED,STRING); 
new=(soft,model.data.Inventory.Observations,NULLABLE,STRING)

My questions:

Docs suggest using generator.flatten_schema(schema_map),
but is there an alternative method that returns a list of SchemaField objects in the original nested structure?
Meaning: SchemaFields without the flattening.

Are my batch sizes too big? What's the guidance?
I'm scanning 1000 records with 42 tags in the generated schema but that's after it eliminates all the nesting.
Even on 2MB files with ~280 records I get weird errors.

I get intermittent errors: Problem on line 278: Unsupported array element type: __null__
There are 14 nulls in line 278 and 13 in line 279, but none of them are in an array.

There is a lot to unpack in your post... let me give it a shot...

with data in a file:

{ "model": {"data": {"Inventory": {"Observations": [] }}}}
{ "model": {"data": {"Inventory": {"Observations": ["foo"] }}}}

If I manually upload the sample file to BigQuery, I get this schema:

{"name":"model","type":"RECORD","mode":"REPEATED","fields":[
    {"name":"data","type":"RECORD","mode":"REPEATED","fields":[
        {"name":"Inventory","type":"RECORD","mode":"REPEATED","fields":[
            {"name":"Observations","type":"STRING","mode":"NULLABLE"}
        ]}
    ]}
]}

I don't think you copy-pasted the correct sample data that's causing your problems, because the example above does not cause any problems for me, and generate-schema produces the same schema as bq load:

$ generate-schema
{ "model": {"data": {"Inventory": {"Observations": [] }}}}
{ "model": {"data": {"Inventory": {"Observations": ["foo"] }}}}
^D
[
  {
    "fields": [
      {
        "fields": [
          {
            "fields": [
              {
                "mode": "REPEATED",
                "name": "Observations",
                "type": "STRING"
              }
            ],
            "mode": "NULLABLE",
            "name": "Inventory",
            "type": "RECORD"
          }
        ],
        "mode": "NULLABLE",
        "name": "data",
        "type": "RECORD"
      }
    ],
    "mode": "NULLABLE",
    "name": "model",
    "type": "RECORD"
  }
]

This is exactly what I get from bq after I do a bq load then a bq show --schema:

$ cat > issue94.json
{ "model": {"data": {"Inventory": {"Observations": [] }}}}
{ "model": {"data": {"Inventory": {"Observations": ["foo"] }}}}
^D

$ bq load --source_format NEWLINE_DELIMITED_JSON --autodetect tmp.issue94  issue94.json
...

$ bq show --schema tmp.issue94 | python3 -m json.tool
[
    {
        "name": "model",
        "type": "RECORD",
        "mode": "NULLABLE",
        "fields": [
            {
                "name": "data",
                "type": "RECORD",
                "mode": "NULLABLE",
                "fields": [
                    {
                        "name": "Inventory",
                        "type": "RECORD",
                        "mode": "NULLABLE",
                        "fields": [
                            {
                                "name": "Observations",
                                "type": "STRING",
                                "mode": "REPEATED"
                            }
                        ]
                    }
                ]
            }
        ]
    }
]

With regards to some of the other parts of your post:

from google.cloud import bigquery
from bigquery_schema_generator.generate_schema import SchemaGenerator

generator = SchemaGenerator(
    infer_mode=True,
    input_format="json",
    quoted_values_are_strings=True,
    preserve_input_sort_order=True,
    keep_nulls=True,

The keep_nulls flag is not really what most people want, so I'm going to guess that this is causing some, but maybe not all, of your problems.
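
To illustrate the idea with a toy sketch (deliberately not this library's real inference algorithm): with keep_nulls=True, a key whose values are only ever null or [] still ends up in the schema, while with the default keep_nulls=False it is dropped until a real value appears:

```python
import json

def toy_schema(lines, keep_nulls=False):
    """Toy illustration only (not the library's real algorithm): collect
    keys across JSON lines, optionally dropping keys whose values are only
    ever null or empty arrays."""
    seen = {}
    for line in lines:
        for key, value in json.loads(line).items():
            empty = value is None or value == []
            seen[key] = seen.get(key, False) or not empty
    return sorted(key for key, has_value in seen.items() if keep_nulls or has_value)

lines = ['{"a": 1, "b": null}', '{"a": 2, "b": []}']
print(toy_schema(lines))                   # ['a']
print(toy_schema(lines, keep_nulls=True))  # ['a', 'b']
```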

debugging_map=True,

Setting debugging_map=True is not necessary, because it does not do anything when using the SchemaGenerator from Python. This flag only affects the generate-schema command-line script.

    sanitize_names=True,
)
with open(file) as f:
    schema_map, errors = generator.deduce_schema(f)
if errors:
    for error in errors:
        print("Problem on line %s: %s" % (error['line_number'], error['msg']))

specs = generator.flatten_schema(schema_map)
return [
    bigquery.SchemaField(
        name=spec["name"], field_type=spec["type"], mode=spec["mode"]
    )
    for spec in specs
]
            

But I get this error from the library:

Ignoring non-RECORD field with mismatched mode: 
old=(hard,model.data.Inventory.Observations,REPEATED,STRING); 
new=(soft,model.data.Inventory.Observations,NULLABLE,STRING)

Hard to say without looking at the input data.
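
For reference, here is a toy sketch (again, not the library's actual inference code) of the kind of hypothetical input that typically produces a REPEATED vs NULLABLE mismatch: the same field appearing as an array on one line and as a scalar on another.

```python
import json

def toy_mode(value):
    # Toy mode inference for illustration only (not the library's algorithm):
    # arrays map to REPEATED, everything else to NULLABLE.
    return "REPEATED" if isinstance(value, list) else "NULLABLE"

# Hypothetical input where the same field flips between array and scalar.
lines = ['{"Observations": ["foo"]}', '{"Observations": "bar"}']
modes = sorted({toy_mode(json.loads(line)["Observations"]) for line in lines})
print(modes)  # ['NULLABLE', 'REPEATED'] -> mismatched modes across lines
```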

My questions:

Docs suggest using generator.flatten_schema(schema_map), but is there an alternative method that returns a list of SchemaField objects in the original nested structure? Meaning: SchemaFields without the flattening.

I'm not sure what you are asking. The schema_map is internal raw bookkeeping used by the script. You can just print it out if you want the raw information; that's what the generate-schema --debugging_map flag does. The flatten_schema() function just collapses it into a format compatible with the bq load --schema flag.
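
If the goal is SchemaField objects that preserve nesting, one option (a sketch, not a documented API of this library) is to recurse over the nested JSON schema that generate-schema emits. Here a Field namedtuple is a hypothetical stand-in for google.cloud.bigquery.SchemaField, so the sketch runs without the client library installed:

```python
from collections import namedtuple

# "Field" is a hypothetical stand-in for google.cloud.bigquery.SchemaField.
Field = namedtuple("Field", ["name", "field_type", "mode", "fields"])

def to_schema_fields(specs):
    """Recursively convert a nested JSON schema (the shape generate-schema
    prints) into nested Field objects, preserving the RECORD structure."""
    return [
        Field(
            name=spec["name"],
            field_type=spec["type"],
            mode=spec["mode"],
            fields=to_schema_fields(spec.get("fields", [])),
        )
        for spec in specs
    ]

schema = [
    {"name": "model", "type": "RECORD", "mode": "NULLABLE", "fields": [
        {"name": "data", "type": "RECORD", "mode": "NULLABLE", "fields": []},
    ]},
]
fields = to_schema_fields(schema)
print(fields[0].name, fields[0].fields[0].name)  # model data
```

With the real client library installed, SchemaField.from_api_repr should accept the same nested dict shape directly.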

Are my batch sizes too big? What's the guidance? I'm scanning 1000 records with 42 tags in the generated schema but that's after it eliminates all the nesting. Even on 2MB files with ~280 records I get weird errors.

I get intermittent errors: Problem on line 278: Unsupported array element type: __null__ There are 14 nulls in line 278 and 13 in line 279, but none of them are in an array.

Again, I don't think I can answer these questions without looking at the actual data...
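
That said, one plausible trigger for the __null__ message (an assumption on my part, based on BigQuery's rule that REPEATED fields cannot contain NULL elements) is an explicit null inside an array, e.g. this hypothetical line:

```json
{"Observations": [null]}
```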

Closing due to no response from issue creator.