mode conflation in nullable nested, repeated records
daniel-bartley opened this issue · 2 comments
with data in a file:
{ "model": {"data": {"Inventory": {"Observations": [] }}}}
{ "model": {"data": {"Inventory": {"Observations": ["foo"] }}}}
If I manually upload the sample file to BigQuery, I get this schema:
{"name":"model","type":"RECORD","mode":"REPEATED","fields":[
{"name":"data","type":"RECORD","mode":"REPEATED","fields":[
{"name":"Inventory","type":"RECORD","mode":"REPEATED","fields":[
{"name":"Observations","type":"STRING","mode":"NULLABLE"}
]}
}
}
from google.cloud import bigquery
from bigquery_schema_generator.generate_schema import SchemaGenerator

generator = SchemaGenerator(
    infer_mode=True,
    input_format="json",
    quoted_values_are_strings=True,
    preserve_input_sort_order=True,
    keep_nulls=True,
    debugging_map=True,
    sanitize_names=True,
)
with open(file) as f:
    schema_map, errors = generator.deduce_schema(f)
if errors:
    for error in errors:
        print("Problem on line %s: %s" % (error['line_number'], error['msg']))
specs = generator.flatten_schema(schema_map)
return [
    bigquery.SchemaField(
        name=spec["name"], field_type=spec["type"], mode=spec["mode"]
    )
    for spec in specs
]
But I get this error from the library:
Ignoring non-RECORD field with mismatched mode:
old=(hard,model.data.Inventory.Observations,REPEATED,STRING);
new=(soft,model.data.Inventory.Observations,NULLABLE,STRING)
My questions:
Docs suggest using generator.flatten_schema(schema_map)
but is there an alternative method to get a list of SchemaField in the original nested structure?
meaning: SchemaField objects without the flattening.
Are my batch sizes too big? What's the guidance?
I'm scanning 1000 records with 42 tags in the generated schema but that's after it eliminates all the nesting.
Even on 2MB files with ~280 records I get weird errors.
I get intermittent errors: Problem on line 278: Unsupported array element type: __null__
There are 14 nulls in line 278 and 13 in line 279, but none of them are in an array.
There is a lot to unpack in your post... let me give it a shot...
with data in a file:
{ "model": {"data": {"Inventory": {"Observations": [] }}}} { "model": {"data": {"Inventory": {"Observations": ["foo"] }}}}
If I manually upload the sample file to BigQuery, I get this schema:
{"name":"model","type":"RECORD","mode":"REPEATED","fields":[
  {"name":"data","type":"RECORD","mode":"REPEATED","fields":[
    {"name":"Inventory","type":"RECORD","mode":"REPEATED","fields":[
      {"name":"Observations","type":"STRING","mode":"NULLABLE"}
    ]}
  ]}
]}
I don't think you copy-pasted the sample data that's actually causing your problems, because the example above works fine for me: generate-schema produces the same schema as bq load:
$ generate-schema
{ "model": {"data": {"Inventory": {"Observations": [] }}}}
{ "model": {"data": {"Inventory": {"Observations": ["foo"] }}}}
^D
[
  {
    "fields": [
      {
        "fields": [
          {
            "fields": [
              {
                "mode": "REPEATED",
                "name": "Observations",
                "type": "STRING"
              }
            ],
            "mode": "NULLABLE",
            "name": "Inventory",
            "type": "RECORD"
          }
        ],
        "mode": "NULLABLE",
        "name": "data",
        "type": "RECORD"
      }
    ],
    "mode": "NULLABLE",
    "name": "model",
    "type": "RECORD"
  }
]
This is exactly what I get from bq after I do a bq load and then a bq show --schema:
$ cat > issue94.json
{ "model": {"data": {"Inventory": {"Observations": [] }}}}
{ "model": {"data": {"Inventory": {"Observations": ["foo"] }}}}
^D
$ bq load --source_format NEWLINE_DELIMITED_JSON --autodetect tmp.issue94 issue94.json
...
$ bq show --schema tmp.issue94 | python3 -m json.tool
[
  {
    "name": "model",
    "type": "RECORD",
    "mode": "NULLABLE",
    "fields": [
      {
        "name": "data",
        "type": "RECORD",
        "mode": "NULLABLE",
        "fields": [
          {
            "name": "Inventory",
            "type": "RECORD",
            "mode": "NULLABLE",
            "fields": [
              {
                "name": "Observations",
                "type": "STRING",
                "mode": "REPEATED"
              }
            ]
          }
        ]
      }
    ]
  }
]
With regards to some of the other parts of your post:
from bigquery_schema_generator.generate_schema import SchemaGenerator
generator = SchemaGenerator(
    infer_mode=True,
    input_format="json",
    quoted_values_are_strings=True,
    preserve_input_sort_order=True,
    keep_nulls=True,
The keep_nulls flag is not really what most people want, so I'm going to guess that this is causing some, but maybe not all, of your problems.
debugging_map=True,
The debugging_map=True is not necessary, because it does not do anything when using the SchemaGenerator from Python. This flag only affects the generate-schema command-line script.
    sanitize_names=True,
)
with open(file) as f:
    schema_map, errors = generator.deduce_schema(f)
if errors:
    for error in errors:
        print("Problem on line %s: %s" % (error['line_number'], error['msg']))
specs = generator.flatten_schema(schema_map)
return [
    bigquery.SchemaField(
        name=spec["name"], field_type=spec["type"], mode=spec["mode"]
    )
    for spec in specs
]
But I get this error from the library:
Ignoring non-RECORD field with mismatched mode:
old=(hard,model.data.Inventory.Observations,REPEATED,STRING);
new=(soft,model.data.Inventory.Observations,NULLABLE,STRING)
Hard to say without looking at the input data.
My questions:
Docs suggest using generator.flatten_schema(schema_map)
but is there an alternative method to get a list of SchemaField in the original nested structure?
meaning: schemaFields without the flatten.
I'm not sure what you are asking. The schema_map is internal raw bookkeeping used by the script. You can just print it out if you want the raw information. That's what the generate-schema --debugging_map flag does. The flatten_schema() function just collapses it into a format compatible with the bq load --schema flag.
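If the goal is SchemaField objects that keep the original nesting, note that the bq-compatible schema produced by flatten_schema() still nests RECORD sub-fields under a "fields" key, so you can walk it recursively instead of dropping that key. A minimal sketch, assuming that output shape (a namedtuple stands in for bigquery.SchemaField here; the real class accepts the same name, field_type, mode, and fields arguments):

```python
from collections import namedtuple

# Stand-in for google.cloud.bigquery.SchemaField, which takes the same arguments.
SchemaField = namedtuple("SchemaField", ["name", "field_type", "mode", "fields"])

def to_schema_fields(specs):
    """Recursively convert bq-style schema dicts into nested SchemaField objects."""
    return [
        SchemaField(
            name=spec["name"],
            field_type=spec["type"],
            mode=spec["mode"],
            # Recurse into RECORD sub-fields so the nesting is preserved.
            fields=to_schema_fields(spec.get("fields", [])),
        )
        for spec in specs
    ]

# Example input in the shape of a bq JSON schema.
specs = [
    {"name": "model", "type": "RECORD", "mode": "NULLABLE", "fields": [
        {"name": "data", "type": "RECORD", "mode": "NULLABLE", "fields": [
            {"name": "Observations", "type": "STRING", "mode": "REPEATED"},
        ]},
    ]},
]
nested = to_schema_fields(specs)
print(nested[0].fields[0].fields[0].name)  # Observations
```

With google-cloud-bigquery installed, swapping the namedtuple for bigquery.SchemaField and passing fields= the same way should give the nested structure you were after.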
Are my batch sizes too big? What's the guidance?
I'm scanning 1000 records with 42 tags in the generated schema but that's after it eliminates all the nesting.
Even on 2MB files with ~280 records I get weird errors.
I get intermittent errors:
Problem on line 278: Unsupported array element type: __null__
There are 14 nulls in line 278 and 13 in line 279, but none of them are in an array.
Again, I don't think I can answer these questions without looking at the actual data...
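One guess about the "Unsupported array element type: __null__" error: the generator represents JSON null internally as __null__, so a likely trigger is a null element inside an array somewhere in the record, even if the standalone nulls you counted are not in arrays. A quick stdlib scan can narrow down the offending lines (a hypothetical helper, not part of this library):

```python
import json

def lines_with_null_in_array(lines):
    """Return 1-based line numbers of NDJSON lines containing an array with a null element."""
    def has_null_in_array(value):
        if isinstance(value, list):
            # A None element inside a list is the suspected trigger.
            return any(v is None or has_null_in_array(v) for v in value)
        if isinstance(value, dict):
            return any(has_null_in_array(v) for v in value.values())
        return False

    return [i for i, line in enumerate(lines, 1) if has_null_in_array(json.loads(line))]

data = [
    '{"a": {"b": null}}',          # null, but not inside an array
    '{"a": {"b": [1, null, 2]}}',  # null inside an array
]
print(lines_with_null_in_array(data))  # [2]
```

Running this over the 2MB file and comparing the reported line numbers against the generator's error lines would confirm or rule out this theory.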
Closing due to no response from issue creator.