Integers exceeding the BigQuery INTEGER limit are still typed as INTEGER in the schema
alvinburgos opened this issue · 2 comments
To replicate, with test.json:

{"name": "111222333444555666777"}
{"name": "111222333444555666777"}
Expected:
% python3 -m bigquery_schema_generator.generate_schema --keep_nulls < ../data/test.json
INFO:root:Processed 2 lines
[
{
"mode": "NULLABLE",
"name": "name",
"type": "STRING"
}
]
Actual:
% python3 -m bigquery_schema_generator.generate_schema --keep_nulls < ../data/test.json
INFO:root:Processed 2 lines
[
{
"mode": "NULLABLE",
"name": "name",
"type": "INTEGER"
}
]
Thanks for the bug report!

This is an unintended consequence of PR #15, which replicates the behavior of `bq load` by inferring the integer and date types inside quoted strings. It looks like we have to limit the integer range to what `bq load` accepts; I'm going to guess it's either a 64-bit signed integer or a 53-bit signed integer. I probably won't get to this today, but likely tomorrow.
It turns out that when an integer overflows a signed 64-bit range, `bq load` infers that value to be a FLOAT instead of a STRING. The same thing happens when the integer is inside a quoted string. So in your example, `bq load` produces:
[
{
"mode": "NULLABLE",
"name": "name",
"type": "FLOAT"
}
]
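For reference, the reported value really does overflow a signed 64-bit integer, which can be checked directly (a quick illustrative check, not part of the library):

```python
# BigQuery's INTEGER type is a signed 64-bit integer, so its maximum is 2**63 - 1.
value = 111222333444555666777

print(value > 2**63 - 1)  # True: the value exceeds the INTEGER range
print(float(value))       # the lossy FLOAT representation bq load falls back to
```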
It could be argued that STRING makes more sense for an overflowing integer inside a quoted string, but unfortunately that's not what `bq load` seems to do. My fix follows the `bq load` convention and produces a FLOAT (or a QFLOAT internally for quoted large integers).
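The rule described above can be sketched roughly as follows. This is a minimal illustration of the overflow convention, not the library's actual code; the function and constant names here are hypothetical:

```python
# Illustrative sketch of the bq load overflow rule; names are hypothetical,
# not the actual bigquery_schema_generator internals.

# BigQuery's INTEGER type is a signed 64-bit integer.
INTEGER_MIN = -(2**63)
INTEGER_MAX = 2**63 - 1

def infer_number_type(token, quoted):
    """Integers that overflow signed 64 bits become FLOAT; when the value
    appeared inside a quoted string, the quoted variant (QFLOAT) is used."""
    value = int(token)  # raises ValueError if the token is not an integer
    if INTEGER_MIN <= value <= INTEGER_MAX:
        return "QINTEGER" if quoted else "INTEGER"
    return "QFLOAT" if quoted else "FLOAT"

print(infer_number_type("123", quoted=False))                   # INTEGER
print(infer_number_type("111222333444555666777", quoted=True))  # QFLOAT
```

Under this rule, the `"111222333444555666777"` values from the issue are first recognized as quoted integers, then demoted to QFLOAT because they exceed the signed 64-bit range, which is why the final schema says FLOAT rather than STRING.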
This fix is on the `develop` branch. Do you want to give it a try before I merge it into `master`?