bxparks/bigquery-schema-generator

Integers exceeding the BigQuery integer limit are still converted to INTEGER in the schema

alvinburgos opened this issue · 2 comments

To replicate:

test.json:

{"name": "111222333444555666777"}
{"name": "111222333444555666777"}

Expected:

 % python3 -m bigquery_schema_generator.generate_schema --keep_nulls < ../data/test.json
INFO:root:Processed 2 lines
[
  {
    "mode": "NULLABLE",
    "name": "name",
    "type": "STRING"
  }
]

Actual:

 % python3 -m bigquery_schema_generator.generate_schema --keep_nulls < ../data/test.json
INFO:root:Processed 2 lines
[
  {
    "mode": "NULLABLE",
    "name": "name",
    "type": "INTEGER"
  }
]

Thanks for the bug report!
This is an unintended consequence of PR #15, which replicates the behavior of bq load: bq load infers integer and date types inside quoted strings. It looks like we have to limit the integer range to what bq load accepts. I'm going to guess it's either a signed 64-bit integer or a signed 53-bit integer. Probably won't get to this today, but likely tomorrow.
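For reference, a minimal sketch of what such a range check could look like. The INT64_MIN/INT64_MAX constants and the fits_in_int64 helper are illustrative names, not the project's actual code; the range itself is the signed 64-bit range that BigQuery's INTEGER type supports:

# Hypothetical sketch of a signed 64-bit range check; the identifiers
# here are illustrative, not the actual bigquery_schema_generator code.
INT64_MIN = -(2**63)      # -9223372036854775808
INT64_MAX = 2**63 - 1     #  9223372036854775807

def fits_in_int64(value_str):
    """Return True if value_str parses as an integer within the
    signed 64-bit range of BigQuery's INTEGER type."""
    try:
        value = int(value_str)
    except ValueError:
        return False
    return INT64_MIN <= value <= INT64_MAX

# '111222333444555666777' > 2**63 - 1, so it overflows INTEGER:
assert not fits_in_int64('111222333444555666777')
assert fits_in_int64('9223372036854775807')

Note that Python's int() happily parses arbitrarily large values, which is why an explicit range check is needed at all.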

It turns out that when an integer overflows the signed 64-bit range, bq load infers that value to be a FLOAT instead of a STRING. The same thing happens when the integer is inside a quoted string. So in your example, bq load produces:

[
  {
    "mode": "NULLABLE",
    "name": "name",
    "type": "FLOAT"
  }
]

It could be argued that STRING makes more sense for an overflowing integer inside a quoted string, but unfortunately, that's not what bq load seems to do. My fix follows the bq load convention and produces a FLOAT (or a QFLOAT internally for quoted large integers).
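Roughly, the inference for a quoted value then behaves like the following sketch. The infer_quoted_type helper and the QINTEGER label are illustrative assumptions; only QFLOAT is named in the fix itself, and the real code handles more cases (dates, booleans, etc.):

# Hypothetical sketch of quoted-value inference under the bq load
# convention described above; names are illustrative only.
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1

def infer_quoted_type(value_str):
    """Infer the type of a value that appears inside a quoted JSON string."""
    try:
        value = int(value_str)
    except ValueError:
        return 'STRING'    # not an integer at all (other cases omitted here)
    if INT64_MIN <= value <= INT64_MAX:
        return 'QINTEGER'  # in-range quoted integer -> INTEGER in the schema
    return 'QFLOAT'        # overflowing quoted integer -> FLOAT, per bq load

print(infer_quoted_type('111222333444555666777'))  # QFLOAT, surfaced as FLOAT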

This fix is on the develop branch. Do you want to give it a try before I merge it into master?