googleapis/python-bigquery

`load_table_from_dataframe` doesn't respect `schema` fields order when specified

aaaaahaaaaa opened this issue · 2 comments

When creating a table for the first time, load_table_from_dataframe currently follows the order of the columns in the dataframe instead of the order of the fields in the schema (if specified). As a result, the final schema in BQ is technically different from the one specified.

Additionally, this is inconsistent with load_table_from_json, which does respect the specified schema.

import pandas as pd
from google.cloud import bigquery
from google.cloud.bigquery import LoadJobConfig, SchemaField

client = bigquery.Client()

schema = [
    SchemaField("one", "STRING"),
    SchemaField("two", "STRING"),
    SchemaField("three", "STRING"),
]

df = pd.DataFrame(
    {
        "two": ["whatever"],
        "one": ["whatever"],
        "three": ["whatever"],
    }
)

client.load_table_from_dataframe(
    df,
    "TABLE_ID",
    job_config=LoadJobConfig(
        write_disposition="WRITE_TRUNCATE",
        schema=schema,
    ),
).result()

Resulting BQ schema:

[
  {
    "name": "two",
    "mode": "NULLABLE",
    "type": "STRING",
    "description": null,
    "fields": []
  },
  {
    "name": "one",
    "mode": "NULLABLE",
    "type": "STRING",
    "description": null,
    "fields": []
  },
  {
    "name": "three",
    "mode": "NULLABLE",
    "type": "STRING",
    "description": null,
    "fields": []
  }
]

Thank you @aaaaahaaaaa for raising the issue. This is indeed a behavior difference between load_table_from_json() and load_table_from_dataframe(), but I think these two data types are very different: the fields of a JSON object are unordered, while the columns of a dataframe are ordered. If you really want to ensure the loaded tables have the same column order as the schema you provided, you can reorder the columns locally using methods like the ones provided here.
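As a minimal sketch of that workaround (using the column names from the example above), you can select the dataframe columns in schema order before calling load_table_from_dataframe, so the dataframe's column order matches the schema's field order:

```python
import pandas as pd

# Field names in the order declared in the schema above.
schema_order = ["one", "two", "three"]

# Same dataframe as in the repro, with columns in a different order.
df = pd.DataFrame(
    {
        "two": ["whatever"],
        "one": ["whatever"],
        "three": ["whatever"],
    }
)

# Selecting columns by a list returns a new dataframe with that column order.
df = df[schema_order]

print(list(df.columns))  # ['one', 'two', 'three']
```

After this, passing df to load_table_from_dataframe produces a table whose column order matches the schema.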

I will close the issue now, but feel free to leave a comment here or open a new issue, if you have any further suggestions.