googleapis/python-bigquery-pandas

Attempting to change an existing datatype on a BQ table through `table_schema` and `if_exists="replace"` fails

Imperssonator opened this issue · 7 comments

Hello, I'm running into an issue that is proving very difficult to resolve. I've searched extensively on Stack Overflow and elsewhere, and I'm now hitting this in multiple places.

When I have a table in BigQuery with an existing schema, and I upload new data from pandas with a table_schema that changes one of the existing datatypes (say, from STRING to INTEGER) and if_exists="replace", the upload fails with the following error (see the stack trace below; the calling function from our internal library is omitted):

Expected a string or bytes dtype, got int64

But the table_schema entry for this column, let's call it column_1, is:
{'name': 'column_1', 'type': 'INT64'}
and df['column_1'].dtype yields 'Int64'.

Two interesting facts that make this harder to figure out:

  1. I was previously running this on pandas 1.4.2 / pandas-gbq 0.17.2 and it was not an issue; we switched datatypes in BQ all the time.
  2. When I omit column_1 from the upload, it works: it replaces the table and overwrites the schema (minus column_1). When I then overwrite again with the full dataframe (including column_1), that works too, and it applies the INTEGER datatype in BigQuery (note that it shows INTEGER rather than INT64, even though I specify INT64 in the table_schema). A sketch of this two-step workaround follows the next paragraph.

This suggests to me that it's not an issue with the underlying datatypes in the dataframe or the table_schema, because upload to a fresh BQ table works. For some reason, pandas-gbq is having a hard time changing existing datatypes in GBQ schemas even with full "replace" mode. I will note that I've been running into many similar issues with DATE/TIME types since upgrading as well, so it feels unlikely to me that this is isolated to just INTs.
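For reference, the two-step workaround from point 2 above looks roughly like this (just a sketch; the variable names mirror the upload call shown in the code example further down):

# Step 1: replace the table without the column whose datatype is changing.
df_bq.drop(columns=["column_1"]).to_gbq(
    destination_table=dest_bq_dataset_table,
    project_id=dest_project_id,
    table_schema=[f for f in table_schema_bq if f["name"] != "column_1"],
    if_exists="replace",
    location=dest_bq_location,
)

# Step 2: replace again with the full dataframe; the INTEGER type is now applied.
df_bq.to_gbq(
    destination_table=dest_bq_dataset_table,
    project_id=dest_project_id,
    table_schema=table_schema_bq,
    if_exists="replace",
    location=dest_bq_location,
)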

Environment details

  • OS type and version: macOS Ventura 13.3.1
  • Python version: 3.11.4
  • pip version: 23.2.1

Running on a conda env with:
pandas 2.1.0 pypi_0 pypi
pandas-gbq 0.19.2 pypi_0 pypi
pyarrow 13.0.0 pypi_0 pypi
google-api-core 2.11.1 pypi_0 pypi
google-api-python-client 2.98.0 pypi_0 pypi
google-auth 2.22.0 pypi_0 pypi
google-auth-httplib2 0.1.0 pypi_0 pypi
google-auth-oauthlib 1.0.0 pypi_0 pypi
google-cloud-bigquery 3.11.4 pypi_0 pypi
google-cloud-bigquery-storage 2.22.0 pypi_0 pypi
google-cloud-core 2.3.3 pypi_0 pypi
google-cloud-storage 2.10.0 pypi_0 pypi
google-crc32c 1.5.0 pypi_0 pypi
google-resumable-media 2.6.0 pypi_0 pypi
googleapis-common-protos 1.60.0 pypi_0 pypi
pydata-google-auth 1.8.2 pypi_0 pypi

Steps to reproduce

  1. Have an existing BigQuery table populated with data and a field (let's call it column_1) of datatype "STRING"
  2. Attempt to upload a new DataFrame (with if_exists="replace") to that table where column_1 has pandas dtype Int64 and table_schema specifies 'type': 'INT64'
  3. The error pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64 is thrown

Code example

I can't provide the entire code sample, but the final upload statement looks like:

df_bq.to_gbq(
    destination_table=dest_bq_dataset_table,
    project_id=dest_project_id,
    table_schema=table_schema_bq,
    if_exists="replace",
    location=dest_bq_location,
)

  • df_bq has a column with Int64 datatype
  • table_schema_bq contains {'name': 'column_1', 'type': 'INT64'}
  • The actual schema on dest_bq_dataset_table has STRING for column_1

EDIT: I wrote a really simple test case that yields the same behavior:

import pandas as pd

proj_id = "" # Redacted
ds_id = "test_dataset"
table_id = "test_table_1"

df = pd.DataFrame({
    'col_1': ["1", "2", "3"]
})
print(df)
print(df.dtypes)

table_schema_str = [{
    'name': 'col_1', 'type': 'STRING'
}]

# First upload: creates/replaces the table with col_1 as STRING (succeeds).
df.to_gbq(
    f"{ds_id}.{table_id}",
    project_id=proj_id,
    table_schema=table_schema_str,
    if_exists="replace"
)

df_int = df.copy()
df_int['col_1'] = df_int['col_1'].astype('Int64')
print(df_int)
print(df_int.dtypes)

table_schema_int = [{
    'name': 'col_1', 'type': 'INT64'
}]

# Second upload: same table, col_1 now Int64 / INT64 (raises ArrowTypeError).
df_int.to_gbq(
    f"{ds_id}.{table_id}",
    project_id=proj_id,
    table_schema=table_schema_int,
    if_exists="replace"
)

The first df.to_gbq call works; the second one (df_int.to_gbq) then throws the same error as above.

  • This fails with any combination of "INT64" / "INTEGER" in table_schema and astype("Int64") / astype(int) in the dataframe.
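For concreteness, the combinations I tried look roughly like this (just a sketch, reusing the names from the test case above):

# Either dtype on the dataframe side...
df_int['col_1'] = df_int['col_1'].astype('Int64')   # pandas nullable integer
# df_int['col_1'] = df_int['col_1'].astype(int)     # plain numpy int64

# ...combined with either spelling of the BigQuery type...
table_schema_int = [{'name': 'col_1', 'type': 'INT64'}]   # or 'INTEGER'

# ...produces the same ArrowTypeError on the second to_gbq call.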

Stack trace

File "/Users/nils/anaconda3/envs/streamlit/lib/python3.11/site-packages/pandas/core/frame.py", line 2161, in to_gbq
    gbq.to_gbq(
  File "/Users/nils/anaconda3/envs/streamlit/lib/python3.11/site-packages/pandas/io/gbq.py", line 223, in to_gbq
    pandas_gbq.to_gbq(
  File "/Users/nils/anaconda3/envs/streamlit/lib/python3.11/site-packages/pandas_gbq/gbq.py", line 1220, in to_gbq
    connector.load_data(
  File "/Users/nils/anaconda3/envs/streamlit/lib/python3.11/site-packages/pandas_gbq/gbq.py", line 602, in load_data
    chunks = load.load_chunks(
             ^^^^^^^^^^^^^^^^^
  File "/Users/nils/anaconda3/envs/streamlit/lib/python3.11/site-packages/pandas_gbq/load.py", line 243, in load_chunks
    load_parquet(
  File "/Users/nils/anaconda3/envs/streamlit/lib/python3.11/site-packages/pandas_gbq/load.py", line 131, in load_parquet
    client.load_table_from_dataframe(
  File "/Users/nils/anaconda3/envs/streamlit/lib/python3.11/site-packages/google/cloud/bigquery/client.py", line 2702, in load_table_from_dataframe
    _pandas_helpers.dataframe_to_parquet(
  File "/Users/nils/anaconda3/envs/streamlit/lib/python3.11/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 719, in dataframe_to_parquet
    arrow_table = dataframe_to_arrow(dataframe, bq_schema)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nils/anaconda3/envs/streamlit/lib/python3.11/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 662, in dataframe_to_arrow
    bq_to_arrow_array(get_column_or_index(dataframe, bq_field.name), bq_field)
  File "/Users/nils/anaconda3/envs/streamlit/lib/python3.11/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 380, in bq_to_arrow_array
    return pyarrow.Array.from_pandas(series, type=arrow_type)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/array.pxi", line 1099, in pyarrow.lib.Array.from_pandas
  File "pyarrow/array.pxi", line 275, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
  File "/Users/nils/anaconda3/envs/streamlit/lib/python3.11/site-packages/pandas/core/arrays/masked.py", line 662, in __arrow_array__
    return pa.array(self._data, mask=self._mask, type=type)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/array.pxi", line 323, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 123, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64

+1

I believe this is due to an extra call to update_schema that was inserted in response to an issue with NULLABLE/REQUIRED fields:

https://github.com/googleapis/python-bigquery-pandas/blob/b350452cbc7b85efa4d0e0c6ec545f04b30c80a5/pandas_gbq/gbq.py#L1213C70-L1213C85

It appears that pandas-gbq now reads the existing schema from the destination table and overwrites the fields/datatypes of the supplied table_schema with whatever is already in BQ. This is IMO not expected behavior for the case where you're replacing an entire table; it almost defeats the purpose of having a user-supplied table_schema in the first place.
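To illustrate the effect I'm seeing (this is not the actual pandas-gbq code, just a toy reproduction of the behavior): the type already in BigQuery wins over the one the caller supplied, and pyarrow is then asked to cast the column to that old type.

# Toy illustration only; not the real update_schema implementation.
user_schema = [{"name": "col_1", "type": "INT64"}]       # what I pass in
existing_schema = [{"name": "col_1", "type": "STRING"}]  # what's already in BQ

existing_types = {f["name"]: f["type"] for f in existing_schema}
merged = [
    {**f, "type": existing_types.get(f["name"], f["type"])}
    for f in user_schema
]
print(merged)
# [{'name': 'col_1', 'type': 'STRING'}]
# The Int64 column is then handed to pyarrow as a STRING field, which raises
# "Expected a string or bytes dtype, got int64".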

Further digging revealed that a previous issue (#315) noted that, when using if_exists="append", pandas-gbq should retrieve the existing schema from BQ and update the user-supplied table_schema so that all column modes (REQUIRED/NULLABLE) match those of the existing table.

The commit that fixed that issue applies this policy to all write dispositions, but it is only applicable when appending. IMO, when replacing an entire table, the user should be able to replace the datatypes as well. This could be resolved by adding a guard such as

if write_disposition == "WRITE_APPEND":

before applying the table_schema update from the existing table.
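A rough sketch of what that could look like around the schema update (variable names and the exact call are approximated from the code near the linked line, not a verbatim patch):

# Sketch only; argument names/order of update_schema are approximate.
if write_disposition == "WRITE_APPEND":
    # Keep the #315 behavior for appends: reconcile the user-supplied schema
    # with the modes/types of the live table.
    table_schema = pandas_gbq.schema.update_schema(table_schema, original_schema)
# For WRITE_TRUNCATE (if_exists="replace"), leave the user-supplied
# table_schema untouched so a column's datatype can actually be changed.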

Hi @Imperssonator, thank you for reporting and digging into the issue! I see that you have created a commit to resolve this problem, and I think it's a pretty valid solution. Would you like to create a PR with it? If yes, we generally want new code to also include at least a unit test. Do you feel comfortable adding a unit test to the PR? (If not, that is fine; we can work together to flesh out the PR fully.)

@Linchin I'm happy to take a stab at that, although I can't say when I would have time to do so - hopefully in the next month or two. If you need this integrated more quickly, probably not best to wait on me. I'm operating on a fork so my issue is alleviated at least for now, but would like to get back to the main branch.

@Imperssonator I see, thank you for letting me know! I will be working on this then since it seems many people are affected by it.

It's strange that this was not caught by system tests.
Edit: it's because the system test didn't use table_schema as a parameter, see here.
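A system test along these lines would exercise the failing path (sketch only; the fixture names here are illustrative and may not match the existing test suite):

import pandas as pd
import pandas_gbq


def test_to_gbq_replace_changes_column_type(project_id, random_dataset_id):
    # Illustrative fixtures: an authorized project and a throwaway dataset.
    table = f"{random_dataset_id}.replace_type_change"

    # First write: col_1 is STRING.
    pandas_gbq.to_gbq(
        pd.DataFrame({"col_1": ["1", "2", "3"]}),
        table,
        project_id=project_id,
        table_schema=[{"name": "col_1", "type": "STRING"}],
        if_exists="replace",
    )

    # Second write: same column, now INT64, with an explicit table_schema.
    pandas_gbq.to_gbq(
        pd.DataFrame({"col_1": pd.array([1, 2, 3], dtype="Int64")}),
        table,
        project_id=project_id,
        table_schema=[{"name": "col_1", "type": "INT64"}],
        if_exists="replace",
    )

    result = pandas_gbq.read_gbq(
        f"SELECT col_1 FROM `{table}` ORDER BY col_1",
        project_id=project_id,
    )
    assert list(result["col_1"]) == [1, 2, 3]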