attempting to change existing datatype on BQ table through `table_schema` and `if_exists="replace"` fails
Imperssonator opened this issue · 7 comments
Hello - I'm running into an issue that is proving very difficult to resolve. I've searched extensively on Stack Overflow, etc., and am now running into this in multiple instances.
When I have a table in BigQuery with an existing schema, if I upload new data from pandas with a `table_schema` that attempts to change one of the datatypes of the existing schema (let's say from STRING to INTEGER), with `if_exists="replace"`, it will fail with the following error (see stack trace below, omitting the calling function from our internal lib):

`Expected a string or bytes dtype, got int64`

But the `table_schema` field for this column, let's call it `column_1`, is:

`{'name': 'column_1', 'type': 'INT64'}`

and `df['column_1'].dtype` yields `Int64`.
Two interesting facts that make this harder to figure out:
- I was previously running this on pandas 1.4.2 / pandas-gbq 0.17.2 and it was not an issue; we would switch datatypes in BQ all the time.
- When I omit `column_1` from the upload, it works; it replaces the table and overwrites the schema (omitting `column_1`)... and then, when I try to overwrite again with the full dataframe (including `column_1`), that works too, and it applies the INTEGER datatype in BigQuery. (Note it does not apply INT64 even though I specify that in the `table_schema`.) A rough sketch of this two-step workaround is included below.

This suggests to me that it's not an issue with the underlying datatypes in the dataframe or the `table_schema`, because uploading to a fresh BQ table works. For some reason, pandas-gbq is having a hard time changing existing datatypes in BQ schemas even in full "replace" mode. I will note that I've been running into many similar issues with DATE/TIME types since upgrading as well, so it feels unlikely to me that this is isolated to just INTs.
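For reference, a rough sketch of that two-step workaround, with hypothetical project/table names standing in for our real ones (the real call also passes `location` and a fuller `table_schema`):

```python
import pandas as pd

# Illustrative stand-ins; in the real case these come from our pipeline.
project = "my-project"            # hypothetical project id
dest = "my_dataset.my_table"      # hypothetical destination table
df = pd.DataFrame({
    "column_1": pd.array([1, 2, 3], dtype="Int64"),
    "other_col": ["a", "b", "c"],
})
schema = [
    {"name": "column_1", "type": "INT64"},
    {"name": "other_col", "type": "STRING"},
]

# Step 1: replace the table *without* the offending column; this succeeds
# and removes the old STRING definition of column_1 from the BQ schema.
df.drop(columns=["column_1"]).to_gbq(dest, project_id=project, if_exists="replace")

# Step 2: replace again with the full dataframe; with no existing STRING
# field left to clash against, the Int64 column uploads and BQ shows INTEGER.
df.to_gbq(dest, project_id=project, table_schema=schema, if_exists="replace")
```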
Environment details
- OS type and version: macOS Ventura 13.3.1
- Python version (`python --version`): 3.11.4
- pip version (`pip --version`): 23.2.1
Running on a conda env with:
```
pandas                         2.1.0    pypi_0  pypi
pandas-gbq                     0.19.2   pypi_0  pypi
pyarrow                        13.0.0   pypi_0  pypi
google-api-core                2.11.1   pypi_0  pypi
google-api-python-client       2.98.0   pypi_0  pypi
google-auth                    2.22.0   pypi_0  pypi
google-auth-httplib2           0.1.0    pypi_0  pypi
google-auth-oauthlib           1.0.0    pypi_0  pypi
google-cloud-bigquery          3.11.4   pypi_0  pypi
google-cloud-bigquery-storage  2.22.0   pypi_0  pypi
google-cloud-core              2.3.3    pypi_0  pypi
google-cloud-storage           2.10.0   pypi_0  pypi
google-crc32c                  1.5.0    pypi_0  pypi
google-resumable-media         2.6.0    pypi_0  pypi
googleapis-common-protos       1.60.0   pypi_0  pypi
pydata-google-auth             1.8.2    pypi_0  pypi
```
Steps to reproduce
- Have an existing BigQuery table populated with data and a field (let's call it `column_1`) of datatype STRING
- Attempt to upload a new DataFrame (with `if_exists="replace"`) to that table where `column_1` has pandas dtype `Int64` and `table_schema` specifies `'type': 'INT64'`
- The error `pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64` will be thrown
Code example
Can't provide the entire code sample, but the final upload statement looks like:
```python
df_bq.to_gbq(
    destination_table=dest_bq_dataset_table,
    project_id=dest_project_id,
    table_schema=table_schema_bq,
    if_exists="replace",
    location=dest_bq_location
)
```
- `df_bq` has a column with `Int64` datatype
- `table_schema_bq` contains `{'name': 'column_1', 'type': 'INT64'}`
- The actual schema on `dest_bq_dataset_table` has STRING for `column_1`
EDIT: I wrote a really simple test case that yields the same behavior:
```python
import pandas as pd

proj_id = ""  # Redacted
ds_id = "test_dataset"
table_id = "test_table_1"

df = pd.DataFrame({
    'col_1': ["1", "2", "3"]
})
print(df)
print(df.dtypes)

table_schema_str = [{
    'name': 'col_1', 'type': 'STRING'
}]

df.to_gbq(
    f"{ds_id}.{table_id}",
    project_id=proj_id,
    table_schema=table_schema_str,
    if_exists="replace"
)

df_int = df.copy()
df_int['col_1'] = df_int['col_1'].astype('Int64')
print(df_int)
print(df_int.dtypes)

table_schema_int = [{
    'name': 'col_1', 'type': 'INT64'
}]

df_int.to_gbq(
    f"{ds_id}.{table_id}",
    project_id=proj_id,
    table_schema=table_schema_int,
    if_exists="replace"
)
```
The first `df.to_gbq` will work, then `df_int.to_gbq` throws the same error as above.
- This fails with any combination of `"INT64"` / `"INTEGER"` in `table_schema` and `astype("Int64")` / `astype(int)` in the dataframe
Stack trace
```
  File "/Users/nils/anaconda3/envs/streamlit/lib/python3.11/site-packages/pandas/core/frame.py", line 2161, in to_gbq
    gbq.to_gbq(
  File "/Users/nils/anaconda3/envs/streamlit/lib/python3.11/site-packages/pandas/io/gbq.py", line 223, in to_gbq
    pandas_gbq.to_gbq(
  File "/Users/nils/anaconda3/envs/streamlit/lib/python3.11/site-packages/pandas_gbq/gbq.py", line 1220, in to_gbq
    connector.load_data(
  File "/Users/nils/anaconda3/envs/streamlit/lib/python3.11/site-packages/pandas_gbq/gbq.py", line 602, in load_data
    chunks = load.load_chunks(
             ^^^^^^^^^^^^^^^^^
  File "/Users/nils/anaconda3/envs/streamlit/lib/python3.11/site-packages/pandas_gbq/load.py", line 243, in load_chunks
    load_parquet(
  File "/Users/nils/anaconda3/envs/streamlit/lib/python3.11/site-packages/pandas_gbq/load.py", line 131, in load_parquet
    client.load_table_from_dataframe(
  File "/Users/nils/anaconda3/envs/streamlit/lib/python3.11/site-packages/google/cloud/bigquery/client.py", line 2702, in load_table_from_dataframe
    _pandas_helpers.dataframe_to_parquet(
  File "/Users/nils/anaconda3/envs/streamlit/lib/python3.11/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 719, in dataframe_to_parquet
    arrow_table = dataframe_to_arrow(dataframe, bq_schema)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nils/anaconda3/envs/streamlit/lib/python3.11/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 662, in dataframe_to_arrow
    bq_to_arrow_array(get_column_or_index(dataframe, bq_field.name), bq_field)
  File "/Users/nils/anaconda3/envs/streamlit/lib/python3.11/site-packages/google/cloud/bigquery/_pandas_helpers.py", line 380, in bq_to_arrow_array
    return pyarrow.Array.from_pandas(series, type=arrow_type)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/array.pxi", line 1099, in pyarrow.lib.Array.from_pandas
  File "pyarrow/array.pxi", line 275, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
  File "/Users/nils/anaconda3/envs/streamlit/lib/python3.11/site-packages/pandas/core/arrays/masked.py", line 662, in __arrow_array__
    return pa.array(self._data, mask=self._mask, type=type)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/array.pxi", line 323, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 123, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
```
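If it helps, the bottom of that trace appears to reduce to the following (just a guess at the minimal failure, assuming the loader requests an Arrow string type because the destination table's existing schema still lists the column as STRING):

```python
import pandas as pd
import pyarrow as pa

# Nullable Int64 column, as in the failing upload.
s = pd.Series([1, 2, 3], dtype="Int64")

# The stale BQ schema says STRING, so the client asks Arrow for a string
# type; converting the underlying int64 buffer raises the same error:
# pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
pa.Array.from_pandas(s, type=pa.string())
```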
+1
I believe this is due to an extra call to `update_schema` that was inserted in response to an issue with NULLABLE/REQUIRED fields.
It appears that pandas-gbq now reads the existing schema from the destination table and overwrites the fields/datatypes of the supplied `table_schema` with whatever is already in BQ. This is, IMO, not expected behavior for the case where you're replacing an entire table; it almost defeats the purpose of having a user-supplied `table_schema` to begin with.
Further digging revealed a previous issue (#315) noting that when using `if_exists="append"`, pandas-gbq should retrieve the existing schema from BQ and update the user-supplied `table_schema` so that all column modes (REQUIRED/NULLABLE) match those of the existing table.
The commit that fixed that issue applies this policy to all write dispositions, but it is only applicable when appending. IMO, when replacing an entire table, the user should be able to replace the datatypes as well. This could be resolved by adding `if write_disposition == "WRITE_APPEND":` prior to the `table_schema` update from the existing table.
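To illustrate the intent, here is a standalone sketch of that policy (not the actual pandas-gbq internals; the function and variable names are made up):

```python
def reconcile_schema(user_schema, existing_schema, write_disposition):
    """Pick the schema to send to BigQuery.

    Both schemas are lists of dicts like
    {'name': 'col_1', 'type': 'STRING', 'mode': 'NULLABLE'}.
    """
    if write_disposition != "WRITE_APPEND":
        # Replacing the table: trust the caller's table_schema, including
        # any datatype changes relative to the existing table.
        return user_schema

    # Appending: pull mode/type from the existing table so REQUIRED columns
    # stay REQUIRED, which is what the fix for #315 was after.
    existing = {field["name"]: field for field in existing_schema}
    return [{**field, **existing.get(field["name"], {})} for field in user_schema]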
Hi @Imperssonator, thank you for reporting and digging into the issue! I see that you have created a commit to resolve this problem, and I think it's a pretty valid solution. Would you like to create a PR with it? If yes, generally we want new code to also include at least a unit test. Do you feel comfortable adding a unit test to the PR? (If not, that is fine, we can work together to flesh out the PR fully.)
@Linchin I'm happy to take a stab at that, although I can't say when I would have time to do so - hopefully in the next month or two. If you need this integrated more quickly, probably not best to wait on me. I'm operating on a fork so my issue is alleviated at least for now, but would like to get back to the main branch.
@Imperssonator I see, thank you for letting me know! I will be working on this then since it seems many people are affected by it.