Pandas should get the schema from bigquery if pushing to a table that already exists
ShantanuKumar opened this issue · 6 comments
Right now, when pushing new data to an already existing table using to_gbq with if_exists="append" but no explicit table_schema, pandas generates a default table schema in which the mode of every column (which takes the value REQUIRED or NULLABLE) is always NULLABLE.
It would make sense for pandas to fetch the existing table's schema and apply it in the if_exists="append" case, instead of always passing NULLABLE mode.
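In the meantime, a rough workaround sketch, assuming the google-cloud-bigquery client is available and using placeholder PROJECT_ID / DATASET_ID / my_table names: fetch the existing schema yourself and pass it explicitly via table_schema so the REQUIRED modes are preserved.

from google.cloud import bigquery

client = bigquery.Client(project=PROJECT_ID)
# Read the schema of the existing destination table.
table = client.get_table(f"{PROJECT_ID}.{DATASET_ID}.my_table")
table_schema = [
    # Carry over name, type, and (crucially) mode for each field.
    {"name": field.name, "type": field.field_type, "mode": field.mode}
    for field in table.schema
]

df.to_gbq(
    destination_table=f"{DATASET_ID}.my_table",
    project_id=PROJECT_ID,
    if_exists="append",
    table_schema=table_schema,
)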
pandas-gbq already calls get_table to check if a table exists, so I think this could be a matter of plumbing the right information through.
Note: you'll need to filter out any columns that aren't in the DataFrame.
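A minimal sketch of that plumbing, assuming the remote schema is a dict in the usual {"fields": [...]} shape (the helper name is hypothetical):

def schema_for_append(remote_schema, dataframe):
    # Index the remote fields by column name, keeping their mode.
    remote_fields = {field["name"]: field for field in remote_schema["fields"]}
    # Filter out any columns that aren't in the DataFrame.
    return {
        "fields": [
            remote_fields[col]
            for col in dataframe.columns
            if col in remote_fields
        ]
    }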
@ShantanuKumar Would you mind sharing an example of when the default schema gets it wrong? That will be useful for testing.
Imagine there is a table called event_data with this schema:
[
{
"name": "event_ts",
"type": "TIMESTAMP",
"mode": "REQUIRED"
},
{
"name": "event_type",
"type": "STRING",
"mode": "NULLABLE"
}
]
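For reference, a table with that schema could be created with the google-cloud-bigquery client roughly like this (PROJECT_ID and DATASET_ID are placeholders):

from google.cloud import bigquery

client = bigquery.Client(project=PROJECT_ID)
schema = [
    bigquery.SchemaField("event_ts", "TIMESTAMP", mode="REQUIRED"),
    bigquery.SchemaField("event_type", "STRING", mode="NULLABLE"),
]
table = bigquery.Table(f"{PROJECT_ID}.{DATASET_ID}.event_data", schema=schema)
client.create_table(table)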
Now, suppose we have a dataframe:

import pandas as pd

df = pd.DataFrame({
    "event_ts": ["2020-03-03 01:00:00", "2020-03-03 02:00:00"],
    "event_type": ["buy", "sell"],
})
When I push this data to the table event_data, using

df.to_gbq(
    destination_table=f"{DATASET_ID}.event_data",  # "dataset.table" form; DATASET_ID is a placeholder
    project_id=PROJECT_ID,
    if_exists="append",
)
I get this error:
pandas_gbq.gbq.InvalidSchema: Please verify that the structure and data types in the DataFrame match the schema of the destination table.
This is happening because of the REQUIRED mode on event_ts, which pandas isn't obeying: it pushes its own schema in which the mode of event_ts is NULLABLE.
@tswast What's the reason behind ignoring the mode when comparing schemas?
https://github.com/pydata/pandas-gbq/blob/master/pandas_gbq/gbq.py#L646
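For context, my understanding is that the comparison there works roughly like the sketch below, where fields are stripped down to name and type before comparing (helper names are illustrative, not the exact internals):

def _clean_schema_fields(fields):
    # Keep only name and type, so mode (and description) never
    # participate in the comparison.
    return sorted(
        [{"name": field["name"], "type": field["type"]} for field in fields],
        key=lambda field: field["name"],
    )

def schema_is_subset(schema_remote, schema_local):
    remote = _clean_schema_fields(schema_remote["fields"])
    local = _clean_schema_fields(schema_local["fields"])
    return all(field in remote for field in local)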
The issue I have right now is a mismatching mode when appending data to an existing table. pandas-gbq by default always pushes NULLABLE mode, which creates an issue if the existing table has a column with REQUIRED mode.
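To illustrate, the schema pandas-gbq generates by default carries no mode key at all, and BigQuery treats a field without a mode as NULLABLE. A sketch, assuming the internal generate_bq_schema helper is importable:

import pandas as pd
from pandas_gbq import schema

df = pd.DataFrame({
    "event_ts": pd.to_datetime(["2020-03-03 01:00:00"]),
    "event_type": ["buy"],
})
# Prints fields like {"name": "event_ts", "type": "TIMESTAMP"} with no
# "mode" key, so an existing REQUIRED column cannot be matched.
print(schema.generate_bq_schema(df))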
Also, I think this issue should be labeled as a bug.
I made a test corresponding to the example you provided in #315 (comment), but it still fails due to different types. It is expected that the pandas.Timestamp dtype is used for uploading to TIMESTAMP columns. Please open a separate feature request if the different types are a problem for you.
Let's use this issue to track the problem of different modes (REQUIRED vs. NULLABLE).
# Assumes the pandas-gbq system-test fixtures gbq_table and gbq_connector.
from pandas import DataFrame
from pandas_gbq import gbq


def test_to_gbq_does_not_override_type(gbq_table, gbq_connector):
    table_id = "test_to_gbq_does_not_override_type"
    table_schema = {
        "fields": [
            {
                "name": "event_ts",
                "type": "TIMESTAMP",
            },
            {
                "name": "event_type",
                "type": "STRING",
            },
        ]
    }
    df = DataFrame({
        "event_ts": ["2020-03-03 01:00:00", "2020-03-03 02:00:00"],
        "event_type": ["buy", "sell"],
    })

    gbq_table.create(table_id, table_schema)
    gbq.to_gbq(
        df,
        "{0}.{1}".format(gbq_table.dataset_id, table_id),
        project_id=gbq_connector.project_id,
        if_exists="append",
    )

    actual = gbq_table.schema(table_id)
    assert table_schema["fields"] == actual
It actually works for TIMESTAMP too! We just needed to cast the string timestamps to datetime:
import pandas
from pandas import DataFrame
from pandas_gbq import gbq


def test_to_gbq_does_not_override_type(gbq_table, gbq_connector):
    table_id = "test_to_gbq_does_not_override_type"
    table_schema = {
        "fields": [
            {
                "name": "event_ts",
                "type": "TIMESTAMP",
                "mode": "REQUIRED",
                "description": "event_ts",
            },
            {
                "name": "event_type",
                "type": "STRING",
                "mode": "NULLABLE",
                "description": "event_type",
            },
        ]
    }
    df = DataFrame({
        # Cast the string timestamps to datetime so they load as TIMESTAMP.
        "event_ts": [
            pandas.to_datetime("2020-03-03 01:00:00"),
            pandas.to_datetime("2020-03-03 02:00:00"),
        ],
        "event_type": ["buy", "sell"],
    })

    gbq_table.create(table_id, table_schema)
    gbq.to_gbq(
        df,
        "{0}.{1}".format(gbq_table.dataset_id, table_id),
        project_id=gbq_connector.project_id,
        if_exists="append",
    )

    actual = gbq_table.schema(table_id)
    assert table_schema["fields"] == actual