Pandas should get the schema from bigquery if pushing to a table that already exists
ShantanuKumar opened this issue · 6 comments
Right now, when pushing new data to an already existing table using to_gbq with if_exists="append" but no explicit table_schema, pandas generates a default table schema in which the mode of every column (which takes the value REQUIRED or NULLABLE) is always NULLABLE.
It would make sense for pandas to fetch the existing table's schema and apply it in the if_exists="append" case, instead of always passing NULLABLE mode.
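In the meantime, a rough workaround sketch, assuming the google-cloud-bigquery client is available and using placeholder PROJECT_ID / DATASET_ID / my_table names: fetch the existing schema yourself and pass it explicitly via table_schema so the REQUIRED modes are preserved.

from google.cloud import bigquery

client = bigquery.Client(project=PROJECT_ID)
# Read the schema of the existing destination table.
table = client.get_table(f"{PROJECT_ID}.{DATASET_ID}.my_table")
table_schema = [
    # Carry over name, type, and (crucially) mode for each field.
    {"name": field.name, "type": field.field_type, "mode": field.mode}
    for field in table.schema
]

df.to_gbq(
    destination_table=f"{DATASET_ID}.my_table",
    project_id=PROJECT_ID,
    if_exists="append",
    table_schema=table_schema,
)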
pandas-gbq already calls get_table to check if a table exists, so I think this could be a matter of plumbing the right information through.
Note: you'll need to filter out any columns that aren't in the DataFrame.
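A minimal sketch of that plumbing, assuming the remote schema is a dict in the usual {"fields": [...]} shape (the helper name is hypothetical):

def schema_for_append(remote_schema, dataframe):
    # Index the remote fields by column name, keeping their mode.
    remote_fields = {field["name"]: field for field in remote_schema["fields"]}
    # Filter out any columns that aren't in the DataFrame.
    return {
        "fields": [
            remote_fields[col]
            for col in dataframe.columns
            if col in remote_fields
        ]
    }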
@ShantanuKumar Would you mind sharing an example of when the default schema gets it wrong? That will be useful for testing.
Imagine there is a table called event_data with this schema:
[
{
"name": "event_ts",
"type": "TIMESTAMP",
"mode": "REQUIRED"
},
{
"name": "event_type",
"type": "STRING",
"mode": "NULLABLE"
}
]
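For reference, a table with that schema could be created with the google-cloud-bigquery client roughly like this (PROJECT_ID and DATASET_ID are placeholders):

from google.cloud import bigquery

client = bigquery.Client(project=PROJECT_ID)
schema = [
    bigquery.SchemaField("event_ts", "TIMESTAMP", mode="REQUIRED"),
    bigquery.SchemaField("event_type", "STRING", mode="NULLABLE"),
]
table = bigquery.Table(f"{PROJECT_ID}.{DATASET_ID}.event_data", schema=schema)
client.create_table(table)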
Now, suppose we have a dataframe:

import pandas as pd

df = pd.DataFrame({
    "event_ts": ["2020-03-03 01:00:00", "2020-03-03 02:00:00"],
    "event_type": ["buy", "sell"],
})
When I push this data to the table event_data, using

df.to_gbq(
    destination_table=f"{DATASET_ID}.event_data",  # "dataset.table" form; DATASET_ID is a placeholder
    project_id=PROJECT_ID,
    if_exists="append",
)
I get this error:
pandas_gbq.gbq.InvalidSchema: Please verify that the structure and data types in the DataFrame match the schema of the destination table.
This is happening because of the REQUIRED mode on event_ts, which pandas isn't obeying: it pushes its own schema in which the mode of event_ts is NULLABLE.
@tswast What's the reason behind ignoring the mode when comparing schemas?
https://github.com/pydata/pandas-gbq/blob/master/pandas_gbq/gbq.py#L646
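For context, my understanding is that the comparison there works roughly like the sketch below, where fields are stripped down to name and type before comparing (helper names are illustrative, not the exact internals):

def _clean_schema_fields(fields):
    # Keep only name and type, so mode (and description) never
    # participate in the comparison.
    return sorted(
        [{"name": field["name"], "type": field["type"]} for field in fields],
        key=lambda field: field["name"],
    )

def schema_is_subset(schema_remote, schema_local):
    remote = _clean_schema_fields(schema_remote["fields"])
    local = _clean_schema_fields(schema_local["fields"])
    return all(field in remote for field in local)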
The issue I have right now is a mismatching mode when appending data to an existing table. pandas-gbq by default always pushes NULLABLE mode, which creates an issue if the existing table has a column with REQUIRED mode.
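To illustrate, the schema pandas-gbq generates by default carries no mode key at all, and BigQuery treats a field without a mode as NULLABLE. A sketch, assuming the internal generate_bq_schema helper is importable:

import pandas as pd
from pandas_gbq import schema

df = pd.DataFrame({
    "event_ts": pd.to_datetime(["2020-03-03 01:00:00"]),
    "event_type": ["buy"],
})
# Prints fields like {"name": "event_ts", "type": "TIMESTAMP"} with no
# "mode" key, so an existing REQUIRED column cannot be matched.
print(schema.generate_bq_schema(df))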
Also, I think this issue should be labeled as a bug.
I made a test corresponding to the example you provided in #315 (comment), but it still fails due to different types. It is expected that the pandas.Timestamp dtype is used for uploading to TIMESTAMP columns. Please open a separate feature request if the different types are a problem for you.
Let's use this issue to track the problem of different modes (REQUIRED vs. NULLABLE).
# Assumes the pandas-gbq system-test fixtures gbq_table and gbq_connector.
from pandas import DataFrame
from pandas_gbq import gbq


def test_to_gbq_does_not_override_type(gbq_table, gbq_connector):
    table_id = "test_to_gbq_does_not_override_type"
    table_schema = {
        "fields": [
            {
                "name": "event_ts",
                "type": "TIMESTAMP",
            },
            {
                "name": "event_type",
                "type": "STRING",
            },
        ]
    }
    df = DataFrame({
        "event_ts": ["2020-03-03 01:00:00", "2020-03-03 02:00:00"],
        "event_type": ["buy", "sell"],
    })

    gbq_table.create(table_id, table_schema)
    gbq.to_gbq(
        df,
        "{0}.{1}".format(gbq_table.dataset_id, table_id),
        project_id=gbq_connector.project_id,
        if_exists="append",
    )

    actual = gbq_table.schema(table_id)
    assert table_schema["fields"] == actual
It actually works for TIMESTAMP too! We just needed to cast the string timestamps to datetime:
import pandas
from pandas import DataFrame
from pandas_gbq import gbq


def test_to_gbq_does_not_override_type(gbq_table, gbq_connector):
    table_id = "test_to_gbq_does_not_override_type"
    table_schema = {
        "fields": [
            {
                "name": "event_ts",
                "type": "TIMESTAMP",
                "mode": "REQUIRED",
                "description": "event_ts",
            },
            {
                "name": "event_type",
                "type": "STRING",
                "mode": "NULLABLE",
                "description": "event_type",
            },
        ]
    }
    df = DataFrame({
        # Cast the string timestamps to datetime so they load as TIMESTAMP.
        "event_ts": [
            pandas.to_datetime("2020-03-03 01:00:00"),
            pandas.to_datetime("2020-03-03 02:00:00"),
        ],
        "event_type": ["buy", "sell"],
    })

    gbq_table.create(table_id, table_schema)
    gbq.to_gbq(
        df,
        "{0}.{1}".format(gbq_table.dataset_id, table_id),
        project_id=gbq_connector.project_id,
        if_exists="append",
    )

    actual = gbq_table.schema(table_id)
    assert table_schema["fields"] == actual