bigquery load_table_from_dataframe from string type will show null values
superbeer opened this issue · 3 comments
Environment details
- OS type and version: macOS
- Python version: 3.11.6 (`python --version`)
- pip version: 23.2.1 (`pip --version`)
- google-cloud-bigquery version: 3.13.0
Steps to reproduce
Code example
from google.cloud import bigquery
import pandas as pd

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
table_id = "{project_id}.{dataset_id}.{table_id}"

df = pd.read_csv("data.csv")
print(df.dtypes)
df = df.astype(str)
print(df.dtypes)
print(df)

job_config = bigquery.LoadJobConfig(
    write_disposition="WRITE_TRUNCATE",
)

job = client.load_table_from_dataframe(
    df, table_id, job_config=job_config
)  # Make an API request.
job.result()  # Wait for the job to complete.
Example data (data.csv)
# example
col_a,col_b,col_c
a,,
aa,bb,cc
aaa,,ccc
pandas prints:
col_a col_b col_c
0 a nan nan
1 aa bb cc
2 aaa nan ccc
In BigQuery the table shows:
col_a col_b col_c
a nan nan
aa bb cc
aaa nan ccc
I expected the data in BigQuery to show:
col_a col_b col_c
a null null
aa bb cc
aaa null ccc
PS: if the column dtype is object and the missing values are NaN (i.e. without calling astype(str)), the load works and BigQuery stores NULL.
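To illustrate the behavior described above: this is a minimal sketch (using only pandas, no BigQuery call) showing that `astype(str)` converts a NaN missing value into the literal four-character string "nan", whereas the original object-dtype column keeps a real NaN.

```python
import numpy as np
import pandas as pd

# An object-dtype column with a missing value,
# like what pd.read_csv produces for an empty cell
s = pd.Series(["aa", np.nan])
print(s.dtype)  # object; a NaN here is treated as a missing value on load

# astype(str) converts the NaN into the literal string "nan"
s_str = s.astype(str)
print(s_str.tolist())
```

With the object-dtype series, `load_table_from_dataframe` can map NaN to NULL; after `astype(str)` there is no missing value left, only the string "nan".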
Hi @superbeer, thank you for raising the issue. Could you clarify which value you are expecting - NaN or null?
In BigQuery I am expecting null values.
I think this is a problem with pandas.read_csv(). The dataframe loaded by pandas is already using NaN for the empty strings, and the subsequent astype(str) turns those NaN values into the literal string "nan", so BigQuery is working as intended. Many people run into the same problem when loading null strings from CSV, and adding a parameter can resolve it, like this: pandas.read_csv("data.csv", keep_default_na=False) (see Stack Overflow).
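A minimal sketch of the suggested fix, reading the CSV from an in-memory string for illustration (the file path from the report would work the same way):

```python
import io
import pandas as pd

csv_data = "col_a,col_b,col_c\na,,\naa,bb,cc\naaa,,ccc\n"

# Default behavior: empty cells become NaN,
# which astype(str) would later turn into the string "nan"
df_default = pd.read_csv(io.StringIO(csv_data))
print(df_default)

# With keep_default_na=False, empty cells stay as empty strings
df_fixed = pd.read_csv(io.StringIO(csv_data), keep_default_na=False)
print(df_fixed)
```

Note that with keep_default_na=False the empty cells arrive as "" (empty strings), not NULL; if true NULLs are wanted in BigQuery, keeping the NaN values and skipping astype(str) also works, as noted above.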
I'm closing the issue, but feel free to leave a comment or open a new issue, if you have any further questions. :)