GoogleCloudPlatform/DataflowPythonSDK

null values in BigQuery

tuanavu opened this issue · 2 comments

The issue I have is that when I read data from GBQ containing null values, mapping any function over the column with null values raises errors (a minimal repro follows the example below).

When I write the input_data from GBQ to text, the JSON output file does not include keys whose values are null. I believe this is an issue that needs to be fixed.

For example:

  • Input:
key_1,key_2,key_3
value_1,,value_3
  • Expected output:
{"key_1":"value_1","key_2":null,"key_3":"value_3"}
  • Output from Dataflow:
{"key_1":"value_1","key_3":"value_3"}

Ack. Investigating ...

For now, there is not much we can do at the Dataflow level. As you pointed out, the JSON coming out of BigQuery does not contain the null-valued fields. This will be improved (though not in the next immediate release) if we switch to Avro as an intermediary format for exports. In the meantime, you can insert a one-liner function to "clean up" the data by adding the missing nullable fields. See the example below:

def add_null_field(row, field):
  # Re-insert the field with None if it was dropped from the exported JSON.
  row[field] = row.get(field, None)
  return row

(p
  | df.io.Read(df.io.BigQuerySource('PROJECT:DATASET.TABLE'))
  | df.Map(add_null_field, field='value')
  | df.io.Write(df.io.TextFileSink('gs://BUCKET/FILES')))
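If several columns are nullable, the same idea extends to a list of fields; a sketch along those lines (the helper name and the fields tuple are illustrative, not part of the SDK):

def add_null_fields(row, fields):
  # Default every missing nullable column to None in one pass.
  for field in fields:
    row.setdefault(field, None)
  return row

# Used in the pipeline as, e.g.:
#   | df.Map(add_null_fields, fields=('key_1', 'key_2', 'key_3'))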

Hope this helps.