GoogleCloudPlatform/DataflowPythonSDK

null values in BigQuery

tuanavu opened this issue · 2 comments

The issue I have is that when I read data from GBQ containing null values, mapping any function over the column with null values raises errors (a minimal repro follows the example below).

When I write the input_data from GBQ to text, the JSON output file does not include keys whose values are null. I believe this is an issue that needs to be fixed.

For example:

  • Input:
key_1,key_2,key_3
value_1,,value_3
  • Expected output:
{"key_1":"value_1","key_2":null,"key_3":"value_3"}
  • Output from Dataflow:
{"key_1":"value_1","key_3":"value_3"}

Ack. Investigating ...

For now, there is not much we can do at the Dataflow level. As you pointed out, the JSON coming out of BigQuery does not contain the null-valued fields. This will be improved (though not in the next immediate release) if we switch to Avro as an intermediary format for exports. In the meantime, you can insert a one-liner function to "clean up" the data by adding the missing nullable fields. See the example below:

def add_null_field(row, field):
  # Re-insert the field with None if it was dropped from the exported JSON.
  row[field] = row.get(field, None)
  return row

(p
  | df.io.Read(df.io.BigQuerySource('PROJECT:DATASET.TABLE'))
  | df.Map(add_null_field, field='value')
  | df.io.Write(df.io.TextFileSink('gs://BUCKET/FILES')))
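If several columns are nullable, the same idea extends to a list of fields; a sketch along those lines (the helper name and the fields tuple are illustrative, not part of the SDK):

def add_null_fields(row, fields):
  # Default every missing nullable column to None in one pass.
  for field in fields:
    row.setdefault(field, None)
  return row

# Used in the pipeline as, e.g.:
#   | df.Map(add_null_fields, fields=('key_1', 'key_2', 'key_3'))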

Hope this helps.