GoogleCloudPlatform/DataflowPythonSDK

Handling null values for BigQuery IO

1byxero opened this issue · 5 comments

I have a script which fetches data from Google Datastore and converts each retrieved entity into a Python dictionary, to be inserted as a TableRow into a Google BigQuery table.

My issue is about handling None/null values in the TableRow dictionary.

The retrieved Datastore entities may or may not have certain attributes required by the BigQuery table schema. For such attributes I am creating a None value for the corresponding dictionary key, which results in a list of the following errors:

<InsertErrorsValueListEntry
 errors: [<ErrorProto
 debugInfo: u'generic::not_found: no such field.'
 location: u'foo_bar'
 message: u'no such field.'
 reason: u'invalid'>]
 index: 0>

How do I go about this?

Basically, my intention for this Dataflow pipeline is as follows:

1. Fetch data from Google datastore
2. Create a one to one mapping of datastore entities to BigQuery TableRows
3. Push the rows into BigQuery

This is how I have created the pipeline:

(p 
     | 'read from datastore' >> ReadFromDatastore(project, 
                      query, user_options.namespace)
     | 'convert to table rows' >> beam.Map(convert_to_table_row)
     | 'write to bq' >> beam.io.Write(
                beam.io.BigQuerySink('%s:%s.%s' % (project, dataset, table),
                schema=table_schema,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)))

If I replace the Map function with a custom ParDo function and try to return a JSON string from the ParDo transformation, it throws an error about returning strings.
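(The BigQuery sink consumes row dictionaries keyed by column name, not JSON strings, so a ParDo would need to yield dicts, parsing any JSON it builds first. A minimal sketch of that per-element conversion outside of Beam; `json_to_row` is a hypothetical helper name:)

```python
import json

def json_to_row(entity_json):
    """Convert a JSON-encoded entity back into a row dict.

    BigQuerySink expects each element to be a dict keyed by column
    name; yielding the raw JSON string instead would be rejected.
    """
    return json.loads(entity_json)
```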

Please see #17; it is likely related.

cc: @chamikaramj

#17 is about reading records involving null values;

I am interested in writing records which have null values.

Could you mention what runner and SDK version you are using?

If you are using DataflowRunner, could you provide a job ID?

'None' values should be written correctly. Can you verify that the column to which you write 'None' values is 'NULLABLE', and that 'convert_to_table_row' generates dictionaries that conform to the schema of the table?

Found the issue.
The dictionaries should conform to the schema and should have None for the absent attributes.
That was the mistake I was making: I wasn't creating entries with None values.
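For reference, a minimal sketch of that fix, assuming `SCHEMA_FIELDS` mirrors the column names declared in `table_schema` (the field names below are hypothetical):

```python
# Hypothetical column names; the real list should mirror table_schema.
SCHEMA_FIELDS = ["user_id", "foo_bar", "created_at"]

def convert_to_table_row(entity_dict):
    """Map a Datastore entity (as a dict) to a BigQuery row dict.

    Every schema field gets an entry; attributes absent from the
    entity become None, so NULLABLE columns receive NULL instead of
    the insert failing with a schema mismatch.
    """
    return {field: entity_dict.get(field) for field in SCHEMA_FIELDS}
```

With this, an entity missing `foo_bar` still produces a row dict whose `foo_bar` key is present with a None value.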

Thanks for your help!