GoogleCloudPlatform/DataflowPythonSDK

ValueError: Query cannot have any inequality filters.

1byxero opened this issue · 5 comments

The above error is raised when the following query filter is used:

datastore_helper.set_property_filter(query.filter, 'foo_attribute', PropertyFilter.GREATER_THAN_OR_EQUAL, 'bar_value')

After googling the issue, I found that the Java SDK has the same problem.
My questions are:
1. Is there any workaround for this?
2. If my usage is wrong, what would be the right way?

cph6 commented

This is an underlying Datastore limitation, AIUI. For Java it appears in the client library. It has to do with how the scatter property (https://github.com/GoogleCloudPlatform/appengine-mapreduce/wiki/ScatterPropertyImplementation) is used to calculate sharding.

Workaround: set num_splits=1 so it does not try to shard the query. That works okay if the amount of data that you are reading is small, but means that there is no parallelism at the read step.
Or, drop the filter from the query and apply it as a transform in Dataflow. That means reading more data, however.
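
A minimal sketch of the second workaround (filtering in the pipeline instead of in the query). Assumptions: `at_least` and `build_read` are hypothetical names made up for illustration, the module path `apache_beam.io.gcp.datastore.v1.datastoreio` is the one used by Beam releases that ship the v1 Datastore connector, and entities are converted to plain dicts with `datastore_helper.get_value` before filtering. Treat it as a starting point, not a tested recipe.

```python
def at_least(name, threshold):
    """Hypothetical helper: build a predicate that keeps records with
    record[name] >= threshold. Records without the property are dropped."""
    def predicate(record):
        value = record.get(name)
        return value is not None and value >= threshold
    return predicate


def build_read(pipeline, project, query):
    """Read with a query that has NO inequality filter, then filter in Beam.

    Imports are local so this module still loads where Beam is not installed.
    The module path below is an assumption about the installed Beam version.
    """
    import apache_beam as beam
    from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
    from googledatastore import helper as datastore_helper

    def to_dict(entity):
        # Convert the entity protobuf's property map into plain Python values.
        return {key: datastore_helper.get_value(value)
                for key, value in entity.properties.items()}

    return (
        pipeline
        | 'ReadAll' >> ReadFromDatastore(project, query)
        | 'ToDict' >> beam.Map(to_dict)
        | 'KeepMatching' >> beam.Filter(at_least('foo_attribute', 'bar_value')))
```

The trade-off is the one mentioned above: the read can still be sharded, but every entity of the kind is read and the unwanted ones are discarded afterwards.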

How is it possible that this doesn't work? Isn't this entire example (https://github.com/amygdala/gae-dataflow) built around the fact that they read in tweets that are from the last 4 days? Specifically this code:

def make_query(kind):
  """Creates a Cloud Datastore query to retrieve all entities with a
  'created_at' date > N days ago.
  """
  days = 4
  now = datetime.datetime.now()
  earlier = now - datetime.timedelta(days=days)

  query = query_pb2.Query()
  query.kind.add().name = kind

  datastore_helper.set_property_filter(query.filter, 'created_at',
                                       PropertyFilter.GREATER_THAN,
                                       earlier)

  return query

For the workaround, where do I set num_splits? It doesn't seem to be an argument to datastore_helper.set_property_filter (https://github.com/GoogleCloudPlatform/google-cloud-datastore/blob/master/python/googledatastore/helper.py#L357)

If I am not wrong, num_splits is an argument to the Datastore reader, not to the filter helper.
Also, if inequality filters are used, the SDK takes care of it itself: it automatically creates only one query, and the query isn't split.
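
To make the "where does num_splits go" part concrete, here is a rough sketch. It assumes the reader is `ReadFromDatastore` from `apache_beam.io.gcp.datastore.v1.datastoreio` and that it accepts `project`, `query`, `namespace` and `num_splits` keyword arguments; check this against the SDK version you have installed. `unsplit_read_args` and `read_unsplit` are hypothetical helper names, not part of the SDK.

```python
def unsplit_read_args(project, query, namespace=None):
    """Hypothetical helper: keyword arguments for a single-shard read.

    num_splits=1 asks the reader not to shard the query, so a query with an
    inequality filter is issued as one query (no parallelism at the read step).
    """
    return {
        'project': project,
        'query': query,
        'namespace': namespace,
        'num_splits': 1,
    }


def read_unsplit(pipeline, project, query):
    """Attach the unsplit Datastore read to a pipeline.

    The import is local so this module still loads where Beam is not
    installed. The module path is an assumption about the Beam version.
    """
    from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore

    return pipeline | 'ReadUnsplit' >> ReadFromDatastore(
        **unsplit_read_args(project, query))
```

So the filter is still set on the query with `datastore_helper.set_property_filter`, and `num_splits` is passed separately when the read transform is constructed.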

We moved to Apache Beam!

Google Cloud Dataflow for Python is now Apache Beam Python SDK and the code development moved to the Apache Beam repo.

If you want to contribute to the project (please do!) use this Apache Beam contributor's guide. Closing out this issue accordingly.