ValueError: Query cannot have any inequality filters.
1byxero opened this issue · 5 comments
The above error is raised when the following query filter is used:
datastore_helper.set_property_filter(query.filter, 'foo_attribute', PropertyFilter.GREATER_THAN_OR_EQUAL, 'bar_value')
After googling the issue, I found that the same issue exists in the Java SDK.
My questions are:
1. Is there any workaround for this?
2. If I am using it incorrectly, what is the right way?
This is an underlying Datastore limitation, AIUI. For Java it appears in the client library. It's to do with how https://github.com/GoogleCloudPlatform/appengine-mapreduce/wiki/ScatterPropertyImplementation is implemented to calculate sharding.
Workaround: set num_splits=1 so it does not try to shard the query. That works okay if the amount of data that you are reading is small, but means that there is no parallelism at the read step.
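The `num_splits=1` workaround described above can be sketched like this. This is only a sketch against the Apache Beam Python SDK's `ReadFromDatastore` transform (the project id and the `query` variable are assumptions; `query` would be the `query_pb2.Query` built with `datastore_helper` as in the snippet above):

```python
import apache_beam as beam
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore

# `query` is a query_pb2.Query carrying the inequality filter,
# built with datastore_helper.set_property_filter(...) as shown above.
with beam.Pipeline() as p:
    entities = (
        p
        | 'ReadEntities' >> ReadFromDatastore(
            project='my-project',   # hypothetical project id
            query=query,
            num_splits=1))          # num_splits=1 disables query sharding,
                                    # so the inequality filter is accepted
```

With a single split there is no parallelism at the read step, so this is only practical for small result sets.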
Or, drop the filter and filter as a transform in Dataflow. That means reading more data, however.
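The second workaround (drop the Datastore-side filter and filter inside the pipeline) amounts to applying an ordinary predicate, e.g. via `beam.Filter`. A minimal sketch of such a predicate, assuming entities arrive as dicts with a `'created_at'` datetime (the names here are illustrative, not from any SDK):

```python
import datetime

def is_recent(entity, days=4, now=None):
    """Keep entities whose 'created_at' falls within the last `days` days."""
    now = now or datetime.datetime.now()
    cutoff = now - datetime.timedelta(days=days)
    return entity['created_at'] > cutoff

# In the pipeline, this replaces the Datastore-side inequality filter:
#   entities | beam.Filter(is_recent)
```

The trade-off stated above applies: the read now returns all entities of the kind, and the filtering happens after the (parallel) read.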
How is it possible that this doesn't work? Isn't this entire example (https://github.com/amygdala/gae-dataflow) built around the fact that they read in tweets that are from the last 4 days? Specifically this code:
def make_query(kind):
  """Creates a Cloud Datastore query to retrieve all entities with a
  'created_at' date > N days ago.
  """
  days = 4
  now = datetime.datetime.now()
  earlier = now - datetime.timedelta(days=days)
  query = query_pb2.Query()
  query.kind.add().name = kind
  datastore_helper.set_property_filter(query.filter, 'created_at',
                                       PropertyFilter.GREATER_THAN,
                                       earlier)
  return query
For the workaround, where do I set num_splits? It doesn't seem to be an argument to datastore_helper.set_property_filter
(https://github.com/GoogleCloudPlatform/google-cloud-datastore/blob/master/python/googledatastore/helper.py#L357)
If I am not wrong, num_splits is an argument to the Datastore reader.
Also, if inequality filters are used, the SDK handles it itself and automatically creates only one query; the query isn't split.
We moved to Apache Beam!
Google Cloud Dataflow for Python is now the Apache Beam Python SDK, and code development has moved to the Apache Beam repo.
If you want to contribute to the project (please do!), use the Apache Beam contributor's guide. Closing out this issue accordingly.