audienceproject/spark-dynamodb

Query Functionality

The article published about the project says that "future improvements could see the query operation be given a part to play as well".

Is adding a query functionality to this project on the roadmap for the near future?
Thanks

Hi fogrid
Unfortunately we are not actively working on any improvements at the moment.
The idea behind this improvement would be to reduce consumed read throughput whenever filters present in the Spark query can be translated into conditions on the hash and/or range key.
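For illustration, a rough sketch of what that translation could look like (the toKeyCondition helper and its wiring are hypothetical; the Spark Filter and AWS SDK types are the real ones):

```scala
import com.amazonaws.services.dynamodbv2.model.{AttributeValue, QueryRequest}
import org.apache.spark.sql.sources.{EqualTo, Filter}

import scala.collection.JavaConverters._

// Hypothetical helper: if one of the filters Spark pushes down is an equality
// on the table's hash key, build a Query key condition from it; otherwise
// return None and fall back to the existing Scan path.
def toKeyCondition(filters: Seq[Filter], hashKey: String): Option[QueryRequest => QueryRequest] =
  filters.collectFirst {
    case EqualTo(attr, value: String) if attr == hashKey =>
      (request: QueryRequest) => request
        .withKeyConditionExpression("#hk = :hk")
        .withExpressionAttributeNames(Map("#hk" -> hashKey).asJava)
        .withExpressionAttributeValues(Map(":hk" -> new AttributeValue().withS(value)).asJava)
  }
```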
Do you currently have such a use case?
Thank you for your interest in the project :)

Yea, I want to get a small subset of keys from a very big table.
What would adding this functionality entail?

Hi fogrid
I imagine it would entail implementing a QueryPartition class similar to the existing ScanPartition, with the API query operation implemented alongside the other API calls in TableConnector.
This QueryPartition would then be used in place of ScanPartition in the method planInputPartitions in DynamoDataSourceReader, based on some analysis of the Spark schema and Dynamo table schema to determine if a query would be applicable (and expedient) to use.
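As a very rough skeleton (against the Spark 2.4 DataSource V2 interfaces; makeRequest and toInternalRow are placeholders for what TableConnector and the library's item-to-row conversion would supply):

```scala
import java.util.{Map => JMap}

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.{AttributeValue, QueryRequest}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.reader.{InputPartition, InputPartitionReader}

import scala.collection.JavaConverters._

// Skeleton only: `makeRequest` and `toInternalRow` stand in for the pieces
// that TableConnector and the item-to-row conversion would actually supply.
class QueryPartition(
    makeRequest: () => QueryRequest,
    toInternalRow: JMap[String, AttributeValue] => InternalRow
) extends InputPartition[InternalRow] {

  override def createPartitionReader(): InputPartitionReader[InternalRow] =
    new InputPartitionReader[InternalRow] {

      private val client = AmazonDynamoDBClientBuilder.defaultClient()
      private var buffer: Iterator[JMap[String, AttributeValue]] = Iterator.empty
      private var lastKey: JMap[String, AttributeValue] = _
      private var exhausted = false
      private var current: InternalRow = _

      override def next(): Boolean = {
        // Page through the query results; DynamoDB signals the end by
        // returning no LastEvaluatedKey.
        while (!buffer.hasNext && !exhausted) {
          val result = client.query(makeRequest().withExclusiveStartKey(lastKey))
          buffer = result.getItems.asScala.iterator
          lastKey = result.getLastEvaluatedKey
          exhausted = lastKey == null
        }
        if (buffer.hasNext) { current = toInternalRow(buffer.next()); true } else false
      }

      override def get(): InternalRow = current

      override def close(): Unit = client.shutdown()
    }
}
```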

You are most welcome to contribute :)
However, if you need only a small subset of data from your table, it would probably be less work for you to query Dynamo directly and put the data into a Spark DataFrame manually (and not use the library).
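Something along these lines would do it (a sketch only; `spark` is an active SparkSession, and the table, key and attribute names are made up):

```scala
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.{AttributeValue, QueryRequest}

import scala.collection.JavaConverters._

val client = AmazonDynamoDBClientBuilder.defaultClient()

// Query a single hash key instead of scanning the whole table.
// "MyBigTable" and the attribute names are placeholders.
val request = new QueryRequest()
  .withTableName("MyBigTable")
  .withKeyConditionExpression("#pk = :pk")
  .withExpressionAttributeNames(Map("#pk" -> "id").asJava)
  .withExpressionAttributeValues(Map(":pk" -> new AttributeValue().withS("some-key")).asJava)

// Single page only; for larger result sets, loop on getLastEvaluatedKey.
val rows = client.query(request).getItems.asScala.map { item =>
  (item.get("id").getS, Option(item.get("payload")).map(_.getS).orNull)
}

import spark.implicits._
val df = rows.toSeq.toDF("id", "payload")
```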

Thanks,
Jacob

I started working on this feature here: ee7c0f6
Can't promise anything but at least now there is a place to track progress 😀

Hi jacobfi,

Are you still actively working on adding support for query operation?

Hi ff-parasp
No sadly, I am not actively working on this library at the moment.

Hi,
I can work on this. @jacobfi, mind if I continue from your feature branch?

Hi amrnablus
You are most welcome.
However, be aware that the branch needs to be synced up with master, which underwent a lot of changes when it was migrated to Spark 3. A rebase is probably the way to go, to isolate the query-related changes on the feature branch and replay them on top of the new master.

Ah, good point! I'll just start a new feature branch from master and copy over your changes manually; this should be easier.
Thanks @jacobfi

Can I use the connector as the static DF in a stream-static join (based on the key)? Would it be efficient?
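That is, something like this (a sketch, using the read API from the README; the table and column names are made up, and the rate source is just a stand-in stream):

```scala
import com.audienceproject.spark.dynamodb.implicits._
import org.apache.spark.sql.functions.col

// Static side: the Dynamo table read through the connector
// (today this is a full Scan under the hood).
val staticDf = spark.read.dynamodb("MyLookupTable")

// Streaming side: a stand-in stream keyed the same way as the table.
val streamDf = spark.readStream.format("rate").load()
  .withColumn("id", col("value").cast("string"))

// Stream-static join on the key column.
val joined = streamDf.join(staticDf, "id")
```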