feast-dev/feast

Add Hive Support

bennfocus opened this issue · 11 comments

I need to use Hive as the offline store and Redis as the online store.
Hive support is not included in the current 0.11+ roadmap.
I'd like to work on this, so I'm adding this issue for tracking.

FYI, since there are not many choices for a Hive Python client, I will use Impyla (rather than PyHive) as a dependency.

woop commented

Thanks @baineng. You simply need to create a new OfflineStore class. Some more details here https://docs.feast.dev/feast-on-kubernetes/user-guide/extending-feast#custom-offlinestore.

I'd recommend keeping the class as an external dependency of Feast at the start (a new package). We can link to it from our docs and include it in our tests, but the repo can start out as yours. You can reference this class from the feature_store.yaml by using the class path. We will automatically pick it up using https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/offline_stores/helpers.py#L8
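A minimal sketch of what such an external plugin might look like (module and class names here are hypothetical; the method signatures only approximate the OfflineStore interface described in the linked docs):

```python
# hive_offline_store.py -- hypothetical module in an external package.
# feature_store.yaml would then reference it by full class path, e.g.:
#   offline_store:
#     type: feast_hive.hive_offline_store.HiveOfflineStore

class HiveOfflineStore:
    """Sketch of a custom offline store. In a real plugin this would
    subclass feast's OfflineStore base class; kept dependency-free here."""

    def get_historical_features(self, config, feature_views, feature_refs,
                                entity_df, registry, project):
        # Point-in-time join of entity_df against Hive tables goes here.
        raise NotImplementedError

    def pull_latest_from_table_or_query(self, data_source, join_key_columns,
                                        feature_name_columns,
                                        event_timestamp_column,
                                        created_timestamp_column,
                                        start_date, end_date):
        # Used during materialization to read the latest feature rows.
        raise NotImplementedError
```

Feast then only needs the class path from the config to import and instantiate the store.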

Thanks for the info @woop, that's fine for me, I will go that way.

Hello @woop @YikSanChan,

FYI, I have implemented part of the Hive offline store and would like to catch up with you, and probably get some feedback, before continuing.
Could you have a look when you have time?

The repo is here: https://github.com/baineng/feast-hive

Basically it's very similar to the BigQuery implementation.

Some thoughts I had while checking the Feast code:

  • The DataSource abstractions in the source code are separated from offline_store and online_store, but they are actually coupled (the BigQuery store relies on BigQuerySource, ...); DataSources should probably live together with their store implementations.
  • Some of the DataSource code is hardcoded and not easy to extend, such as DataSource.from_proto(data_source).
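One possible way to generalize that dispatch (a sketch, not Feast's actual code; the registry, the decorator, and HiveSource's fields are all illustrative, and a plain dict stands in for the proto message) would be a registry keyed by which options the source carries:

```python
# Hypothetical sketch: replace hardcoded if/elif branches in
# DataSource.from_proto with a pluggable registry of constructors.

_SOURCE_REGISTRY = {}

def register_source(options_kind):
    """Register a DataSource class for a given source-options kind."""
    def decorator(cls):
        _SOURCE_REGISTRY[options_kind] = cls
        return cls
    return decorator

def data_source_from_proto(proto):
    """Dispatch on the source kind instead of hardcoded branches."""
    kind = proto["type"]  # stand-in for e.g. proto.WhichOneof("options")
    try:
        return _SOURCE_REGISTRY[kind].from_proto(proto)
    except KeyError:
        raise ValueError(f"No DataSource registered for {kind!r}")

@register_source("hive_options")
class HiveSource:
    def __init__(self, table):
        self.table = table

    @classmethod
    def from_proto(cls, proto):
        return cls(table=proto["table"])
```

An external package like feast-hive could then register its source without touching Feast's core.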

Some questions:

  • Should I add protos for HiveSource and HiveOptions in my repo as well?
  • I saw you had some discussion about removing query from BigQuerySource. What do you think about leaving it out of HiveOfflineStore? (Hive gives end users more control, e.g. by creating views.)
  • I saw that get_historical_features() of BigQueryOfflineStore uploads entity_df to BigQuery first and then runs the point-in-time query.
    I think I will need to do the same for Hive, but there is no efficient way to upload a DataFrame to Hive other than writing to HDFS directly. I don't want to add an HDFS client as another dependency, so I will use multi-row inserts. What do you think?

FYI, I will change the Python client from Impyla to Ibis for better data-writing support.

woop commented

This is so awesome @baineng. Thank you for working on Hive support!

The DataSource abstractions in the source code are separated from offline_store and online_store, but they are actually coupled (the BigQuery store relies on BigQuerySource, ...); DataSources should probably live together with their store implementations.

Yes, we agree. Needs to be cleaned up.

Some of the DataSource code is hardcoded and not easy to extend, such as DataSource.from_proto(data_source).

Yea, we realize this as well. Good catch. It needs to be generalized.

Should I add protos for HiveSource and HiveOptions in my repo as well?

I believe the answer is yes. We need to store the source/config in the registry, so it needs to exist, and I don't think it should be in the main repo.

I saw you had some discussion about removing query from BigQuerySource. What do you think about leaving it out of HiveOfflineStore? (Hive gives end users more control, e.g. by creating views.)

I prefer not to introduce query.

I saw that get_historical_features() of BigQueryOfflineStore uploads entity_df to BigQuery first and then runs the point-in-time query.
I think I will need to do the same for Hive, but there is no efficient way to upload a DataFrame to Hive other than writing to HDFS directly. I don't want to add an HDFS client as another dependency, so I will use multi-row inserts. What do you think?

You do have to do the same, but I share your intuition. It would be nice if we didn't couple to HDFS but instead wrote directly to Hive. We could also allow users to provide a reference to a table where their entity df is already available within Hive, so the upload step can be skipped. That way there is at least one efficient path for Hive users.
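Supporting both paths could be as simple as accepting either a DataFrame or a table name (a sketch under assumed names; this is not Feast's actual API, and resolve_entity_table, upload_fn, and the temp table name are all illustrative):

```python
# Accept either an in-memory entity DataFrame (uploaded via INSERTs)
# or the name of a table that already exists in Hive.
import pandas as pd

def resolve_entity_table(entity_df, upload_fn, temp_table="feast_entity_df"):
    """Return the name of the Hive table holding the entity rows."""
    if isinstance(entity_df, str):
        # Caller passed a table name already present in Hive: skip upload.
        return entity_df
    if isinstance(entity_df, pd.DataFrame):
        upload_fn(temp_table, entity_df)  # e.g. batched multi-row INSERTs
        return temp_table
    raise TypeError("entity_df must be a DataFrame or a Hive table name")
```

The point-in-time join SQL then only ever references a table name, regardless of which form the caller provided.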

Thanks for your feedback @woop, it definitely helps me with the next steps.

FYI, since the Feast internals related to OfflineStore and DataSource are changing, I have postponed the Hive support implementation until the next Feast release.

Thanks for the heads up @baineng, I think we're done with most of the refactoring. If you start development off the master branch, you should be okay. Please let us know if you encounter any bugs!

Great, thanks for the update @achals, I will catch up and start soon.

FYI,
I have just published the first stable version to PyPI; I think it's ready for use now.
Please create an issue in the repo if you run into any problems.

@woop @achals Please review when you have time; I'd appreciate any feedback.