Research: pipelines via PipelineAI, Pachyderm, StreamSets

Question

williewheeler opened this issue 6 years ago · 3 comments

Research PipelineAI, Pachyderm, and StreamSets to determine whether any of them might be a good fit for our ML pipeline needs. Some key needs include:

Shipping data to S3 (ideally we would like to keep using AWS Glue + Athena here)
Scheduling model build jobs
Support for hyperparameter optimization (HPO) and automated model selection
Hosting training/prediction workloads (any language and ML framework)

Answer 1 · 2018-10-31T03:00:38.000Z

@djsutho @tkamenov-expedia Once you start this card, can you prioritize helping us understand what the data sync piece looks like? We need this to get model training in place.

Answer 2 · 2018-11-05T04:24:39.000Z

I have started researching PipelineAI. It supports the following environments: Hosted Community Edition, Docker, Kubernetes, and AWS SageMaker. So far I have tested and served sample predictions from the Hosted Community Edition (online solution) and from Docker (locally). It seems efficient for ML pipelines (train and predict), but we might want to pair PipelineAI with a specialised data pipeline to prepare the data.

Answer 3 · 2019-02-19T14:04:49.000Z

Closing, as we decided to treat the ML pipeline as external to AA's scope. AA proper is just the runtime from aa-metrics to the anomalies topic. Internally we do have an attached pipeline, but it's not part of AA itself.