ExpediaGroup/adaptive-alerting

Research: pipelines via PipelineAI, Pachyderm, StreamSets

williewheeler opened this issue · 3 comments

Research PipelineAI, Pachyderm, and StreamSets to determine whether any of them might be a good fit for our ML pipeline needs. Key needs include:

  • Shipping data to S3 (ideally we would keep using AWS Glue+Athena here)
  • Scheduling model build jobs
  • Support for HPO and automated model selection
  • Hosting training/prediction workloads (any language and ML framework)

@djsutho @tkamenov-expedia Once you start this card, can you prioritize helping us understand what the data sync piece looks like? We need this to get model training in place.

I have started researching PipelineAI. It supports the following environments: Hosted Community Edition, Docker, Kubernetes, and AWS SageMaker. So far I have tested and served sample predictions from the Hosted Community Edition (the online solution) and from Docker (locally). It seems to work well for ML pipelines (train and predict), but we might want to pair PipelineAI with a specialised data pipeline to prepare the data.

Closing, as we decided to treat the ML pipeline as external to AA's scope. AA proper is just the runtime from aa-metrics to the anomalies topic. Internally we do have an attached pipeline, but it's not part of AA itself.