bentoml/Yatai

Yatai 2.0 Proposal

parano opened this issue · 2 comments

Introduction and background

It has been over a year since the initial release of Yatai 1.0. Thank you for all your trust and support for the project!

Recognition and Highlights

We've learned a lot from working with the BentoML developer community and are planning a major update to the Yatai project. In this post, I'd like to share our thoughts on the direction we are taking, gather feedback, and call for OSS contributors to join us in building Yatai version 2.0.

Learnings

  • Setup complexity: the number one complaint from Yatai users is that the K8s setup is far too intrusive from a DevOps and operational perspective. Currently Yatai requires multiple stateful components (e.g. a database) and multiple namespaces (for image building, system components, and deployed workloads). Yatai's own user system doesn't work well with the RBAC system in Kubernetes. Much of this setup complexity buys only relatively small UX gains.

  • Custom image building process: this is the top requested feature in the BentoML Slack community (also #483): can users put a pre-built BentoML Docker image's registry URL in the BentoDeployment YAML instead of a bento tag? The Yatai image builder is very hard to optimize or debug (e.g. #457) without heavy investment on the cloud infrastructure side, making it less useful than originally intended.

  • Customizing the deployment process: adding custom labels or resource annotations to deployed containers, e.g. #486.

  • Stateful components like databases are either very expensive (e.g. using RDS) or unreliable and hard to operate (in-cluster instances).

  • The Elastic 2 license is relatively restrictive and limits who we can partner with to build this project.
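To make the image and customization requests above concrete, here is a minimal sketch of what a BentoDeployment manifest could look like if it accepted a pre-built image URL plus custom labels and annotations. Everything in it is an assumption for illustration: the `apiVersion` value, the `image`, `podLabels`, and `podAnnotations` field names, and the `build_bento_deployment` helper are hypothetical and not part of any released Yatai API.

```python
import json


def build_bento_deployment(name, image, labels=None, annotations=None):
    """Return a hypothetical BentoDeployment manifest as a plain dict.

    All field names under "spec" are illustrative assumptions, not the
    actual Yatai CRD schema.
    """
    return {
        "apiVersion": "example.invalid/v0",  # hypothetical group/version
        "kind": "BentoDeployment",
        "metadata": {"name": name},
        "spec": {
            # Hypothetical: a pre-built image URL instead of a bento tag.
            "image": image,
            # Hypothetical: labels/annotations copied onto deployed pods.
            "podLabels": dict(labels or {}),
            "podAnnotations": dict(annotations or {}),
        },
    }


if __name__ == "__main__":
    manifest = build_bento_deployment(
        name="fraud-detector",
        image="registry.example.com/team/fraud-detector:1.0",
        labels={"team": "risk"},
        annotations={"example.com/owner": "ml-platform"},
    )
    # JSON is valid YAML, so this output could be piped to a manifest file.
    print(json.dumps(manifest, indent=2))
```

With a shape like this, the controller would only need to pull and run the referenced image, so the in-cluster image builder would no longer sit on the deployment path.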

Goals of Yatai 2.0

The main goal for 2.0 is to focus on our core value proposition: Yatai was built for scaling BentoML deployments on Kubernetes, and that's the main reason most teams come to Yatai. We'd like to double down on that single value proposition and make it work extremely well toward the promise of scalable AI deployment. This also means we may remove features that increase complexity without contributing to the core benefits.

Simplify setup for DevOps teams

  • Offer a single “BentoDeployment CRD controller” component via Helm and fully embrace the cloud native design, simplifying both the onboarding and advanced customizations.
  • Allow custom integration with other cloud-native tools, such as Knative, ArgoCD, Istio, Prometheus/Grafana, Jaeger, Elasticsearch, Loki, etc.
  • Remove stateful components (RDS, Docker Registry) and replace them with support for a custom Docker registry and an S3-based model/bento store.

Feature Highlights

  • Support for Distributed Service deployment mode in BentoML 1.2
  • Offer optional "playbooks" for additional features such as ingress settings, monitoring setup, deployment dashboard, and model store integrations.
  • Offer a Kubernetes-native workflow that integrates well with the broader ecosystem (e.g. support for K8s RBAC authorization, custom ingress control, and ArgoCD deployment pipelines)
  • Support for external message queue for long-running inference tasks and async API endpoints
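The last item above follows a common submit-then-poll pattern: the API enqueues a task and returns an ID immediately, a worker processes it later, and the client checks the status. Below is a minimal sketch of that pattern. It is not a Yatai or BentoML design: the `TaskQueue` class is hypothetical, and the standard-library `queue.Queue` stands in for whatever external message queue would actually be used.

```python
import queue
import threading
import uuid


class TaskQueue:
    """In-process stand-in for an external queue plus one worker."""

    def __init__(self, handler):
        self._handler = handler          # the long-running inference function
        self._pending = queue.Queue()    # stand-in for the external queue
        self._results = {}               # task_id -> {"status", "result"}
        self._lock = threading.Lock()
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, payload):
        """Enqueue a task and return its ID without waiting for the result."""
        task_id = uuid.uuid4().hex
        with self._lock:
            self._results[task_id] = {"status": "pending", "result": None}
        self._pending.put((task_id, payload))
        return task_id

    def status(self, task_id):
        """Return a copy of the current status record for a task."""
        with self._lock:
            return dict(self._results[task_id])

    def wait_all(self):
        """Block until every submitted task has been processed."""
        self._pending.join()

    def _run(self):
        # Worker loop: take one task at a time and record its outcome.
        while True:
            task_id, payload = self._pending.get()
            try:
                update = {"status": "done", "result": self._handler(payload)}
            except Exception as exc:
                update = {"status": "failed", "result": str(exc)}
            with self._lock:
                self._results[task_id] = update
            self._pending.task_done()


if __name__ == "__main__":
    tasks = TaskQueue(handler=lambda x: x * 2)
    task_id = tasks.submit(21)
    tasks.wait_all()
    print(tasks.status(task_id))
```

In a real deployment the queue and result store would live outside the serving process, so that tasks survive pod restarts and workers can scale independently of the API servers.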

Open Governance

  • Moving from Elastic 2 to Apache 2.0 License
  • Explore partnership opportunities with open source foundations. Contact us!
  • Call for contributors. Join the #yatai channel, introduce yourself and share your thoughts.

Tentative Timeline and Milestones

  • March-May, 2024: Gathering CFP feedback and finish initial design draft and
  • May-August, 2024: Community meeting on project updates
  • September-October, 2024: Yatai 2.0 Beta Release

Migration to 2.0

Due to the change in scope, we expect Yatai 2.0 to have some APIs that are incompatible with 1.0. The exact migration plan needs additional design work and depends on design decisions that are still pending. We will offer office hours around the time of the 2.0 release to assist your team with the migration process.

Call for Contributors

BentoML is a small team supporting many customers and community users. We'd love to get your help in building Yatai 2.0! We will need help with writing code, docs, testing, and early feedback on its design.

To get started, please join the #yatai channel in the BentoML Slack community, introduce yourself, and join our Yatai community meeting.

So exciting!

Re: removing stateful components.

Would this include secrets management?

Re: the change in scope will mean breaking changes

Is this primarily referring to

  1. the removal of stateful components
  2. the deprecation of the ImageBuilder workflow, where Docker images are built from bentos once they are selected for a deployment?

Question: Are additional potential enhancements to Yatai outside the scope of this proposal? E.g.

  • Async endpoints option
  • A/B testing of bentos
  • Shipping of logs and metrics to 3rd party backends e.g. Datadog/NewRelic

Question: would development of Yatai enhancements (such as these) be blocked until the release of 2.0?

Question: Is it possible that Yatai 2.0 could be decoupled from Kubernetes? I.e. run on other orchestrators such as OpenShift, AWS ECS, etc.? (and Kubernetes as well)

hutm commented

is there a timeline for 2.0?