kubeflow/testing

Alternative solution to removal of test on optional-test-infra

annajung opened this issue · 31 comments

[UPDATE June 14th, 2022] This is no longer a blocker for the 1.6 release, as all WGs have pivoted to using GitHub Actions as a short-term solution

This is a blocker for the 1.6 release

This issue is to track an alternative solution to the recent removal of existing presubmit and postsubmit tests on optional-test-infra.

As of May 31st,

  • Testing clusters are deleted and not available for use
  • All WGs who leveraged AWS optional-test-infra are blocked by the removal and need to find an alternative until a long term solution is found
  • KServe WG started working on migrating to GitHub Actions with KinD clusters and shared this as a possible alternative. However, this may not work for other WGs due to resource constraints
  • AWS folks are willing to provide credits but need to work out logistics on creating a non-personal AWS account and how to get AWS credits associated with it. (Previously, a personal AWS account was used.)

References

@kubeflow/wg-automl-leads @kubeflow/wg-training-leads @kubeflow/wg-notebooks-leads @kubeflow/wg-pipeline-leads @kubeflow/wg-manifests-leads @pvaneck @yuzisun @kubeflow/release-team @akartsky @surajkota

Hi WG leads, there is a discussion happening in parallel with AWS, but I would also like to kick off a discussion to see if you would be transitioning to a different CI/CD pipeline before the release. If so, do you have any timelines in mind?

jlewi commented

Hi folks; it's been a minute.

As a bit of additional context, the issue of a sustainable and scalable approach to test infrastructure was identified almost two years ago in #737.

What is the current thinking in terms of making each WG responsible for its own test-infra?

@annajung I have created a new private slack channel with all the working group leads to organize the creation of separate AWS accounts for each Working Group, and get AWS credits applied to them.

As discussed in the Community Meeting 05/31, each Working Group will need to set up its own testing infrastructure, as the optional-test-infra has been deleted.

AWS is willing to provide credits to the Working Groups for testing, to facilitate this please do the following:

  • Each WG must delegate a single “testing manager” who will be the point of contact between AWS and their WG about credits/etc
  • Each WG creates their own AWS account for any tests they may need to run
  • Each “testing manager” must send an AWS Pricing Calculator estimate of the costs for their testing infra (in this slack channel)
  • Each “testing manager” will work with AWS to get credits applied to their AWS account (in this slack channel)

The above is based on the assumption that a decentralised infra is a more scalable and sustainable approach in the long term

@johnugeorge @kimwnasptd @yuzisun @james-jwu @annajung This seems like a very generous offer from @surajkota. Would you please reply in a timely manner with your perspective / response? Thanks!

Thanks @surajkota !!

Overall sounds good. Pipelines will continue to use GCP infra provided by Google. /cc @zijianjoy

Hi everyone, here is an update from the June 6th release team meeting

  • As a short-term solution, affected WGs are migrating to GitHub Actions for the 1.6 release while working in parallel for a long term solution
  • @pvaneck from KServe, @kimwnasptd from Notebooks and Manifest were present in the meeting and confirmed they will be able to meet the new feature freeze deadline of June 15th using their short term solution
  • @surajkota from AWS was present in the meeting and has taken action items to find out answers to a few questions about creating a personal AWS account and the use of the previous AWS registry, see details in the meeting notes
  • As mentioned by @surajkota before, each WG should go ahead and create an AWS account and work with Suraj and @akartsky to leverage AWS credit / infrastructure for testing

As for the 1.6 release, need to confirm with @johnugeorge @andreyvelich to see if Training Operators and AutoML WG can meet the June 15th feature freeze, if not, will work with them to determine if another extension is needed and update the community accordingly

If anyone needs assistance setting up GitHub CI for their working group, etc., please reach out (on this issue or in Kubeflow Slack). I'm not an expert, but I have some experience and am happy to put some time in.

Some notes from the June 7th Community meeting,

  • Johnu from Katib and Training Operators confirmed they are migrating to GitHub Actions and should be ready by June 15th
  • All WGs are focused on short term solutions and the 1.6 release to get themselves unblocked, will review the long term solutions afterward

Update: I was trying to find out if there is a way to create an AWS account without payment info such as a credit card, but haven't found any documentation or a way to get an exception for it. I will reach out to some more folks and will update if it is possible.

General requirement as per Customer Agreement, all AWS Accounts must have a valid form of payment to access our services. https://aws.amazon.com/agreement/

Hi folks, unfortunately, creating an AWS account requires adding valid payment information. I could not find any way to request an exception for this, and creating an internal AWS account would not work for our use case.

Here is a proposal which aims to create a maintainable and sustainable path forward on this:

Creating a maintainable AWS Account

  1. Create a non-personal email which can be owned/shared by WG leads. This way it can be handed over if people change over time.
  2. Create an AWS account associated with this email with valid payment information. Let's call this the management account.
  3. Since we want to create separate accounts for each WG, instead of creating individual accounts, we can create an organization within the management account and create member accounts for each WG within this organization. The benefit is that organizations have consolidated billing, so the AWS credits can be applied to the management account and shared by member accounts.
    • With this approach WGs will have flexibility w.r.t. accounts, e.g., each working group can decide to create a testing account and a separate production account, or maybe one production account for hosting released artifacts (samples, container images, charts etc.) for the whole of Kubeflow
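
As a rough illustration of the layout proposed above, the sketch below models one management account holding a shared credit pool with one member account per WG. It is plain Python with no AWS calls. The class names, account names, email addresses, and dollar figures are all made up for illustration, and in a real AWS Organization the consolidated billing is done by AWS itself rather than by code like this.

```python
from dataclasses import dataclass, field


@dataclass
class MemberAccount:
    """One account owned by a single working group (hypothetical model)."""
    name: str
    email: str
    spend: float = 0.0  # USD charged to this account so far


@dataclass
class Organization:
    """Management account plus member accounts sharing one credit pool."""
    management_email: str
    credits: float  # USD of credits applied to the management account
    members: dict = field(default_factory=dict)

    def add_member(self, name: str, email: str) -> MemberAccount:
        if name in self.members:
            raise ValueError(f"member account {name!r} already exists")
        account = MemberAccount(name=name, email=email)
        self.members[name] = account
        return account

    def record_spend(self, name: str, amount: float) -> None:
        # Usage is tracked per member account ...
        self.members[name].spend += amount

    def remaining_credits(self) -> float:
        # ... but is paid from the single shared pool (consolidated billing).
        return self.credits - sum(m.spend for m in self.members.values())


# Placeholder values only; none of these come from the actual accounts.
org = Organization(management_email="kf-management@example.com", credits=1000.0)
org.add_member("notebooks-test", "kf-notebooks-test@example.com")
org.add_member("training-test", "kf-training-test@example.com")
org.record_spend("notebooks-test", 120.0)
org.record_spend("training-test", 80.5)
print(org.remaining_credits())  # 799.5
```

The point of the model is that each WG keeps an isolated account while only the management account needs credits and payment information.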

Making this approach sustainable in the long term

  1. Next, I want to propose creating a mechanism which can be used to ensure AWS account is accessible and funded appropriately throughout the year.
    • Add an item to the release checklist(in the beginning of release cycle) in which WGs:
      • i. Would review the credits spent and remaining over the last quarter and determine if there are sufficient credits to complete the current release. If there is a need to renew for the NEXT release, connect with AWS. This will allow sufficient time on both sides to complete the process
      • ii. Baseline the accounts to make sure only active members have permissions to the account
      • iii. Baseline the list of point of contact information from AWS
      • iv. Add/update the information related to accounts, maintainers, AWS points of contacts, infrastructure access etc. to a document or README
      • (Optional) We can add POC from AWS to have access to these resources if WGs thinks it is needed(this was brought up in some discussions)
  2. Set up Alarms to detect when account is running low on credits or spending exceeds expectation
  3. Set up best practices and guidelines for adding people to the account, deploying using IaC, etc., which can be fleshed out later
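
To make the credit review and the alarm idea above more concrete, here is a minimal sketch of the check a WG could run at the start of a release cycle. It is only arithmetic in plain Python. The function names, the 1.2 overspend factor, and all the numbers are assumptions for illustration; an AWS-native mechanism such as AWS Budgets would likely be the actual place to configure alerts.

```python
def months_of_runway(remaining_credits: float, monthly_spend: float) -> float:
    """Months the remaining credits last at the current monthly spend."""
    if monthly_spend <= 0:
        return float("inf")
    return remaining_credits / monthly_spend


def credit_alerts(remaining_credits, monthly_spend, expected_monthly_spend,
                  months_to_release, overspend_factor=1.2):
    """Return a list of human-readable alerts; an empty list means all clear."""
    alerts = []
    runway = months_of_runway(remaining_credits, monthly_spend)
    if runway < months_to_release:
        alerts.append(
            f"credits last {runway:.1f} months, release needs {months_to_release}"
        )
    # overspend_factor is a hypothetical tolerance above the planned spend.
    if monthly_spend > expected_monthly_spend * overspend_factor:
        alerts.append(
            f"monthly spend {monthly_spend:.2f} exceeds "
            f"expectation {expected_monthly_spend:.2f}"
        )
    return alerts


# Placeholder figures, not real account data.
for alert in credit_alerts(remaining_credits=900.0, monthly_spend=400.0,
                           expected_monthly_spend=300.0, months_to_release=4):
    print(alert)
```

With these placeholder figures both conditions fire, which would be the signal to contact AWS about renewing credits before the next release.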

Action item: The question about adding valid payment information to the account still remains, and hence I would like to ask: is there any other organization/company which is willing to partner here by adding valid payment information to the management account described above?

Please let us know what the community thinks about this proposal.

cc @akartsky @jaypipes

@kubeflow/wg-manifests-leads @kubeflow/wg-automl-leads @kubeflow/wg-notebooks-leads @yuzisun @james-jwu

Please let me know what the community thinks about the above proposal assuming we have a partner for payment information. This will help us with #1008 as well and hence a timely response will be helpful.

Thank you very much for driving this @surajkota!

This proposal seems solid for allowing all WGs to share the same credit pool. Thumbs up from manifests and notebooks.

Add an item to the release checklist(in the beginning of release cycle) in which WGs

I really like this approach as well. Since it will ensure we have a cadence for the status checks.

@surajkota thanks for moving this forward. I have been asking companies (that provide integration services for Kubeflow on AWS) to support this effort. I believe that we need to scope the effort, i.e. one headcount is needed to 1) manage the accounts and credits and 2) configure, operate, and tear down the clusters on the testing infra, 3) for a defined period of time, i.e. 12 months. Additionally, the responsibilities and SLA need to be defined, e.g. only for the current release (1.6), with change requests tracked, acknowledged and implemented based on a simple approval process. Finally, IMO, the companies that provide testing infrastructure and related services should be given a special designation by the Community. This is an investment that test infra operators are making and (IMO) the Community should provide a designation / benefit back to these contributors.

Thank you @surajkota. The proposal looks solid. We do a lot of work with Kubeflow, my company MavenCode will be able to provide the needed partnership support to get this going.

Thank you @surajkota @jbottum for the proposal. We are very much interested in contributing to the effort. I am part of dkube.io and our product DKube is built on top of Kubeflow and MLflow and provides MLOps and Monitoring solutions to enterprise customers.

We look forward to partnering with other community members and providing the needed support

@surajkota I believe that we said that interested parties should respond by COB today. It appears that we have Arrikto, Maven Code, One Convergence and @ca-scribner offering support. @annajung Perhaps we should ask the contributors to select a Working Group to support? @kimwnasptd @charlesa101 @songole do you have a preference for a working group to support? I think it would be good to have representatives from 2+ companies in each working group.

@jbottum We would like to represent the following working groups: AutoML, Pipelines, Training and Serving.

@jbottum - automl, notebook, manifest, pipelines but we are open to support any other WG

@surajkota did you get a credit card for the AWS account from a partner? Do you need the credit card to move forward?

@kimwnasptd @songole @charlesa101 @ca-scribner In the Release team meeting today, we discussed next steps. We propose that the parties interested (Maven Code, One, Arrikto and CA-Scribner) should contact the Working Groups, and create a PR for the test-infra config and operations effort. The Issue/PR should propose a design for the test infra and support. Is that a reasonable request ?

Please note that this issue (1006) will be used to track the account set-up, and the config and operations of the test infra for each working group should have an independent issue / PR. @surajkota @annajung @DomFleischmann please confirm that I captured this correctly. Thanks.

@johnugeorge @kimwnasptd @pvaneck - @surajkota needs an estimate of each Working Group's expenses for the next 12 months. Please submit by Friday (July 1) for Manifests, Notebooks, Training, Katib/AutoML, and KServe. Please use the AWS Pricing Calculator (https://calculator.aws/#/). cc'ing @annajung

Hi everyone, the initial proposal required us to attach a credit card per WG account. The current proposal that uses AWS Organization approach requires only one credit card which needs to be added to the management account since it offers consolidated billing. I propose that we move forward with Arrikto's payment information for the management account since @kimwnasptd has been testing it out and was the first one to respond.

Thank you Maven Code, One Convergence, Arrikto and CA-Scribner for the interest in this initiative. Creating the management account is the first step of this project. It is exciting to see all the folks who are interested to contribute to this effort and I am confident the WGs will appreciate all the help they can get to make this effort useful for the product!

@surajkota One question. Will credit card again become a single point of failure for the management account similar to earlier personal account for AWS infra ? How can this be handled?

@johnugeorge All credit cards have an expiration date, so adding more than one would not add much value IMO. If one company wants to remove their payment info in the future, we will do another callout and also have this issue as a reference in case we want to reach out to others who expressed interest in this.

Apologies for the late reply here. First of all @songole @charlesa101 nice to meet you! I'm sure WGs would be more than happy to have some more engineering firepower for the testing, thank you very much for the interest!

As @surajkota described above we are splitting the testing infra migration into 2 orthogonal efforts:

  1. Establishing a process for a team that will be responsible for the root AWS account, that will be funded from AWS, as well as how the WGs can use that account in a secure manner
  2. Deciding on how the testing infra for the affected WGs will be, how will we set up CI/CD, ECR registries etc.

1. Management root account

For the first part we have made progress and created the initial management account, and we will create an AWS organization which WGs can join with an email they will own. The practical part of this is almost done, and what remains is for each WG to use the AWS Pricing Calculator to estimate their credit needs for the next 12 months.

The pricing calculation part is crucial as without it we can't bootstrap the process. So we kindly ask the WGs interested in this to provide such an estimate by the end of this week, early next one. This will hugely help @surajkota as well to push for this, since this will require some communication to get the credits in.
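
For WGs that want a quick sanity check before filling in the AWS Pricing Calculator, a back-of-the-envelope 12-month estimate could look like the sketch below. The function name and every input value are placeholders, not real AWS prices or real WG usage, and the calculator remains the authoritative source for the numbers submitted.

```python
def annual_ci_cost(hourly_rate: float, instances: int, hours_per_run: float,
                   runs_per_month: int, months: int = 12) -> float:
    """Rough compute-only estimate in USD; ignores storage, registry, and network."""
    per_run = hourly_rate * instances * hours_per_run
    return round(per_run * runs_per_month * months, 2)


# Placeholder inputs: 3 nodes at a made-up 0.20 USD/hour,
# 30-minute test runs, 200 runs per month.
estimate = annual_ci_cost(hourly_rate=0.20, instances=3,
                          hours_per_run=0.5, runs_per_month=200)
print(estimate)
```

Clusters that stay up between runs would need to be costed by wall-clock hours rather than per run, so this only fits ephemeral test clusters.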

For Notebooks WG @thesuperzapper and I are already in the process of calculating the cost and will post an update tomorrow.

Lastly, we are preparing a basic proposal for the team responsible for the management account. Specifically we want to document:

  1. What are the selection criteria for members of that team
  2. What are the expectations and time commitments from that team
  3. Actions and setup that needs to happen within that account

2. Setting up the infra per WG

@songole @charlesa101 @ca-scribner for this part I highly suggest reaching out to the WGs you are interested in to discuss your thoughts and expertise on how to set up the CI/CD. We can then form proposals and even generalize a solution across WGs once we have a solid understanding and approach.

You can find links for all the WG's calendars and info in https://github.com/kubeflow/community/blob/master/wgs.yaml#L80

cc @kubeflow/wg-automl-leads @kubeflow/wg-training-leads @pvaneck @yuzisun

Hi everyone, thanks to all the WGs for the estimates, we got the management account created and credits approved!

Next steps: I have a draft of the design with the next steps on this document:
https://docs.google.com/document/d/1Z3K4q21Vko6SzQDu2JSov9DO2fRehsDB_X9Z663fym4/edit?usp=sharing and we will be looking into setting up the AWS organization to get this going.

As we previously discussed, each WG can choose its own test/release infrastructure depending on its requirements, and it would be running in separate accounts. I am looking for contributors and WGs to come up with the requirements for the testing infrastructure or a proposal for the infrastructure based on their requirements (Infrastructure per WG section of the doc). I have laid out a high-level expectation for each of the sections; please ping me or request access on the doc if you would like to contribute. Can we target having a draft by 07/27?
cc @songole @charlesa101 @ca-scribner @kubeflow/wg-automl-leads @kubeflow/wg-training-leads @kubeflow/wg-notebooks-leads @kubeflow/wg-manifests-leads @pvaneck @yuzisun

Thanks @surajkota. Someone from my team would start with training wg. @mak-454 @anil3

Thanks, @surajkota - We can start with the notebooks & manifest wg. Thank you!

Hi @kubeflow/wg-automl-leads, @kubeflow/wg-training-leads @kubeflow/wg-manifests-leads, @kubeflow/wg-notebooks-leads, @pvaneck

We are looking into creating the AWS organization and organizational units for each WG using infrastructure as code, based on the proposal in this doc. Everyone has already seen the brief overview on this issue, but the document goes into more detail. If you have any comments or would like to contribute to any of the TODO items, please let us know.

Following are the things we need from your end:

  1. We want to use an IaC tool and not have manual creation for the AWS organization. Does the community have a preference for using CDK or Terraform?
  2. Please send me the email address you would like to use for your WG account by EOD 07/20. Let me know if we should go ahead and create one on your behalf. We can create something like kf-wg-manifests-test@gmail.com, kf-wg-training-test@gmail.com and share it with each of the WGs.

Please let me know if you want to designate anyone else in the WG for this.
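
If the addresses end up being created on behalf of the WGs, the naming pattern suggested above could be generated consistently with a small helper like this one. The helper name, the `purpose` parameter, and the WG list are assumptions for illustration; only the `kf-wg-<wg>-test@gmail.com` pattern comes from the examples in this comment.

```python
def wg_account_email(wg: str, purpose: str = "test",
                     domain: str = "gmail.com") -> str:
    """Build an account email following the kf-wg-<wg>-<purpose> pattern."""
    slug = wg.strip().lower().replace(" ", "-")
    return f"kf-wg-{slug}-{purpose}@{domain}"


# Hypothetical WG list; adjust to the WGs that actually need an account.
for wg in ["manifests", "training", "notebooks", "automl"]:
    print(wg_account_email(wg))
```

Keeping the purpose in the address leaves room for a separate production account per WG later, as mentioned in the original proposal.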