kubernetes-sigs/windows-testing

Create a non-blocking test job to automatically run on all/most-relevant PRs

adelina-t opened this issue · 7 comments

Recent changes in kubernetes have caused our test runs to fail at cluster creation on a couple of occasions: for example, flags changed or features were removed in ways that aks-engine did not know how to handle. This left the periodic jobs down until we managed to identify the offending PRs and accommodate the changes in aks-engine.

A non-blocking presubmit job that would run on all / most-relevant PRs in kubernetes/kubernetes would be very useful in helping us find these situations ahead of time.

There are a few blockers for this idea:

  1. We do not have the resources to run a full Conformance run on each and every PR.
  2. A conformance run, like the staging one, takes about 2 to 2.5 hours to complete, and there is a high chance of flakes.

To get around these issues, I propose that we:

  1. Run the job only on a subset of changes, filtered by file location. For example, there is no need to run tests if a change only touches kubernetes/cluster, kubernetes/hack, etc. (A sketch of such a path filter follows this list.)
  2. Run only a subset of Conformance tests that are known to have a high chance of passing. This will reduce the time of every run.
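
To make point 1 concrete, here is a minimal Go sketch of the kind of path filter this would need. The directory names below are only examples, not a vetted list; the real filter would ultimately live in the job config, since Prow's run_if_changed / skip_if_changed fields take a regex of the same RE2 flavor:

```go
// Sketch only: decide whether a PR touches anything the Windows job cares about.
// The skip pattern is illustrative and would need review before use.
package main

import (
	"fmt"
	"regexp"
)

// Paths assumed (for illustration) not to affect Windows test coverage.
var skipIfOnlyChanged = regexp.MustCompile(`^(cluster/|hack/|docs/)`)

func shouldRun(changedFiles []string) bool {
	for _, f := range changedFiles {
		if !skipIfOnlyChanged.MatchString(f) {
			return true // at least one relevant change, run the job
		}
	}
	return false
}

func main() {
	fmt.Println(shouldRun([]string{"hack/verify-gofmt.sh"}))                       // false
	fmt.Println(shouldRun([]string{"pkg/kubelet/kubelet.go", "hack/lib/util.sh"})) // true
}
```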

A proposed list of tests (a sketch of how this subset could be turned into a single ginkgo focus expression follows the list):

[k8s.io] [sig-node] Events should be sent by kubelets and the scheduler about pods scheduling and running  [Conformance]
[k8s.io] [sig-node] Pods Extended [k8s.io] Delete Grace Period should be submitted and removed [Conformance]
[k8s.io] [sig-node] PreStop should call prestop when killing a pod  [Conformance]
[k8s.io] Container Lifecycle Hook when create a pod with lifecycle hook should execute poststart exec hook properly [NodeConformance] [Conformance]
[k8s.io] Container Lifecycle Hook when create a pod with lifecycle hook should execute prestop http hook properly [NodeConformance] [Conformance]
[k8s.io] Container Runtime blackbox test when running a container with a new image should be able to pull from private registry with secret [NodeConformance]
[k8s.io] InitContainer [NodeConformance] should invoke init containers on a RestartAlways pod [Conformance]
[k8s.io] Pods should get a host IP [NodeConformance] [Conformance]
[k8s.io] Pods should support remote command execution over websockets [NodeConformance] [Conformance]
[k8s.io] Pods should support retrieving logs from the container over websockets [NodeConformance] [Conformance]
[k8s.io] Probing container should *not* be restarted with a /healthz http liveness probe [NodeConformance] [Conformance] 
[k8s.io] Probing container should be restarted with a /healthz http liveness probe [NodeConformance] [Conformance]
[sig-api-machinery] Secrets should be consumable from pods in env vars [NodeConformance] [Conformance]
[sig-apps] Deployment deployment should delete old replica sets [Conformance]
[sig-apps] Deployment deployment should support rollover [Conformance]
[sig-apps] Deployment RecreateDeployment should delete old pods and create new ones [Conformance]
[sig-apps] ReplicaSet should adopt matching pods on creation and release no longer matching pods [Conformance]
[sig-apps] ReplicationController should release no longer matching pods [Conformance]
[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications [Conformance]
[sig-network] DNS should provide DNS for ExternalName services [Conformance]
[sig-network] DNS should provide DNS for services  [Conformance]
[sig-network] DNS should provide DNS for the cluster  [Conformance]
[sig-network] Proxy version v1 should proxy logs on node using proxy subresource  [Conformance]
[sig-network] Proxy version v1 should proxy logs on node with explicit kubelet port using proxy subresource  [Conformance]
[sig-network] Proxy version v1 should proxy through a service and a pod  [Conformance]
[sig-network] Services should serve a basic endpoint from pods  [Conformance]
[sig-network] Services should serve multiport endpoints from pods  [Conformance]
[sig-node] ConfigMap should be consumable via the environment [NodeConformance] [Conformance]
[sig-node] Downward API should provide container's limits.cpu/memory and requests.cpu/memory as env vars [NodeConformance] [Conformance]
[sig-node] Downward API should provide pod name, namespace and IP address as env vars [NodeConformance] [Conformance]
[sig-storage] ConfigMap binary data should be reflected in volume [NodeConformance] [Conformance]
[sig-storage] ConfigMap should be consumable from pods in volume [NodeConformance] [Conformance]
[sig-storage] Downward API volume should provide container's cpu limit [NodeConformance] [Conformance]
[sig-storage] Downward API volume should provide podname only [NodeConformance] [Conformance]
[sig-storage] Downward API volume should update labels on modification [NodeConformance] [Conformance]
[sig-storage] EmptyDir volumes pod should support shared volumes between containers [Conformance]
[sig-storage] EmptyDir wrapper volumes should not conflict [Conformance]
[sig-storage] HostPath should support r/w [NodeConformance]
[sig-storage] HostPath should support subPath [NodeConformance]
[sig-storage] Projected combined should project all components that make up the projection API [Projection][NodeConformance] [Conformance]
[sig-storage] Projected configMap should be consumable from pods in volume with mappings [NodeConformance] [Conformance]
[sig-storage] Projected downwardAPI should provide container's memory request [NodeConformance] [Conformance]
[sig-storage] Projected downwardAPI should provide node allocatable (memory) as default memory limit if the limit is not set [NodeConformance] [Conformance]
[sig-storage] Projected secret should be consumable in multiple volumes in a pod [NodeConformance] [Conformance]
[sig-storage] Secrets optional updates should be reflected in volume [NodeConformance] [Conformance]
[sig-storage] Secrets should be able to mount in a volume regardless of a different secret existing with same name in different namespace [NodeConformance] [Conformance]
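
For reference, the subset above could be handed to the e2e runner as a single --ginkgo.focus expression. A minimal Go sketch of building that expression from the list (only a few names shown; the point is the escaping of the bracketed tags):

```go
// Minimal sketch: join a hand-picked set of test names into a --ginkgo.focus value.
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// A few entries from the proposed list above; the full list would go here.
var focusTests = []string{
	"[k8s.io] Pods should get a host IP [NodeConformance] [Conformance]",
	"[sig-network] DNS should provide DNS for ExternalName services [Conformance]",
	"[sig-storage] ConfigMap should be consumable from pods in volume [NodeConformance] [Conformance]",
}

func main() {
	escaped := make([]string, 0, len(focusTests))
	for _, name := range focusTests {
		// Escape [ ] and other regex metacharacters in the test names.
		escaped = append(escaped, regexp.QuoteMeta(name))
	}
	fmt.Printf("--ginkgo.focus=%s\n", strings.Join(escaped, "|"))
}
```
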
  • We do not have the resources to run a full Conformance run on each and every PR.
  • A conformance run, like the staging one, takes about 2 to 2.5 hours to complete, and there is a high chance of flakes.

I think "conformance" tests should not take that long to run to completion. In our (GCE) tests, it's around 90 minutes.

  • Run only a subset of Conformance tests that are known to have a high chance of passing. This will reduce the time of every run.

@adelina-t what are the flaky tests, and do you know why they aren't stable?

I think "conformance" tests should not take that long to run to completion. In our (GCE) tests, it's around 90 minutes.

For us it's actually building the binaries and creating the cluster (deployment + prepulling images) that take around one hour :( . The actual testing on a clean run takes on average 1 hour. In GCE I see the tests actually take < 1 hour to complete, but we run with fewer parallel test nodes: you have 8 I believe, while we run with only 4.

@adelina-t what are the flaky tests, and do you know why they aren't stable?

DNS tests have a higher chance of failing in aks-engine deployed clusters, most probably because of azure-cni. We're experimenting with newer versions of azure-cni to see if/when we can move testing to that.

For us it's actually building the binaries and creating the cluster (deployment + prepulling images) that take around one hour :( . The actual testing on a clean run takes on average 1 hour. In GCE I see the tests actually take < 1 hour to complete, but we run with fewer parallel test nodes: you have 8 I believe, while we run with only 4.

I see. We reserve 15 minutes just for prepulling the test images.

Have you tried running with more parallelism? We have 3 Windows nodes in a test cluster, and we use 8 parallel nodes (like you said), but I haven't tried bumping it above 8 to see if it would still be stable. More test VMs may also help...

DNS tests have a higher chance of failing in aks-engine deployed clusters, most probably because of azure-cni. We're experimenting with newer versions of azure-cni to see if/when we can move testing to that.

Ack. I think we should target running all (compatible) conformance tests, and explicitly blacklist tests (e.g., the DNS tests) for certain cloud providers to work around known issues.
Maintaining a separate list would add more maintenance overhead for this project. We should try to avoid that if we can... :-)
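
If we go that route, the per-provider blacklist could stay very small: a map of extra --ginkgo.skip patterns layered on top of the common conformance focus. A rough Go sketch (the provider names and patterns below are illustrative only, not a maintained list):

```go
// Sketch of "one shared test list, per-provider skips": only known-bad tests
// are skipped for a given provider; everything else runs everywhere.
package main

import (
	"fmt"
	"strings"
)

// Illustrative skip patterns; e.g. DNS tests are flaky with azure-cni per the
// discussion above.
var skipByProvider = map[string][]string{
	"azure": {`\[sig-network\] DNS`},
	"gce":   {},
}

func skipArg(provider string) string {
	// Common skips (e.g. tests tagged [LinuxOnly]) plus the provider-specific ones.
	patterns := append([]string{`\[LinuxOnly\]`}, skipByProvider[provider]...)
	return "--ginkgo.skip=" + strings.Join(patterns, "|")
}

func main() {
	fmt.Println(skipArg("azure"))
}
```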

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.