Create test non-blocking job to automatically run on all/most-relevant PRs
adelina-t opened this issue · 7 comments
Recent changes in kubernetes have, on a couple of occasions, left our test passes unable to create the cluster. Examples: flags changed, or features were removed that aks-engine did not know how to handle. This left the periodic jobs down until we managed to identify the offending PRs and accommodate the changes in aks-engine.
A non-blocking presubmit job that would run on all / most-relevant PRs in kubernetes/kubernetes would be very useful in helping us find these situations ahead of time.
There are a few blockers for this idea:
- We do not have the resources to run a full Conformance run on each and every PR.
- A Conformance run, like the staging one, takes about 2-2.5 hours to complete, and there is a high chance of flakes.
To get around these issues, I propose that we:
- Run the job only on a subset of changes: we can filter by file location. For example, there is no need to run tests if a change only touches kubernetes/cluster, kubernetes/hack, etc.
- Run only a subset of Conformance tests that are known to have a high chance of passing. This will reduce the time of every run.
A proposed list of tests (a sketch of how such a presubmit could be configured follows the list):
[k8s.io] [sig-node] Events should be sent by kubelets and the scheduler about pods scheduling and running [Conformance]
[k8s.io] [sig-node] Pods Extended [k8s.io] Delete Grace Period should be submitted and removed [Conformance]
[k8s.io] [sig-node] PreStop should call prestop when killing a pod [Conformance]
[k8s.io] Container Lifecycle Hook when create a pod with lifecycle hook should execute poststart exec hook properly [NodeConformance] [Conformance]
[k8s.io] Container Lifecycle Hook when create a pod with lifecycle hook should execute prestop http hook properly [NodeConformance] [Conformance]
[k8s.io] Container Runtime blackbox test when running a container with a new image should be able to pull from private registry with secret [NodeConformance]
[k8s.io] InitContainer [NodeConformance] should invoke init containers on a RestartAlways pod [Conformance]
[k8s.io] Pods should get a host IP [NodeConformance] [Conformance]
[k8s.io] Pods should support remote command execution over websockets [NodeConformance] [Conformance]
[k8s.io] Pods should support retrieving logs from the container over websockets [NodeConformance] [Conformance]
[k8s.io] Probing container should *not* be restarted with a /healthz http liveness probe [NodeConformance] [Conformance]
[k8s.io] Probing container should be restarted with a /healthz http liveness probe [NodeConformance] [Conformance]
[sig-api-machinery] Secrets should be consumable from pods in env vars [NodeConformance] [Conformance]
[sig-apps] Deployment deployment should delete old replica sets [Conformance]
[sig-apps] Deployment deployment should support rollover [Conformance]
[sig-apps] Deployment RecreateDeployment should delete old pods and create new ones [Conformance]
[sig-apps] ReplicaSet should adopt matching pods on creation and release no longer matching pods [Conformance]
[sig-apps] ReplicationController should release no longer matching pods [Conformance]
[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications [Conformance]
[sig-network] DNS should provide DNS for ExternalName services [Conformance]
[sig-network] DNS should provide DNS for services [Conformance]
[sig-network] DNS should provide DNS for the cluster [Conformance]
[sig-network] Proxy version v1 should proxy logs on node using proxy subresource [Conformance]
[sig-network] Proxy version v1 should proxy logs on node with explicit kubelet port using proxy subresource [Conformance]
[sig-network] Proxy version v1 should proxy through a service and a pod [Conformance]
[sig-network] Services should serve a basic endpoint from pods [Conformance]
[sig-network] Services should serve multiport endpoints from pods [Conformance]
[sig-node] ConfigMap should be consumable via the environment [NodeConformance] [Conformance]
[sig-node] Downward API should provide container's limits.cpu/memory and requests.cpu/memory as env vars [NodeConformance] [Conformance]
[sig-node] Downward API should provide pod name, namespace and IP address as env vars [NodeConformance] [Conformance]
[sig-storage] ConfigMap binary data should be reflected in volume [NodeConformance] [Conformance]
[sig-storage] ConfigMap should be consumable from pods in volume [NodeConformance] [Conformance]
[sig-storage] Downward API volume should provide container's cpu limit [NodeConformance] [Conformance]
[sig-storage] Downward API volume should provide podname only [NodeConformance] [Conformance]
[sig-storage] Downward API volume should update labels on modification [NodeConformance] [Conformance]
[sig-storage] EmptyDir volumes pod should support shared volumes between containers [Conformance]
[sig-storage] EmptyDir wrapper volumes should not conflict [Conformance]
[sig-storage] HostPath should support r/w [NodeConformance]
[sig-storage] HostPath should support subPath [NodeConformance]
[sig-storage] Projected combined should project all components that make up the projection API [Projection][NodeConformance] [Conformance]
[sig-storage] Projected configMap should be consumable from pods in volume with mappings [NodeConformance] [Conformance]
[sig-storage] Projected downwardAPI should provide container's memory request [NodeConformance] [Conformance]
[sig-storage] Projected downwardAPI should provide node allocatable (memory) as default memory limit if the limit is not set [NodeConformance] [Conformance]
[sig-storage] Projected secret should be consumable in multiple volumes in a pod [NodeConformance] [Conformance]
[sig-storage] Secrets optional updates should be reflected in volume [NodeConformance] [Conformance]
[sig-storage] Secrets should be able to mount in a volume regardless of a different secret existing with same name in different namespace [NodeConformance] [Conformance]
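To make this concrete, here is a minimal sketch of what such a presubmit entry could look like in the test-infra job config. It assumes a Prow presubmit with `optional: true` (so it never blocks merge) and path filtering via `run_if_changed`; the job name, the image, and the aks-engine-specific kubetest arguments are placeholders rather than an existing job definition, and the trigger regex is only illustrative:

```yaml
presubmits:
  kubernetes/kubernetes:
  - name: pull-kubernetes-e2e-aks-engine-azure-windows   # placeholder name
    optional: true        # non-blocking: result is reported but never gates merge
    always_run: false
    decorate: true
    branches:
    - master
    # Trigger only for changes that can realistically affect us; changes that
    # only touch e.g. cluster/ or hack/ would not match. Regex is illustrative.
    run_if_changed: '^(cmd/|pkg/|staging/|test/e2e/)'
    spec:
      containers:
      - image: gcr.io/example/aks-engine-e2e:v1.0   # placeholder image
        args:
        - --provider=skeleton      # placeholder: however the existing periodic
        - --deployment=aksengine   # aks-engine jobs invoke kubetest today
        - --test
        # Focus regex assembled from the curated list above, e.g.
        # "should get a host IP|should provide DNS for services|..."
        - --test_args=--ginkgo.focus=<subset regex>
```

The `run_if_changed` filter and the `optional` flag are the two pieces that make this feasible resource-wise: the job only spends an Azure cluster on PRs that could actually break us, and a red run never blocks anyone.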
> - We do not have the resources to run a full Conformance run on each and every PR.
> - A Conformance run, like the staging one, takes about 2-2.5 hours to complete, and there is a high chance of flakes.
I think "conformance" tests should not take that long to run to completion. In our (GCE) tests, it's around 90 minutes.
> - Run only a subset of Conformance tests that are known to have a high chance of passing. This will reduce the time of every run.
@adelina-t what are the flaky tests, and do you know why they aren't stable?
I think "conformance" tests should not take that long to run to completion. In our (GCE) tests, it's around 90 minutes.
For us it's actually building the binaries and creating the cluster (deployment + prepulling images) that takes around one hour :(. The actual testing on a clean run takes on average 1h. In GCE I see the tests actually take < 1h to complete, but we run on fewer parallel test nodes: you have 8, I believe; we run on only 4.
> @adelina-t what are the flaky tests, and do you know why they aren't stable?
DNS tests have a higher chance of failing in aks-engine-deployed clusters, most probably because of azure-cni. We're experimenting with newer versions of azure-cni to see if/when we can move testing to that.
> For us it's actually building the binaries and creating the cluster (deployment + prepulling images) that takes around one hour :(. The actual testing on a clean run takes on average 1h. In GCE I see the tests actually take < 1h to complete, but we run on fewer parallel test nodes: you have 8, I believe; we run on only 4.
I see. We reserve 15 minutes just for prepulling the test images.
Have you tried running with more parallelism? We have 3 Windows nodes in a test cluster, and we use 8 parallel test nodes (like you said), but I haven't tried bumping it above 8 to see if it'd still be stable. More test VMs may also help...
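In case it helps, a rough sketch of what bumping the parallelism could look like, as a fragment of the container args in the presubmit sketch above. It assumes the job drives the suite through kubetest; the `--ginkgo-parallel` flag and the numbers are assumptions about the harness, not something verified against the aks-engine jobs:

```yaml
        args:
        - --test
        # Assumption: kubetest-style flag; the aks-engine runs reportedly use 4
        # parallel test nodes today and GCE uses 8, so try something higher and
        # watch whether flakes increase.
        - --ginkgo-parallel=12
        - --test_args=--ginkgo.focus=<subset regex>
```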
> DNS tests have a higher chance of failing in aks-engine-deployed clusters, most probably because of azure-cni. We're experimenting with newer versions of azure-cni to see if/when we can move testing to that.
Ack. I think we should target running all (compatible) conformance tests, and explicitly blacklist tests (e.g., the DNS tests) for certain cloud providers to work around known issues.
Maintaining a different list would add more maintenance overhead for this project. We should try avoiding that if we can... :-)
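To sketch the skip-list variant (reusing the presubmit sketch above): focus on everything tagged [Conformance] and keep only a short provider-specific skip regex. The skip entries are placeholders; also note that, if I remember correctly, kubetest splits `--test_args` on whitespace, so spaces inside a regex need to be written as \s:

```yaml
        args:
        - --test
        # Run all Conformance tests; maintain only a per-provider skip regex
        # (e.g. the DNS tests mentioned above) instead of a curated focus list.
        - --test_args=--ginkgo.focus=\[Conformance\] --ginkgo.skip=<provider-specific skip regex>
```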
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
In response to this:
> Rotten issues close after 30d of inactivity.
> Reopen the issue with /reopen.
> Mark the issue as fresh with /remove-lifecycle rotten.
> Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
> /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.