Migrating from kube-aws 0.14 to 0.15 issues
paalkr opened this issue · 5 comments
@andersosthus and I have test-migrated a few clusters from kube-aws 0.14.3 to 0.15.1/0.15.2, and we have discovered a few issues:
- #1832 introduced a problem with the cloud-controller-manager (#1833). The issue was fixed in #1834 and included in the kube-aws 0.15.2 release; thanks @davidmccormick.
- Using etcd.memberIdentityProvider: eni introduces a problem when cleaning up the etcd stack after migration, because the control-plane stack imports the Etcd0PrivateIP, Etcd1PrivateIP and Etcd2PrivateIP exports, and these exports are no longer part of the rendered etcd CloudFormation stacks in 0.15. A temporary workaround is to edit etcd.json.tmpl after doing a render stack and temporarily add the missing exports back in. This lets the update complete; afterwards the added values can be removed and a new update issued.
  },
  "Outputs": {
    "Etcd0PrivateIP": {
      "Description": "The private IP for etcd node 0",
      "Value": "10.9.151.115",
      "Export": {
        "Name": {
          "Fn::Sub": "${AWS::StackName}-Etcd0PrivateIP"
        }
      }
    },
    "Etcd1PrivateIP": {
      "Description": "The private IP for etcd node 1",
      "Value": "10.9.180.114",
      "Export": {
        "Name": {
          "Fn::Sub": "${AWS::StackName}-Etcd1PrivateIP"
        }
      }
    },
    "Etcd2PrivateIP": {
      "Description": "The private IP for etcd node 2",
      "Value": "10.9.219.3",
      "Export": {
        "Name": {
          "Fn::Sub": "${AWS::StackName}-Etcd2PrivateIP"
        }
      }
    },
    "StackName": {
      "Description": "The name of this stack which is used by node pool stacks to import outputs from this stack",
      "Value": { "Ref": "AWS::StackName" }
    }
    {{range $index, $etcdInstance := $.EtcdNodes }},
    "{{$etcdInstance.LogicalName}}FQDN": {
      "Description": "The FQDN for etcd node {{$index}}",
      "Value": {{$etcdInstance.AdvertisedFQDN}}
    }
    {{- end}}
    {{range $n, $r := .ExtraCfnOutputs -}}
    ,
    {{quote $n}}: {{toJSON $r}}
    {{- end}}
  }
}
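The manual edit above can also be scripted. Below is a minimal sketch (Python, not part of kube-aws) that patches the missing PrivateIP exports into the Outputs section of an already-rendered etcd stack JSON, rather than the Go template itself. The function name and the example IPs are illustrative; in practice you would read the IPs from your running etcd nodes.

```python
import json

def add_private_ip_exports(template, private_ips):
    """Add the EtcdNPrivateIP outputs (with exports) that the 0.15
    rendered etcd stack no longer contains.

    template    -- the parsed, already-rendered etcd stack JSON
    private_ips -- current etcd node private IPs, in node order
    """
    outputs = template.setdefault("Outputs", {})
    for index, ip in enumerate(private_ips):
        name = "Etcd{}PrivateIP".format(index)
        outputs[name] = {
            "Description": "The private IP for etcd node {}".format(index),
            "Value": ip,
            # Export name must match what the control-plane stack imports.
            "Export": {"Name": {"Fn::Sub": "${AWS::StackName}-" + name}},
        }
    return template

# Illustrative usage with placeholder data:
template = {"Outputs": {"StackName": {"Value": {"Ref": "AWS::StackName"}}}}
patched = add_private_ip_exports(
    template, ["10.9.151.115", "10.9.180.114", "10.9.219.3"]
)
print(json.dumps(sorted(patched["Outputs"])))
```

Once the control-plane stack update has gone through, the same script logic can be inverted (delete the three keys) before issuing the final update.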
- The export-existing-etcd-state.service, which is responsible for exporting data from the old etcd cluster and preparing the export files on disk in /var/run/coreos/etcdadm/snapshots, takes so long that the CloudFormation stack might roll back. Even on a nearly empty cluster the migration can take many minutes; migrating a very small cluster with only a few resources took 45 minutes.
The etcd stack rollback timeout is based on the CreateTimeout of the controller (https://github.com/kubernetes-incubator/kube-aws/blob/b34d9b69069321111d3ca3e24c53fdba8ccecd2c/builtin/files/stack-templates/etcd.json.tmpl#L365), which is a little confusing: you actually have to increase controller.createTimeout to increase the etcd wait time.
CloudFormation does not allow more than 60 minutes of wait time here, so I fear the etcd migration process will not work for larger clusters.
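In cluster.yaml, the workaround for now is to raise that value before running the migration. A sketch of what this might look like (the key is the one referenced in the linked template; verify against your kube-aws version):

```yaml
controller:
  # Despite the name, this value also bounds how long CloudFormation
  # waits for the etcd migration before rolling back (see the linked
  # etcd.json.tmpl). ISO 8601 duration.
  createTimeout: PT60M
```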
Using a WaitCondition (https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-waitcondition.html) that receives a heartbeat signal from the migration script might be a functional approach:
WaitForEtcdMigration:
  Type: AWS::CloudFormation::WaitCondition
  CreationPolicy:
    ResourceSignal:
      Timeout: PT2H # can be more than 60 minutes
      Count: 1
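For this to work, the migration unit itself would have to send the success signal once the export completes, e.g. with cfn-signal from aws-cfn-bootstrap. A sketch of a systemd drop-in (the unit name is real, but the binary path, stack name and region are placeholder assumptions):

```ini
# Hypothetical drop-in for export-existing-etcd-state.service
[Service]
# On success, tell CloudFormation the migration finished so the
# WaitForEtcdMigration condition stops waiting (Count: 1 above).
ExecStartPost=/opt/bin/cfn-signal --success true \
    --stack my-etcd-stack \
    --resource WaitForEtcdMigration \
    --region eu-west-1
```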
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
I think this is important enough to /remove-lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten