kubernetes-retired/kube-aws

Migrating from kube-aws 0.14 to 0.15 issues

paalkr opened this issue · 5 comments

@andersosthus and I have test-migrated a few clusters from kube-aws 0.14.3 to 0.15.1/0.15.2, and we have discovered a few issues:

  1. #1832 introduced a problem with the cloud-controller-manager #1833. The issue was fixed in #1834 and included in the kube-aws 0.15.2 release, thanks @davidmccormick

  2. Using etcd.memberIdentityProvider: eni introduces a problem when cleaning up the etcd stack after migration, because the control-plane stack imports the Etcd0PrivateIP, Etcd1PrivateIP and Etcd2PrivateIP exports, and these exports are no longer part of the rendered etcd CloudFormation stack in 0.15. A temporary workaround is to edit etcd.json.tmpl after doing a render stack and temporarily add the missing exports, as shown below. This lets the update continue; afterwards the added outputs can be removed and a new update issued.

  },
  "Outputs": {
    "Etcd0PrivateIP": {
      "Description": "The private IP for etcd node 0",
      "Value": "10.9.151.115",
      "Export": {
        "Name": {
          "Fn::Sub": "${AWS::StackName}-Etcd0PrivateIP"
        }
      }
    },
    "Etcd1PrivateIP": {
      "Description": "The private IP for etcd node 1",
      "Value": "10.9.180.114",
      "Export": {
        "Name": {
          "Fn::Sub": "${AWS::StackName}-Etcd1PrivateIP"
        }
      }
    },
    "Etcd2PrivateIP": {
      "Description": "The private IP for etcd node 2",
      "Value": "10.9.219.3",
      "Export": {
        "Name": {
          "Fn::Sub": "${AWS::StackName}-Etcd2PrivateIP"
        }
      }
    },    
    "StackName": {
      "Description": "The name of this stack which is used by node pool stacks to import outputs from this stack",
      "Value": { "Ref": "AWS::StackName" }
    }
    {{range $index, $etcdInstance := $.EtcdNodes }},
    "{{$etcdInstance.LogicalName}}FQDN": {
      "Description": "The FQDN for etcd node {{$index}}",
      "Value": {{$etcdInstance.AdvertisedFQDN}}
    }
    {{- end}}
    {{range $n, $r := .ExtraCfnOutputs -}}
    ,
    {{quote $n}}: {{toJSON $r}}
    {{- end}}
  }
}
  3. The export-existing-etcd-state.service, which is responsible for exporting the data from the old etcd cluster and preparing the export files on disk in /var/run/coreos/etcdadm/snapshots, takes so long that the CloudFormation stack might roll back. Even on a close to empty cluster the migration can take many minutes; migrating a very small cluster with only a few resources took 45 minutes.

The etcd stack rollback timeout is based on the CreateTimeout of the controller https://github.com/kubernetes-incubator/kube-aws/blob/b34d9b69069321111d3ca3e24c53fdba8ccecd2c/builtin/files/stack-templates/etcd.json.tmpl#L365, which is a little confusing: you actually have to increase controller.createTimeout in cluster.yaml to give the etcd migration more time.
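For reference, this is roughly what that looks like in cluster.yaml. The controller.createTimeout key is the one mentioned above; the PT45M value is only an illustration and assumes the ISO 8601 duration form that CloudFormation itself uses, so verify the accepted format against your kube-aws version:

  # cluster.yaml (excerpt): give the stack more time before CloudFormation
  # gives up and rolls back. PT45M is an ISO 8601 duration (45 minutes);
  # size it to how long the etcd export/restore is expected to take.
  controller:
    createTimeout: PT45M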

CloudFormation does not allow more than 60 minutes of wait time there, so I fear that the etcd migration process will not work for larger clusters.
Using a WaitCondition https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-waitcondition.html that receives a heartbeat signal from the migration script might be a workable approach, for example:

  WaitForEtcdMigration:
    Type: AWS::CloudFormation::WaitCondition
    CreationPolicy:
      ResourceSignal:
        Timeout: PT2H # can be more than 60 minutes
        Count: 1
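To complete the sketch, the migration script (or a oneshot unit ordered after export-existing-etcd-state.service) would have to send that signal itself. Nothing like this exists in kube-aws today; the snippet below assumes the AWS CLI is available on the node and uses the WaitForEtcdMigration logical ID from the template above, with STACK_NAME and AWS_REGION as placeholders:

  # Hypothetical completion signal sent at the very end of the migration.
  # STACK_NAME and AWS_REGION are placeholders for values the script already
  # knows (for example from stack parameters or instance metadata).
  aws cloudformation signal-resource \
    --stack-name "${STACK_NAME}" \
    --logical-resource-id WaitForEtcdMigration \
    --unique-id "$(hostname)" \
    --status SUCCESS \
    --region "${AWS_REGION}"

Note that a CreationPolicy counts completion signals rather than heartbeats, so the Timeout still has to cover the whole migration; the advantage is simply that it can be set well beyond 60 minutes (up to 12 hours).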

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

I think this is important enough to /remove-lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten