hjacobs/kubernetes-failure-stories

Failure story: hub.docker.com slow (5-10kB/s)

marek-obuchowicz opened this issue · 7 comments

A short, simple story, but it got on my nerves because I had to cancel a dinner.
In the end it was a huge decision-changer for my company. Everything looked good until this failure... I decided to delay our k8s-based hosting offerings for now :|

I used k8s cluster(s) on AWS, provisioned with kops.

Friday, 5 PM. Last task remaining for the week: change the instance size(s) of the k8s cluster. All services are correctly distributed over N nodes, so what could possibly go wrong?

hub.docker.com's CDN. I'm not sure where it's hosted, but for some reason it became extremely slow on AWS, with downloads at ca. 5-10 kB/s. It works like a charm in other AWS regions and in a non-AWS datacenter. It just does not work for me.

So... I had to cancel my evening plans (and rendered the cluster unavailable), because:

  • each node tries to start for >30 minutes, then fails and is re-added. Basic k8s services can't start, as container images can't be downloaded in a reasonable time. There's no quick fail: each time a node tries to start, pulling images is super slow and eventually times out
  • I had a "kops update" still running in a terminal window on my local workstation. I found no information on whether I could safely interrupt this operation, and it would disconnect if I unplugged the network cable from my laptop.
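
To put the numbers above in perspective, here is a rough back-of-the-envelope sketch of how long a single image pull takes at that rate. The 100 MB image size is an assumed, illustrative figure (nothing was measured on the cluster), and the function name is made up for this example.

```python
def pull_duration_seconds(image_bytes: int, rate_bytes_per_s: float) -> float:
    """Return how long downloading image_bytes takes at a constant rate."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate must be positive")
    return image_bytes / rate_bytes_per_s


# Assumption: a 100 MB image (illustrative size only).
IMAGE_BYTES = 100 * 1000 * 1000

for rate_kb in (5, 10):  # the 5-10 kB/s observed in this story
    hours = pull_duration_seconds(IMAGE_BYTES, rate_kb * 1000) / 3600
    print(f"{rate_kb} kB/s -> {hours:.1f} h")
```

Under that assumption a single image needs hours to arrive, far beyond the >30 minutes a node spends trying to start.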

Solution:

  • cancel your dinner
  • wait for some hours until CDN bandwidth stabilizes
  • rethink many, many times, if our company should offer production k8s services...

@marek-obuchowicz thanks for your story! Do you want to write this up in a nice format somewhere (blog or similar)?

Possibly, yes. I think it would be more valuable for the community if it also suggested a way to work around this issue while it's happening, or preferably to prevent it from happening. Do you have any suggestions?

@marek-obuchowicz we don't use Docker Hub directly, as we already saw years ago that it was dead slow (and you don't want to rely on an external dev platform without SLAs for your production workloads). I think one alternative on AWS is using ECR (and, if necessary, mirroring images from Docker Hub).
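
A minimal sketch of what such mirroring could look like, only building the plain docker pull / tag / push commands rather than running them. The account ID, region and image name are placeholders, the helper names are made up for this example, and it assumes the target ECR repository already exists and that docker is already authenticated against it.

```python
import shlex


def ecr_reference(image: str, account_id: str, region: str) -> str:
    """Prefix a Docker Hub image reference with a private ECR registry host."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{image}"


def mirror_commands(image: str, account_id: str, region: str) -> list:
    """Build the docker commands that copy one image from Docker Hub to ECR."""
    target = ecr_reference(image, account_id, region)
    return [
        ["docker", "pull", image],         # fetch from Docker Hub
        ["docker", "tag", image, target],  # re-tag for the ECR registry
        ["docker", "push", target],        # upload to ECR
    ]


if __name__ == "__main__":
    # Placeholder values, not taken from the story.
    for cmd in mirror_commands("nginx:1.25", "123456789012", "eu-central-1"):
        print(shlex.join(cmd))
```

Workloads would then reference the ECR copy, so node startup no longer depends on Docker Hub's bandwidth.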

I have to check with the kops team whether any of those solutions would be possible via kops configuration. If I find something worth sharing, I can get back to you with a nicer post. However, this is not super high on the prio list, at least for the next few days, which are pretty busy. I hope I'll be able to provide you with something in time for your conference talk (or however you want to publish it).

@marek-obuchowicz don't worry, I don't need it for a conference talk, we have enough failure stories of our own to share 😏

I used kops with ECR in the past, and yes, I went to the Docker stand at the last EU KubeCon to ask what the timeline is for shutting down the Hub. I'm sure people mine coins and host movies there :p
But yeah, it's still up-ish ;)

That's also why Red Hat recommends hosting a local registry, for exactly that: a kind of local cache
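
As a rough sketch of the client side of such a cache: the Docker daemon reads a "registry-mirrors" list from its daemon.json and tries those mirrors first for Docker Hub images. The snippet below only builds that JSON. The mirror URL is hypothetical, the helper name is made up, and it assumes the nodes run the Docker engine; whether kops can roll this out is not covered here.

```python
import json


def with_registry_mirror(config: dict, mirror_url: str) -> dict:
    """Return a copy of a Docker daemon config with mirror_url added as a mirror."""
    updated = dict(config)
    mirrors = list(updated.get("registry-mirrors", []))
    if mirror_url not in mirrors:  # avoid duplicate entries
        mirrors.append(mirror_url)
    updated["registry-mirrors"] = mirrors
    return updated


# Hypothetical mirror URL; replace with your own local registry cache.
MIRROR_URL = "https://registry-mirror.example.internal"

print(json.dumps(with_registry_mirror({}, MIRROR_URL), indent=2))
```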

Closing as there was no follow-up with a link.