Comcast/sirius

Slow node eventually DDoSes itself attempting to catch up with others

joercampbell opened this issue · 2 comments

When a single node in a cluster is slow (in our case, its network interface was operating at 10 Mb/s while its neighbors were operating at 1 Gb/s), it eventually falls further and further behind the other nodes in the cluster in terms of processing updates. As it falls behind, it attempts to 'catch up' with its friends, which exacerbates its existing slowness: it effectively DDoSes itself by requesting catch-up information from those same friends. The slow node then causes queues to build on the other nodes until eventually one or more of them suffers a full Java GC, which for our installation (50 GB heap) stops the entire JVM for 2 minutes. That causes additional queues to fill and pushes the entire cluster to fall apart.

So:

  1. A slow node in a cluster (in our case caused by an interface at 10 Mb/s when the others are at 1 Gb/s) starts falling significantly behind its friends on data updates.
  2. As the node falls behind, it starts asking friend nodes in the cluster for updates to catch up.
  3. This in turn causes the slow node to DDoS itself by flooding its slow interface with catch-up traffic.
  4. That causes queues on the other boxes to start filling with catch-up messages bound for the slow node.
  5. The queues are essentially unbounded, eventually landing the entire cluster in a really bad state.
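
To make steps 4 and 5 concrete, here is a minimal sketch of one possible mitigation: a bounded per-peer queue for outbound catch-up replies that drops new messages once it is full instead of growing without limit. This is not Sirius code and not necessarily how any release addresses the problem; the class name `BoundedCatchupQueue`, the `String` payload, and the capacity are all hypothetical, chosen only for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: a bounded queue of catch-up replies destined for one peer.
// When the peer cannot drain fast enough, new replies are dropped rather than
// accumulating on the heap of the sending node.
final class BoundedCatchupQueue {
    private final BlockingQueue<String> pending;
    private final AtomicLong dropped = new AtomicLong();

    BoundedCatchupQueue(int capacity) {
        // ArrayBlockingQueue has a fixed capacity set at construction time.
        this.pending = new ArrayBlockingQueue<>(capacity);
    }

    // Non-blocking enqueue: returns true if queued, false if the queue was
    // full and the message was dropped (the drop is counted).
    boolean offer(String message) {
        boolean accepted = pending.offer(message);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    // Next message to send to the peer, or null if nothing is pending.
    String poll() {
        return pending.poll();
    }

    int size() {
        return pending.size();
    }

    long droppedCount() {
        return dropped.get();
    }
}
```

With a capacity of 2, a third `offer` returns `false` and the drop counter becomes 1, so memory on the healthy nodes stays bounded no matter how slowly the lagging peer drains. The trade-off is that the slow node has to re-request whatever was dropped, so a real implementation would also want to limit how many catch-up requests the slow node keeps in flight.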

@joercampbell : Can this now be closed?

Yes. We'll be pulling in the 1.2.1 release some time in the nearish future; if we have any problems we'll open a new issue. Thanks.