basho/riak

Upgrading to 3.0

MikaAK opened this issue · 5 comments

Is there a guide anywhere on upgrading to 3.0 from earlier versions like 2.9?

The standard Riak upgrade of adding a new, upgraded node into the cluster doesn't seem to be working, as we're met with out-of-memory issues. We tried increasing memory by 50% on the nodes and hit the same issue, so I'm wondering if there's another upgrade guide somewhere, or if anyone knows of another way to upgrade!

The standard way for any upgrade is to stop/update/start one node at a time across the cluster. There shouldn't be a need to do it by adding nodes unless you're changing storage backends.
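As a rough sketch of what that looks like on a single node in the rolling upgrade (package and service handling will vary with your install; the node name is a placeholder):

    riak stop                                        # stop Riak on this node only
    apt-get install riak                             # upgrade the riak package to the target version (pin as appropriate)
    riak start
    riak-admin wait-for-service riak_kv riak@node1   # wait until KV is serving on this node again
    riak-admin transfers                             # check handoffs have settled before moving to the next node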

Whichever way you go though, I wouldn't expect out of memory issues. This is something going unexpectedly wrong, as if you have triggered a bug. Do you have some information on your cluster you can share?

How many nodes;
Ring size;
Storage backend;
Number of clusters replicating;
Replication version used;
AAE version used;
Approximate key count;
Approximate mean object size;
Precise version migrating from and to;
Operating system;
Physical configuration of each node (CPU, memory, storage type).

It would be useful to know:

Are the OOM issues on all nodes, or just updated nodes;
If you run riak admin top (3.0) or riak-admin top (2.9) sorted by memory, which processes are hogging memory?
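By way of a sketch, the sort of invocation I mean (the -sort and -lines flags follow the usual etop-style options, if memory serves):

    riak admin top -sort memory -lines 20     # 3.0
    riak-admin top -sort memory -lines 20     # 2.9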

Here you go! I got most of this info from our DevOps; let me know if there's more I can get.

How many nodes; 5
Ring size; 128
Storage backend; multi
Number of clusters replicating; 5-6
Replication version used; not sure
AAE version used; not sure
Approximate key count; Not sure how to get this either, but maybe half a billion or more; we do around 100k puts daily
Approximate mean object size; Not sure how to get this either, but if I had to guess I'd say mostly under 1KB, except one bucket which is full of 300KB blobs
Precise version migrating from and to; 2.9 -> 3.0.10
Operating system; Debian 9 on 2.9, Debian 10 on 3.0.10
Physical configuration of each node (CPU, memory, storage type);
16 CPU, 72 GB RAM, 5 TB SSD data disk

Are the OOM issues on all nodes, or just updated nodes; all nodes OOM and crash
If you run riak admin top (3.0) or riak-admin top (2.9) sorted by memory, which processes are hogging memory?
This causes a severe outage, so we did not run these commands and cannot induce it again in order to run them.

This did not happen in a staging cluster cloned from prod (5 nodes), adding 5 new nodes one at a time and removing the old ones one at a time. Same data and specs; the only difference is prod traffic during the crash.

I don't understand this. There's no obvious reason for this behaviour.

The process of adding a node and removing a node is much more expensive than stop/update/start, though I wouldn't immediately expect it to blow up in terms of memory. Is there a reason why you're doing the update this way rather than simply stop/update/start?

There have been problems in Riak with leveled backends and excessive memory use. You can have a leveled backend if you enable tictac_aae, or if you set one of your backends to leveled in the multi backend. Is leveled in play here?
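A rough way to check (file paths assume the Debian packages, and treat the setting names as from memory rather than gospel):

    grep -i tictacaae /etc/riak/riak.conf                            # tictac AAE active implies a leveled AAE store
    grep -i leveled /etc/riak/riak.conf /etc/riak/advanced.config    # leveled named as a backend in the multi backend config, if present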

From our DevOps:

I do not believe we are using leveled. The reason we're doing the add-a-node approach is mostly that if we stop/upgrade one of our five and the upgrade fails, we've just lost a node and have to take it out of the load balancer, so we would take a performance hit and a possible outage.

We're going to attempt a stop/upgrade/start in a test cluster though!

This is now fixed! Thanks for the support. We did a hybrid approach where we took the following steps and were successful:

  1. remove the old Riak node being replaced from the load balancer
  2. spin up a new Debian 10 node with the new Riak
  3. join the cluster (staged)
  4. run the replace command from the current old node to the new one
  5. the replace is now staged
  6. run commit so the old node transfers directly to the new one, while both are out of the load balancer
  7. once done, add the new node to the LB and turn off the old one

Since it was a 1:1 transfer, this seems to be what prevented the OOM.
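For anyone following the same route, the staged join/replace/commit sequence above maps roughly onto these cluster commands (node names are placeholders; on the 2.9 nodes it's riak-admin rather than riak admin):

    riak admin cluster join riak@existing-node                 # run on the new node: stage joining the cluster
    riak admin cluster replace riak@old-node riak@new-node     # stage a 1:1 replace of the old node by the new one
    riak admin cluster plan                                    # review the staged changes
    riak admin cluster commit                                  # commit; the old node hands off its data directly to the new one
    riak admin transfers                                       # watch handoff progress until it completes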

The two other approaches we tried:

  • adding the new node to the cluster while in the LB: failed 100% of the time
  • adding the new node to the cluster while out of the LB: failed for us 50% of the time (1 worked, 1 did not)