StarpTech/k-andy

storage volumes / cluster

codeagencybe opened this issue · 12 comments

Hello

What is the best approach to have a highly available storage layer in this setup?
With Docker setups, I typically add extra volumes manually, which allows me to detach and re-attach them to other nodes in case of issues.
But with Kubernetes it seems that those volumes cannot be attached to multiple nodes at the same time.

I'm looking into Rancher Longhorn, but do you use the node's default disk for this or an extra attached volume? Or do you create a separate cluster with e.g. NFS, Ceph, Rook, ...?

I'm kind of lost in the enormous amount of options for this. What is the best solution to use in combination with this automated script setup?

Thanks

toabi commented

That's indeed a good question.

I tried Longhorn. It theoretically works, but I really wanted nodes which can be replaced without thinking about anything, so I didn't end up using it. As far as I know, Longhorn can only use node disks directly and not PVCs, so rotating a node means losing a Longhorn volume. Shouldn't be an issue if you have enough replicas configured…

I am running a rook-ceph in the same cluster which provides ReadWriteOnce volumes and S3 compatible buckets.
It works, but the performance is something I think is a bit limiting. I didn't dig deep yet why, but it's around 10x slower than my comparison on AWS. But it also costs me >10x less. At least it's very robust and I never lost anything on it. But it adds some complexity to the setup.

@toabi
Well, I'm perfectly fine with having more replicas for storage. If Longhorn can do that with e.g. a minimum of 3 nodes, that should in theory solve my problem then?
I don't mind the cost for that, as stability and performance are a higher priority for me than the additional cost.
The only problem is (I think) that if I run some workloads that consume a lot of storage, such as Nextcloud, the node disk might not be sufficient.
As the maximum is 360GB per server, there is no option for expanding that one.
The cpu/ram resources are totally fine, but it will run into problems as I can't expand the node disk.
Since the replicated volumes should be equal in size, if 1 node disk hits 100%, then all others will hit the limit too.
Adding more nodes won't solve the problem either, I guess, since it doesn't "stack" the node disks into a bigger pool.
Unless I'm missing something and it can do this?
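
A rough way to reason about this capacity question, as a minimal sketch only. It assumes that every replica is a full copy of the volume and that the replicas of one volume land on different nodes, and it ignores over-provisioning and reserved space. The function names are made up for illustration and are not part of Longhorn.

```python
def max_single_volume_gb(node_disks_gb, replicas):
    """Largest single volume that fits, under the assumed full-copy model."""
    if replicas > len(node_disks_gb):
        raise ValueError("need at least one node per replica")
    # The volume must fit on each of the `replicas` largest disks,
    # so the smallest of those disks is the cap.
    return sorted(node_disks_gb, reverse=True)[replicas - 1]


def total_usable_gb(node_disks_gb, replicas):
    """Upper bound on usable space summed over all volumes."""
    return sum(node_disks_gb) / replicas


three_nodes = [360, 360, 360]
six_nodes = [360] * 6

for disks in (three_nodes, six_nodes):
    print(
        len(disks), "nodes:",
        "single volume <=", max_single_volume_gb(disks, 3), "GB,",
        "all volumes <=", total_usable_gb(disks, 3), "GB",
    )
```

Under these assumptions, extra nodes raise the total space available across many volumes, but a single volume stays capped by the size of one node disk.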

I also found this in their docs as experimental/beta feature:
https://longhorn.io/docs/1.2.4/advanced-resources/rwx-workloads/

I also like the fact that Longhorn has a full backup solution built in, with support for S3 compatible buckets, disaster backup/recovery, etc.

Perhaps I can re-architect some applications by offloading assets to an S3 bucket; then I can worry less about the node disks. I think Nextcloud and other applications have options for changing the primary storage type to something S3 compatible.

Using S3 buckets as primary storage is not a great idea for most applications. I would at least never run a mysql/postgresql database on an S3 bucket. Although it could work in theory, the performance is probably very bad.
I read about NFS, which is as old as the streets and reliable, but some people report poor performance so I'm not sure about that one.
Rook with Ceph is often called a good performance combo. But you say it's also not the best performance?
I have already read a lot about OpenEBS and they claim to be very fast, but the setup seems way more complex.
They support all kinds of backends, such as ZFS, for creating PVs.

I could also deploy a MinIO workload and use that to handle storage distribution and run a local S3 platform for eg Wordpress workloads.

Getting the storage right is really tough and complex. I want fault-tolerant, preferably HA, storage for my workloads. I don't want to get headaches each time a node goes down and have to start troubleshooting why an application is not getting back online because its filestore/volume got lost in limbo.

Anybody who can share a real world use case and experiences would be extremely helpful to make some proper decisions

I also read in an older article by Vito Botta about simply using the Hetzner CSI driver with their HCLOUD volumes, which are already replicated and highly available by design.
But I think these are RWO and not RWX.
https://github.com/vitobotta/hetzner-k3s

How is something like this used in a cluster if one wants to run eg a workload like Wordpress with 3 replicas for high availability?
3 nodes, each with a CSI hcloud volume, replicating the data between them? Or if a node goes down, does it automatically reschedule that volume to another node?
I don't find any useful information on the use case for this.
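
For reference, a claim against the Hetzner CSI driver could look roughly like the sketch below, written as a small Python script that prints a JSON manifest (`kubectl apply -f` accepts JSON as well as YAML). The storage class name `hcloud-volumes` is an assumption about how the driver is installed, and the claim name `wordpress-data` and the 10Gi size are placeholders.

```python
import json


def hcloud_pvc(name, size_gi, storage_class="hcloud-volumes"):
    """Build a PersistentVolumeClaim manifest as a plain dict.

    `storage_class` is assumed; check `kubectl get storageclass`
    for the name actually present in the cluster.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteOnce: read-write mountable by a single node at a time
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": f"{size_gi}Gi"}},
        },
    }


manifest = hcloud_pvc("wordpress-data", 10)
print(json.dumps(manifest, indent=2))
```

Because the access mode is ReadWriteOnce, three Wordpress pods spread over three nodes cannot all mount this one claim read-write; each pod would need its own claim, or the shared files would have to live somewhere else.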

gc-ss commented

I want a fault-tolerant, preferably HA, storage for my workloads

This is the golden chalice everyone's looking for.

What you might not be considering is that low latency, HA PVs are extremely expensive.

It might, depending on your specific situation, be cheaper/efficient to use your DB's replication for DB and a delta replication of critical volumes unless you were ready to do Ceph HA

gc-ss commented

but the performance is something I think is a bit limiting. I didn't dig deep yet why, but it's around 10x slower than my comparison on AWS. But it also costs me >10x less

@toabi Would you be willing to share some specific details?

  1. Was the AWS volume EBS or (NVMe) SSD?
  2. What was the Ceph HA running on? Bare metal? With rook-ceph on another provider?

If Ceph was setup by you on AWS using anything less than one of their bare metal nodes, you would naturally expect a slowdown compared to EBS - but I would like some clarity into what your setup was.

Thank You for sharing!

"Expensive" is relative to everybody's personal project or business.
For hobby projects, I can agree everybody wants to keep the cost down as much as possible.
As a company, I care more about stability and fault-tolerance/high availability. Sure the cost will be higher, but it also means less cost going into support to handle issues from clients complaining about downtime or other problems.

About low latency, yes absolutely! Especially for databases.
But I think the attached volumes from Hetzner are pretty good in terms of performance. I have used them for several years already with simple docker-compose stacks to run mysql/pgsql and Wordpress applications directly from them.
There's barely any difference in performance from the native disk. But it gives me the opportunity to quickly swap VM's in cases of issues, re-attach the volume, deploy the stack again and done.

This is the part that I'm trying to find what would be the optimal use case in a real practice with k3s and Hetzner.
The price is not relevant at this point, that's up to me to decide if I want to invest that or not.

toabi commented

@toabi Would you be willing to share some specific details?

  1. Was the AWS volume EBS or (NVMe) SSD?
  2. What was the Ceph HA running on? Bare metal? With rook-ceph on another provider?

If Ceph was setup by you on AWS using anything less than one of their bare metal nodes, you would naturally expect a slowdown compared to EBS - but I would like some clarity into what your setup was.

Thank You for sharing!

It runs on m5.xlarge worker nodes with one ec-instance per OSD using just normal gp2 PVCs. We didn't do any special tuning, it's mostly the default deployment from the rook-ceph examples.

gc-ss commented

m5.xlarge worker nodes with one ec-instance per OSD using just normal gp2 PVC

toabi, Thank You - understood. This would be similar to running Ceph on Ceph. I understand that AWS Bare metal is very expensive but the i3 instances with local SSD to give to Ceph would be a better comparison. Even provisioned IOPS for gp2 (and gp3 recommended here) wouldn't come close to local SSD, in case you were to run these tests again.

@codeagencybe:

This is the part that I'm trying to find what would be the optimal use case in a real practice with k3s and Hetzner.
The price is not relevant at this point, that's up to me to decide if I want to invest that or not.

I might not have conveyed my point clearly - most people don't consider hosting a DB on a replicated volume (eg Ceph RBD) because the price-performance ratio pales in comparison to using the DB's built in replication.

Similarly, most people don't consider replicated volume (eg Ceph RBD) for their application storage if delta replication can be done at the application level (eg: use Litestream or LiftBridge/NATS).

If the DB's built-in replication or application-level delta replication (or both) would be impractical, or general performance is acceptable, or fault tolerance/high availability is at a premium over performance, replicated volumes make a lot of sense.

I am more than happy to commission and carry out a study for you, please reach out.

There's barely any difference in performance from the native disk. But it gives me the opportunity to quickly swap VM's in cases of issues, re-attach the volume, deploy the stack again and done

Just want to clarify that:

  1. hcloud volumes are currently bound to the location they were created in and cannot migrate over the location boundary
  2. hcloud volumes are DC local Ceph RBDs. If there's a DC outage, you will lose your volumes
  3. There is absolutely a large difference in performance between an hcloud volume and a native NVMe SSD. You can test setting up a root server vs hcloud volume to compare. Again, more than happy to commission and carry out a study for you, please reach out, just want to ensure you have the right data to make the right decisions.

gc-ss commented

I also encourage us to take a look at https://github.com/syself/cluster-api-provider-hetzner

@gc-ss

There's barely any difference in performance from the native disk. But it gives me the opportunity to quickly swap VM's in cases of issues, re-attach the volume, deploy the stack again and done

Just want to clarify that:

  1. hcloud volumes are currently bound to the location they were created in and cannot migrate over the location boundary
  2. hcloud volumes are DC local Ceph RBDs. If there's a DC outage, you will lose your volumes
  3. There is absolutely a large difference in performance between an hcloud volume and a native NVMe SSD. You can test setting up a root server vs hcloud volume to compare. Again, more than happy to commission and carry out a study for you, please reach out, just want to ensure you have the right data to make the right decisions.

I'm fully aware that in case of a DC outage, everything is lost. But in all reality, what are the chances something like that would happen? That's a very extreme situation and I would count that as a "force majeure".
In such cases, I would fall back to external backups to get back up and running.

My use case would be to run simple applications/workloads like Wordpress, Magento, Nextcloud, etc. that can run "highly available / fault tolerant" in a "normal" situation where it's only a random node failure.
In those cases, it's easy to re-attach the volume.
For this use case, there really is barely any difference between running it from the server disk and from a volume disk. I have been doing this for many years. It runs butter smooth.
Sure, there might be some performance difference, but in real-world use cases and metrics it's all perfect.

I am looking for a real use case where I can use Hetzner with K3s and Rancher to run these applications as "HA/fault tolerant" in my context. If it can survive even a full DC failure, even better. But I don't count on that.
Because the single point of failure already starts with the floating IP / load balancer. It's typically bound to 1 DC also, so if the DC goes down, poof, so does the single point of entry.

In such a case, it would be even more convenient to just run the cluster script again to generate new infrastructure in another DC, restore the DR backup, and we are back up and running.

gc-ss commented

My use case would be to run simple applications/workloads like Wordpress, Magento, Nextcloud, etc.

Ah! I see

Because the single point of failure already starts with the floating IP / load balancer. It's typically bound to 1 DC also, so if the DC goes down, poof, so does the single point of entry.

There are providers that have hybrid cloud LBs you can use. Scaleway works very well with Hetzner and they even have guides.

I'm fully aware that in case of a DC outage, everything is lost. But in all reality, what are the chances something like that would happen? That's a very extreme situation and I would count that as a "force majeure"

More frequently than you would think.

https://news.ycombinator.com/item?id=26407323

Also, there are multiple surfaces for outages.

What happens more often at Hetzner are networking issues, and those are hard to reproduce. As a result, you might have a lot of networking issues while I won't notice a thing:

https://www.reddit.com/r/hetzner/comments/tva8t1/frequent_outages_with_hetzner_should_i_be_worried/

https://lowendtalk.com/discussion/169439/hetzner-hel1-dc4-26-switches-down-at-once

This is why Hetzner does not provide an SLA - if it works for you, great. If not, it's on you, not Hetzner, to figure it out. I have figured out a few ways around networking issues but that requires using both their root boxes on 10GbE and VMs.

I am looking for a real use case where I can use Hetzner with K3s and Rancher to run these applications as "HA/fault tolerant" in my context

Well, in this case, what's your concern with using hcloud volumes to back your PVs with cron/scheduled and checkpointed snapshotting?

gc-ss commented

There's barely any difference in performance from the native disk. But it gives me the opportunity to quickly swap VM's in cases of issues, re-attach the volume, deploy the stack again and done.

I didn't have time to verify this assertion earlier but got to it today. I am unable to reproduce it, @codeagencybe.

Here's what I see on the local instance SSD:

  fio Disk Speed Tests (Mixed R/W 50/50):
  ---------------------------------
  Block Size | 4k            (IOPS) | 64k           (IOPS)
    ------   | ---            ----  | ----           ----
  Read       | 131.67 MB/s  (32.9k) | 1.26 GB/s    (19.8k)
  Write      | 132.02 MB/s  (33.0k) | 1.27 GB/s    (19.9k)
  Total      | 263.69 MB/s  (65.9k) | 2.54 GB/s    (39.7k)
             |                      |
  Block Size | 512k          (IOPS) | 1m            (IOPS)
    ------   | ---            ----  | ----           ----
  Read       | 2.04 GB/s     (3.9k) | 2.26 GB/s     (2.2k)
  Write      | 2.15 GB/s     (4.2k) | 2.41 GB/s     (2.3k)
  Total      | 4.19 GB/s     (8.1k) | 4.68 GB/s     (4.5k)

Here's what I see on the instance attached volume:

  fio Disk Speed Tests (Mixed R/W 50/50):
  ---------------------------------
  Block Size | 4k            (IOPS) | 64k           (IOPS)
    ------   | ---            ----  | ----           ---- 
  Read       | 14.37 MB/s    (3.5k) | 111.90 MB/s   (1.7k)
  Write      | 14.37 MB/s    (3.5k) | 112.49 MB/s   (1.7k)
  Total      | 28.74 MB/s    (7.1k) | 224.40 MB/s   (3.5k)
             |                      |                     
  Block Size | 512k          (IOPS) | 1m            (IOPS)
    ------   | ---            ----  | ----           ---- 
  Read       | 296.02 MB/s    (578) | 291.98 MB/s    (285)
  Write      | 311.75 MB/s    (608) | 311.42 MB/s    (304)
  Total      | 607.78 MB/s   (1.1k) | 603.40 MB/s    (589)

The 10x degradation mirrors my intuition (hcloud volumes are Ceph RBDs after all) and matches what @toabi tested

I am curious how you tested

In summary: While hcloud volumes are 2x-5x faster than un-provisioned AWS gp3 EBS, they are still 10x "slower" than the local instance SSD. I will continue to recommend not hosting a DB on a replicated volume (Ceph RBD) because the price-performance ratio pales in comparison to using the DB's built in replication running off local instance SSD.
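
The "10x" figure can be checked against the two fio tables above. The sketch below only divides the "Total" throughput rows; the conversion of GB/s to MB/s with a factor of 1000 is an assumption about how the benchmark script reports units.

```python
# "Total" rows copied from the two fio tables above, in MB/s.
# GB/s values are converted assuming 1 GB/s = 1000 MB/s.
local_ssd = {"4k": 263.69, "64k": 2540.0, "512k": 4190.0, "1m": 4680.0}
hcloud_volume = {"4k": 28.74, "64k": 224.40, "512k": 607.78, "1m": 603.40}


def slowdown(local, volume):
    """Ratio of local SSD throughput to attached-volume throughput per block size."""
    return {block_size: local[block_size] / volume[block_size] for block_size in local}


ratios = slowdown(local_ssd, hcloud_volume)
for block_size, ratio in ratios.items():
    print(f"{block_size:>4}: {ratio:.1f}x")
```

Under that conversion the ratios land roughly between 7x and 11x depending on block size, so "10x" is a fair round figure, with the larger gaps at the small block sizes that matter most for databases.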