hauler-dev/hauler

[BUG] Hauler Performance Issues

Closed this issue · 18 comments

Environmental Info:

root@hauler:~# uname -a
Linux hauler 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Hauler Version:

  • v1.0.7

Describe the Bug:

  • This isn't so much a bug as a case for tracking performance issues with Hauler or its underlying dependencies.

Steps to Reproduce:

  • Hauler runs faster on some VMs than on others
  • vSphere Lab (4 CPU x 4 GB RAM) finishes a full Rancher product sync (hauler store sync --product Rancher --version v2.8.5) in 4 hours
  • Proxmox Lab (4 CPU x 4 GB RAM) finishes the same full Rancher product sync (hauler store sync --product Rancher --version v2.8.5) in 8+ hours

Expected Behavior:

  • Hauler should perform consistently well across all platforms and operating systems. I don't yet know what makes it slower on some than on others.

Actual Behavior:

  • Hauler sync times vary widely across test and customer infrastructure.

Additional Context:

  • This has been reported by many customers and engineers internal to the company.
  • This also seems to NOT be reproducible by everyone. Some engineers report that their Hauler is always quick to pull down images (the entire Rancher store in less than 2 hours). So there may be an actual bug here that hits some systems but not others.

We should use this issue to track all possible performance issues with Hauler, with full environment specs and details, so we can narrow down exactly what's happening.

MS-01
harvester 1.3.1
8core x 16gb

----- without product -----
hauler:

real	3m59.072s
user	0m25.970s
sys	0m11.194s
skopeo:

real	4m54.921s
user	2m15.874s
sys	0m23.004s

using https://gist.github.com/clemenko/11edaa5f5c84c2f5f603257dcff6787d
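
The timings in this thread are pasted as raw `time` output, which is awkward to compare across machines. A minimal Python sketch for converting those lines to seconds, assuming bash's default `NmS.SSSs` format; `parse_time_line` and `parse_time_block` are hypothetical helper names and are not part of Hauler or the test script:

```python
import re

# Matches bash `time` lines such as "real    121m1.451s" (tabs or spaces).
_TIME_LINE = re.compile(r"^(real|user|sys)\s+(\d+)m(\d+(?:\.\d+)?)s$")


def parse_time_line(line):
    """Return (label, seconds) for one line of bash `time` output, or None."""
    match = _TIME_LINE.match(line.strip())
    if match is None:
        return None
    label, minutes, seconds = match.groups()
    return label, int(minutes) * 60 + float(seconds)


def parse_time_block(text):
    """Collect the real/user/sys values from a pasted block into a dict."""
    parsed = (parse_time_line(line) for line in text.splitlines())
    return dict(item for item in parsed if item is not None)


# The "without product" hauler run from the MS-01 / Harvester comment.
sample = "hauler:\n\nreal\t3m59.072s\nuser\t0m25.970s\nsys\t0m11.194s"
print(parse_time_block(sample))
```

Lines that are not timings (such as `hauler:`) are skipped, so a whole comment can be pasted in unchanged.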

vSphere Lab 4core X 4GB RAM

root@hauler:~# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
root@hauler:~# uname -a
Linux hauler 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

root@hauler:~# ./andys-test.sh
----- without product -----
hauler:

real    2m52.677s
user    1m24.988s
sys     0m36.545s
skopeo:

real    18m56.610s
user    11m55.358s
sys     6m17.364s
----- with product -----
hauler: with product without key

real    121m1.451s
user    31m56.568s
sys     13m33.563s
hauler: with product with key

real    99m30.211s
user    32m44.627s
sys     13m49.558s

I don't see the issue here. Also, based on the test above, key validation is actually faster.

Proxmox

4CPU/4GB
Ubuntu


with product with key
real 122m24.597s
user 74m31.084s
sys 10m4.424s
root@HaulerDev:~# time hauler store save -f withkey.tar.zst
2024-08-23 03:23:28 INF saved store [store] -> [/root/withkey.tar.zst]

real 3m55.884s
user 2m54.196s
sys 3m27.137s


Proxmox 4x8

root@newhauler-internal:~# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
root@newhauler-internal:~# uname -a
Linux newhauler-internal 5.15.0-119-generic #129-Ubuntu SMP Fri Aug 2 19:25:20 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
root@newhauler-internal:~# ./andys-test.sh
----- without product -----
hauler:

real    5m8.284s
user    1m35.108s
sys     0m28.597s
skopeo:

real    9m27.687s
user    5m13.061s
sys     1m0.094s
----- with product -----
2024-08-22 14:48:50 INF auth.go:274: logged in via /root/.docker/config.json
hauler: with product without key

real    572m59.959s
user    20m0.198s
sys     7m38.431s
hauler: with product with key

real    617m1.886s
user    18m10.455s
sys     6m20.971s
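
One way to read these numbers is to compare CPU time with wall-clock time: (user + sys) / real gives a rough fraction of the run spent on CPU. The sketch below applies that to the two "with product without key" runs reported above. Treating a low ratio as a sign that the run was mostly waiting (on network or disk) is an interpretation, not something established in this thread; the helper names are hypothetical.

```python
def to_seconds(minutes, seconds):
    """Convert the NmS.SSSs pieces of a `time` line to seconds."""
    return minutes * 60 + seconds


def cpu_fraction(real_s, user_s, sys_s):
    """(user + sys) / real; can exceed 1.0 when several cores are busy."""
    return (user_s + sys_s) / real_s


# Values copied from the "with product without key" runs in this thread.
runs = {
    "vSphere 4x4": (to_seconds(121, 1.451), to_seconds(31, 56.568), to_seconds(13, 33.563)),
    "Proxmox 4x8": (to_seconds(572, 59.959), to_seconds(20, 0.198), to_seconds(7, 38.431)),
}

for name, (real_s, user_s, sys_s) in runs.items():
    print(f"{name}: {cpu_fraction(real_s, user_s, sys_s):.1%} of wall-clock time on CPU")
```

The slower run does not use more CPU time in total; it only takes far longer in wall-clock terms, which is consistent with (but does not prove) a bottleneck outside the process.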

System Overview:

[ec2-user@ip-172-31-91-194 ~]$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2023"
ID="amzn"
ID_LIKE="fedora"
VERSION_ID="2023"
PLATFORM_ID="platform:al2023"
PRETTY_NAME="Amazon Linux 2023.5.20240624"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/"
DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/"
SUPPORT_URL="https://aws.amazon.com/premiumsupport/"
BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
VENDOR_NAME="AWS"
VENDOR_URL="https://aws.amazon.com/"
SUPPORT_END="2028-03-15"

---------

[ec2-user@ip-172-31-91-194 ~]$ uname -a
Linux ip-172-31-91-194.ec2.internal 6.1.94-99.176.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Jun 18 14:57:56 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

---------

docker.io

[ec2-user@ip-172-31-91-194 ~]$ time hauler store sync -f airgap_hauler.yaml

real    3m35.088s
user    1m2.236s
sys     0m12.466s

TOTAL   |  8.5 GB

---------

rgcrprod.azurecr.us

[ec2-user@ip-172-31-91-194 ~]$ time hauler store sync -f carbide.yaml -s carbide-store

real    4m23.001s
user    1m37.314s
sys     0m14.929s

TOTAL   |  8.7 GB

---------

rgcrprod.azurecr.us with carbide-key.pub

[ec2-user@ip-172-31-91-194 ~]$ time hauler store sync -f carbide-key.yaml -s carbide-key-store

real    4m29.187s
user    1m50.926s
sys     0m17.000s

TOTAL   |  8.7 GB
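
For a rough comparison between registries, the TOTAL size and the `real` time can be turned into an average throughput. This is a sketch under two assumptions that the thread does not confirm: that "GB" means 10^9 bytes, and that the TOTAL reported for the store approximates the bytes actually transferred. `throughput_mbit_per_s` is a hypothetical helper.

```python
def throughput_mbit_per_s(total_gb, real_seconds):
    """Average throughput in Mbit/s, treating 1 GB as 10**9 bytes (assumption)."""
    return total_gb * 1e9 * 8 / real_seconds / 1e6


# (TOTAL in GB, real time in seconds) from the Amazon Linux runs above.
runs = {
    "docker.io": (8.5, 3 * 60 + 35.088),
    "rgcrprod.azurecr.us": (8.7, 4 * 60 + 23.001),
    "rgcrprod.azurecr.us with carbide-key.pub": (8.7, 4 * 60 + 29.187),
}

for name, (total_gb, real_seconds) in runs.items():
    print(f"{name}: {throughput_mbit_per_s(total_gb, real_seconds):.0f} Mbit/s")
```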

System Overview:

[azureuser@hauler-testing ~]$ cat /etc/os-release
NAME="Rocky Linux"
VERSION="9.3 (Blue Onyx)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.3"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Rocky Linux 9.3 (Blue Onyx)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:9::baseos"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2032-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-9"
ROCKY_SUPPORT_PRODUCT_VERSION="9.3"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.3"

---------

[azureuser@hauler-testing ~]$ uname -a
Linux hauler-testing 5.14.0-362.8.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Nov 8 17:36:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

---------

docker.io

[azureuser@hauler-testing ~]$ time hauler store sync -f airgap_hauler.yaml

real    2m16.121s
user    1m8.562s
sys     0m16.214s

TOTAL   |  8.5 GB

---------

rgcrprod.azurecr.us

[azureuser@hauler-testing ~]$ time hauler store sync -f carbide.yaml -s carbide-store

real    3m17.944s
user    1m41.441s
sys     0m17.702s

TOTAL   |  8.7 GB

---------

rgcrprod.azurecr.us with carbide-key.pub

[azureuser@hauler-testing ~]$ time hauler store sync -f carbide-key.yaml -s carbide-key-store

real    4m05.921s
user    1m58.424s
sys     0m21.370s

TOTAL   |  8.7 GB

Results from DigitalOcean comparing a yaml pointing at docker/quay vs azure

https://gist.github.com/clemenko/f1a2389d34c9d69eafb08fe342b790e1

[root@flux hauler]# time hauler store sync  -f airgap_hauler.yaml > /dev/null 2>&1 && time hauler store sync -f carbide.yaml -s carbide > /dev/null 2>&1

real	2m41.285s
user	1m39.046s
sys	0m29.322s

real	5m57.116s
user	2m24.262s
sys	0m27.088s

This is without ANY public key.

System Overview:

[zackbradys@hauler ~]$ cat /etc/os-release
NAME="Rocky Linux"
VERSION="9.4 (Blue Onyx)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.4"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Rocky Linux 9.4 (Blue Onyx)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:9::baseos"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2032-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-9"
ROCKY_SUPPORT_PRODUCT_VERSION="9.4"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.4"

---------

[zackbradys@hauler ~]$ uname -a
Linux hauler 5.14.0-427.20.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Jun 7 14:51:39 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

---------

docker.io

[zackbradys@hauler ~]$ time hauler store sync -f airgap_hauler.yaml

real    3m9.761s
user    0m29.725s
sys     0m22.380s

TOTAL   |  8.5 GB

---------

rgcrprod.azurecr.us

[zackbradys@hauler ~]$ time hauler store sync -f carbide.yaml -s carbide-store

real    6m32.380s
user    0m42.627s
sys     0m25.667s

TOTAL   |  8.7 GB

---------

rgcrprod.azurecr.us with carbide-key.pub

[zackbradys@hauler ~]$ time hauler store sync -f carbide-key.yaml -s carbide-key-store

real    7m24.888s
user    0m51.636s
sys     0m19.589s

TOTAL   |  8.7 GB

[root@azurebadboy hauler]# time hauler store sync  -f airgap_hauler.yaml > /dev/null 2>&1 && time hauler store sync -f carbide.yaml -s carbide > /dev/null 2>&1

real	4m35.613s
user	0m24.322s
sys	0m10.875s

real	42m26.133s
user	0m48.193s
sys	0m24.053s

This is clearly an Azure issue.

Another test from DigitalOcean:

Ubuntu
real	2m32.682s
user	0m43.733s
sys	0m54.039s

real	5m49.913s
user	1m24.020s
sys	0m50.140s

macOS

clembookpro:clemenko hauler $ time hauler store sync  -f airgap_hauler.yaml > /dev/null 2>&1 && time hauler store sync -f carbide.yaml -s carbide > /dev/null 2>&1

real	3m18.432s
user	0m36.609s
sys	0m27.978s

real	17m21.877s
user	1m24.617s
sys	1m24.076s

On Harvester, Ubuntu

root@philhasnolife:/opt/hauler# time hauler store sync  -f airgap_hauler.yaml > /dev/null 2>&1 && time hauler store sync -f carbide.yaml -s carbide > /dev/null 2>&1

real	4m32.614s
user	0m25.818s
sys	0m14.032s

real	38m43.256s
user	0m52.828s
sys	0m31.053s

I was able to track down the slowness on my network. My homelab was stuck at 100 Mb/s because of a bad cable between my core switch and my patch panel. After swapping that cable, my homelab is back to 1 Gb/s, and syncing the full Rancher product into a Hauler store takes around 2 hours.
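
To see how much a link downgrade like this matters, the best-case transfer time is size in bits divided by link speed. A minimal sketch, using the 8.5 GB store size reported elsewhere in this thread as an example (not the size of the full Rancher product, which the thread does not state) and assuming 1 GB = 10^9 bytes with no protocol overhead; `min_transfer_seconds` is a hypothetical helper:

```python
def min_transfer_seconds(size_gb, link_mbit_per_s):
    """Lower bound on transfer time, ignoring protocol overhead and latency."""
    bits = size_gb * 1e9 * 8  # assumes 1 GB = 10**9 bytes
    return bits / (link_mbit_per_s * 1e6)


for link in (100, 1000):
    seconds = min_transfer_seconds(8.5, link)
    print(f"8.5 GB over {link} Mbit/s: at least {seconds / 60:.1f} min")
```

Real syncs take longer than this bound because of registry round trips, signature checks, and disk writes, but the tenfold gap between the two link speeds carries through.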

For those reporting performance issues with Hauler, please use the following test script (for Rocky) and provide the following information. If you're testing this on Ubuntu, adjust the script as needed.

Test Script for Rocky: https://gist.github.com/clemenko/11edaa5f5c84c2f5f603257dcff6787d
Required Info:

  • Platform (bare metal, vSphere, Proxmox, etc)
  • CPU/RAM (4CPU x 4G RAM, etc)
  • Speedtest results
  • Geographic area that you're pulling from (State / City would be amazing to help track down bad routes)

@HoustonDad Here are the results.

----- speed test -----
Ping: 53.103 ms
Download: 23.73 Mbit/s
Upload: 17.08 Mbit/s

----- without product -----
hauler:

real 8m38.141s
user 0m20.453s
sys 0m23.077s

----- with product -----
hauler: with product without key

real 0m10.670s
user 0m0.038s
sys 0m0.076s

hauler: with product with key

real 0m10.683s
user 0m0.098s
sys 0m0.196s

Fedora release 40 (Forty) on Windows 11 Pro
...........................................
CPU(s): 20
On-line CPU(s) list: 0-19
Vendor ID: GenuineIntel
Model name: 12th Gen Intel(R) Core(TM) i9-12900H
CPU family: 6
Model: 154
Thread(s) per core: 2
Core(s) per socket: 10
........................................
RAM:
       total   used    free    shared  buff/cache  available
Mem:   15Gi    574Mi   15Gi    2.4Mi   64Mi        14Gi
...................................
Speed Test (on WiFi): 186 Mbps download, 152.6 Mbps upload; Server: Chicago Google

I edited the manifest for Rancher 2.7.14 and removed 250+ lines, and it still took 9 hours to download. The manifest for 2.7.14 is around 953 lines, correct?

@c-b-r It looks like something in that test failed out:


----- with product -----
hauler: with product without key

real 0m10.670s
user 0m0.038s
sys 0m0.076s

hauler: with product with key

real 0m10.683s
user 0m0.098s
sys 0m0.196s

Both of those ran for only 10 seconds, when they should have taken at least 2 hours. Could you run some of the commands from the script manually to see what failed, fix that, and run the test again?
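
A check like this can be automated when collecting results, so that runs which exit early are not mistaken for fast ones. A small sketch: the 2-hour floor comes from the comment above, while `looks_like_failed_run` is a hypothetical helper and not part of the test script.

```python
def looks_like_failed_run(real_seconds, min_expected_seconds):
    """Flag a run whose wall-clock time is implausibly short."""
    return real_seconds < min_expected_seconds


# A full product sync was expected to take at least 2 hours.
TWO_HOURS = 2 * 60 * 60

# `real` values from the two product syncs that finished in about 10 seconds.
for label, real_seconds in [("without key", 10.670), ("with key", 10.683)]:
    if looks_like_failed_run(real_seconds, TWO_HOURS):
        print(f"{label}: {real_seconds:.1f}s is far below {TWO_HOURS}s; check for an early error")
```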

Thanks!

Test Results from another system:

Platform (bare metal, vSphere, Proxmox, etc)

  • Ubuntu 20.04.6 LTS running in Windows 11 WSL2

CPU/RAM (4CPU x 4G RAM, etc)

  • 12 CPU x 6G RAM

Speedtest results

----- speed test -----
Ping: 13.551 ms
Download: 233.55 Mbit/s
Upload: 65.60 Mbit/s

Geographic area that you're pulling from (State / City would be amazing to help track down bad routes)
Houston, Texas (AT&T UVerse)

➜ ./test.sh
----- without product -----
hauler:

real    20m19.498s
user    0m57.803s
sys     0m34.921s

----- with product -----
hauler: with product without key

real    168m9.536s
user    9m53.409s
sys     3m30.358s

hauler: with product with key

real    199m53.807s
user    10m58.958s
sys     4m11.190s

Omaha, NE. I was talking to you earlier in the day about minifying the manifest file. I'll try to run the script again tonight; I just started it as I left and didn't watch it.