[BUG] Hauler Performance Issues
Closed this issue · 18 comments
Environmental Info:
root@hauler:~# uname -a
Linux hauler 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Hauler Version:
- v1.0.7
Describe the Bug:
- This isn't so much a bug, but a case to start tracking performance issues with hauler or its underlying dependencies.
Steps to Reproduce:
- Hauler runs faster on some VMs than on others
- vSphere Lab 4CPUx4G-RAM finishes a full Rancher product sync (hauler store sync --product Rancher --version v2.8.5) in 4 hours
- Proxmox Lab 4CPUx4G-RAM finishes a full Rancher product sync (hauler store sync --product Rancher --version v2.8.5) in 8+ hours
Expected Behavior:
- Hauler should perform consistently well across all platforms and operating systems. I don't yet know what the 'key' factor is that makes it slower on some than on others.
Actual Behavior:
- Hauler performance varies widely across testing and customer infrastructure
Additional Context:
- This has been reported by many customers and engineers internal to the company.
- This also seems to NOT be reproducible by everyone. Some engineers report that their Hauler is always quick to pull down images (entire Rancher store in less than 2 hours). So there may be an actual bug here that is hitting some systems but not others.
We should use this case to start tracking all possible performance issues with Hauler, with full environment specs and details, to start narrowing down exactly what's happening.
MS-01
harvester 1.3.1
8core x 16gb
----- without product -----
hauler:
real 3m59.072s
user 0m25.970s
sys 0m11.194s
skopeo:
real 4m54.921s
user 2m15.874s
sys 0m23.004s
using https://gist.github.com/clemenko/11edaa5f5c84c2f5f603257dcff6787d
vSphere Lab 4core X 4GB RAM
root@hauler:~# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
root@hauler:~# uname -a
Linux hauler 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
root@hauler:~# ./andys-test.sh
----- without product -----
hauler:
real 2m52.677s
user 1m24.988s
sys 0m36.545s
skopeo:
real 18m56.610s
user 11m55.358s
sys 6m17.364s
----- with product -----
hauler: with product without key
real 121m1.451s
user 31m56.568s
sys 13m33.563s
hauler: with product with key
real 99m30.211s
user 32m44.627s
sys 13m49.558s
I don't see the issue here. Also, based on the test above, key validation is actually faster.
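To make the with-key vs. without-key comparison above easier to read, here is a minimal sketch that converts the `real` values printed by `time` into seconds and reports the difference. The helper name `parse_real` is hypothetical and not part of Hauler or the test script; the two durations are copied from the vSphere run above.

```python
import re

# Matches the "real" format printed by the shell's `time`, e.g. "121m1.451s".
_TIME_RE = re.compile(r"^(\d+)m(\d+(?:\.\d+)?)s$")


def parse_real(value: str) -> float:
    """Convert a `time`-style duration such as '99m30.211s' to seconds."""
    match = _TIME_RE.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Wall-clock values copied from the vSphere "with product" runs above.
without_key = parse_real("121m1.451s")
with_key = parse_real("99m30.211s")

saved = without_key - with_key
print(f"without key: {without_key:.0f} s")
print(f"with key:    {with_key:.0f} s")
print(f"difference:  {saved:.0f} s ({saved / without_key:.1%} shorter with key)")
```

Each configuration was only run once here, so a difference of this size could also come from network or registry variance rather than from key validation itself.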
Proxmox
4CPU/4GB
Ubuntu
with product with key
real 122m24.597s
user 74m31.084s
sys 10m4.424s
root@HaulerDev:~# time hauler store save -f withkey.tar.zst
2024-08-23 03:23:28 INF saved store [store] -> [/root/withkey.tar.zst]
real 3m55.884s
user 2m54.196s
sys 3m27.137s
Proxmox 4x8
root@newhauler-internal:~# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
root@newhauler-internal:~# uname -a
Linux newhauler-internal 5.15.0-119-generic #129-Ubuntu SMP Fri Aug 2 19:25:20 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
root@newhauler-internal:~# ./andys-test.sh
----- without product -----
hauler:
real 5m8.284s
user 1m35.108s
sys 0m28.597s
skopeo:
real 9m27.687s
user 5m13.061s
sys 1m0.094s
----- with product -----
2024-08-22 14:48:50 INF auth.go:274: logged in via /root/.docker/config.json
hauler: with product without key
real 572m59.959s
user 20m0.198s
sys 7m38.431s
hauler: with product with key
real 617m1.886s
user 18m10.455s
sys 6m20.971s
System Overview:
- AWS EC2 Instance (us-east-1) - AL2023 (ami-01b799c439fd5516a) - 8 cores | 16 GB of ram | 1024 GB of GP3 storage
- Related Manifests: https://gist.github.com/clemenko/f1a2389d34c9d69eafb08fe342b790e1
[ec2-user@ip-172-31-91-194 ~]$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2023"
ID="amzn"
ID_LIKE="fedora"
VERSION_ID="2023"
PLATFORM_ID="platform:al2023"
PRETTY_NAME="Amazon Linux 2023.5.20240624"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/"
DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/"
SUPPORT_URL="https://aws.amazon.com/premiumsupport/"
BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
VENDOR_NAME="AWS"
VENDOR_URL="https://aws.amazon.com/"
SUPPORT_END="2028-03-15"
---------
[ec2-user@ip-172-31-91-194 ~]$ uname -a
Linux ip-172-31-91-194.ec2.internal 6.1.94-99.176.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Jun 18 14:57:56 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
---------
docker.io
[ec2-user@ip-172-31-91-194 ~]$ time hauler store sync -f airgap_hauler.yaml
real 3m35.088s
user 1m2.236s
sys 0m12.466s
TOTAL | 8.5 GB
---------
rgcrprod.azurecr.us
[ec2-user@ip-172-31-91-194 ~]$ time hauler store sync -f carbide.yaml -s carbide-store
real 4m23.001s
user 1m37.314s
sys 0m14.929s
TOTAL | 8.7 GB
---------
rgcrprod.azurecr.us with carbide-key.pub
[ec2-user@ip-172-31-91-194 ~]$ time hauler store sync -f carbide-key.yaml -s carbide-key-store
real 4m29.187s
user 1m50.926s
sys 0m17.000s
TOTAL | 8.7 GB
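For comparing registries, the TOTAL and `real` values above can be turned into an average throughput. This is a rough sketch only: it assumes 1 GB = 10^9 bytes (the unit behind Hauler's TOTAL line is not confirmed here), and the function name `throughput_mbit_s` is made up for illustration.

```python
def throughput_mbit_s(total_gb: float, minutes: int, seconds: float) -> float:
    """Average throughput in Mbit/s, assuming 1 GB = 10**9 bytes."""
    elapsed = minutes * 60 + seconds
    return total_gb * 1e9 * 8 / elapsed / 1e6


# (label, TOTAL in GB, real minutes, real seconds) copied from the AWS runs above.
runs = [
    ("docker.io", 8.5, 3, 35.088),
    ("rgcrprod.azurecr.us", 8.7, 4, 23.001),
    ("rgcrprod.azurecr.us with carbide-key.pub", 8.7, 4, 29.187),
]

for label, gb, mins, secs in runs:
    print(f"{label}: {throughput_mbit_s(gb, mins, secs):.0f} Mbit/s")
```

The wall-clock time includes everything the sync does, not just the download, so these figures are a lower estimate of the network rate actually achieved.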
System Overview:
- Azure Virtual Machine (Zone 1) - ROCKY 9.3 (rockylinux-x86_64-9-base) - 8 cores | 16 GB of ram | 1024 GB of SSD storage
- Related Manifests: https://gist.github.com/clemenko/f1a2389d34c9d69eafb08fe342b790e1
[azureuser@hauler-testing ~]$ cat /etc/os-release
NAME="Rocky Linux"
VERSION="9.3 (Blue Onyx)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.3"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Rocky Linux 9.3 (Blue Onyx)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:9::baseos"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2032-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-9"
ROCKY_SUPPORT_PRODUCT_VERSION="9.3"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.3"
---------
[azureuser@hauler-testing ~]$ uname -a
Linux hauler-testing 5.14.0-362.8.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Nov 8 17:36:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
---------
docker.io
[azureuser@hauler-testing ~]$ time hauler store sync -f airgap_hauler.yaml
real 2m16.121s
user 1m8.562s
sys 0m16.214s
TOTAL | 8.5 GB
---------
rgcrprod.azurecr.us
[azureuser@hauler-testing ~]$ time hauler store sync -f carbide.yaml -s carbide-store
real 3m17.944s
user 1m41.441s
sys 0m17.702s
TOTAL | 8.7 GB
---------
rgcrprod.azurecr.us with carbide-key.pub
[azureuser@hauler-testing ~]$ time hauler store sync -f carbide-key.yaml -s carbide-key-store
real 4m05.921s
user 1m58.424s
sys 0m21.370s
TOTAL | 8.7 GB
Results from DigitalOcean comparing a yaml pointing at docker/quay vs azure
https://gist.github.com/clemenko/f1a2389d34c9d69eafb08fe342b790e1
[root@flux hauler]# time hauler store sync -f airgap_hauler.yaml > /dev/null 2>&1 && time hauler store sync -f carbide.yaml -s carbide > /dev/null 2>&1
real 2m41.285s
user 1m39.046s
sys 0m29.322s
real 5m57.116s
user 2m24.262s
sys 0m27.088s
This is without ANY public key.
System Overview:
- Harvester v1.3.1 - ROCKY 9.4 (qcow2) - 8 cores | 16 GB of ram | 256 GB of SSD storage
- Related Manifests: https://gist.github.com/clemenko/f1a2389d34c9d69eafb08fe342b790e1
[zackbradys@hauler ~]$ cat /etc/os-release
NAME="Rocky Linux"
VERSION="9.4 (Blue Onyx)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.4"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Rocky Linux 9.4 (Blue Onyx)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:9::baseos"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2032-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-9"
ROCKY_SUPPORT_PRODUCT_VERSION="9.4"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.4"
---------
[zackbradys@hauler ~]$ uname -a
Linux hauler 5.14.0-427.20.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Jun 7 14:51:39 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
---------
docker.io
[zackbradys@hauler ~]$ time hauler store sync -f airgap_hauler.yaml
real 3m9.761s
user 0m29.725s
sys 0m22.380s
TOTAL | 8.5 GB
---------
rgcrprod.azurecr.us
[zackbradys@hauler ~]$ time hauler store sync -f carbide.yaml -s carbide-store
real 6m32.380s
user 0m42.627s
sys 0m25.667s
TOTAL | 8.7 GB
---------
rgcrprod.azurecr.us with carbide-key.pub
[zackbradys@hauler ~]$ time hauler store sync -f carbide-key.yaml -s carbide-key-store
real 7m24.888s
user 0m51.636s
sys 0m19.589s
TOTAL | 8.7 GB
[root@azurebadboy hauler]# time hauler store sync -f airgap_hauler.yaml > /dev/null 2>&1 && time hauler store sync -f carbide.yaml -s carbide > /dev/null 2>&1
real 4m35.613s
user 0m24.322s
sys 0m10.875s
real 42m26.133s
user 0m48.193s
sys 0m24.053s
This is clearly an Azure issue.
another test from Digital Ocean
Ubuntu
real 2m32.682s
user 0m43.733s
sys 0m54.039s
real 5m49.913s
user 1m24.020s
sys 0m50.140s
macos
clembookpro:clemenko hauler $ time hauler store sync -f airgap_hauler.yaml > /dev/null 2>&1 && time hauler store sync -f carbide.yaml -s carbide > /dev/null 2>&1
real 3m18.432s
user 0m36.609s
sys 0m27.978s
real 17m21.877s
user 1m24.617s
sys 1m24.076s
on harvester, ubuntu
root@philhasnolife:/opt/hauler# time hauler store sync -f airgap_hauler.yaml > /dev/null 2>&1 && time hauler store sync -f carbide.yaml -s carbide > /dev/null 2>&1
real 4m32.614s
user 0m25.818s
sys 0m14.032s
real 38m43.256s
user 0m52.828s
sys 0m31.053s
https://learn.microsoft.com/en-us/azure/container-registry/container-registry-skus What tier are we using?
I was able to track down the slowness on my network. My homelab was stuck on 100Mb due to a bad cable from my core switch to my patch panel. After swapping that cable, my homelab is back to 1G and my Hauler store syncs of the full Rancher product take around 2 hours.
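As a quick way to see how much the link speed alone matters, here is a back-of-the-envelope sketch of the best-case transfer time at 100 Mbit/s vs. 1 Gbit/s. The 8.5 GB payload is only illustrative (it is the TOTAL reported for airgap_hauler.yaml earlier in this thread, not the size of the full Rancher product), and `min_transfer_seconds` is a hypothetical helper.

```python
def min_transfer_seconds(size_gb: float, link_mbit_s: float) -> float:
    """Best-case transfer time, ignoring protocol overhead and registry limits."""
    return size_gb * 1e9 * 8 / (link_mbit_s * 1e6)


# Illustrative payload: the TOTAL reported for airgap_hauler.yaml in this thread.
SIZE_GB = 8.5

for link in (100, 1000):
    secs = min_transfer_seconds(SIZE_GB, link)
    print(f"{link:>4} Mbit/s: at least {secs / 60:.1f} min for {SIZE_GB} GB")
```

Real syncs will take longer than this floor, but a link negotiated at a tenth of the expected speed raises the floor tenfold, which is why a speedtest result is worth including in reports.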
For those reporting performance issues with Hauler, please use the following test script (for Rocky) and provide the following information. If you're testing this on Ubuntu, adjust the script as needed.
Test Script for Rocky: https://gist.github.com/clemenko/11edaa5f5c84c2f5f603257dcff6787d
Required Info:
- Platform (bare metal, vSphere, Proxmox, etc)
- CPU/RAM (4CPU x 4G RAM, etc)
- Speedtest results
- Geographic area that you're pulling from (State / City would be amazing to help track down bad routes)
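To make reports easier to compare, part of the required info above could be gathered automatically. The sketch below is not the linked test script; it is a hypothetical helper that uses only the Python standard library, and the platform name and location still have to be supplied by hand. Speedtest results are not collected here.

```python
import json
import os
import platform


def total_ram_gib():
    """Physical RAM in GiB via sysconf, or None where that is unavailable."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    return round(pages * page_size / 2**30, 1)


def collect_report(platform_name: str, location: str) -> dict:
    """Bundle hand-entered and auto-detected environment details."""
    return {
        "platform": platform_name,   # e.g. bare metal, vSphere, Proxmox
        "location": location,        # e.g. state / city
        "os": platform.platform(),
        "kernel": platform.release(),
        "arch": platform.machine(),
        "cpus": os.cpu_count(),
        "ram_gib": total_ram_gib(),
    }


# Example values only; replace with your own platform and location.
print(json.dumps(collect_report("Proxmox", "Houston, Texas"), indent=2))
```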
@HoustonDad Here are the results.
----- speed test -----
Ping: 53.103 ms
Download: 23.73 Mbit/s
Upload: 17.08 Mbit/s
----- without product -----
hauler:
real 8m38.141s
user 0m20.453s
sys 0m23.077s
----- with product -----
hauler: with product without key
real 0m10.670s
user 0m0.038s
sys 0m0.076s
hauler: with product with key
real 0m10.683s
user 0m0.098s
sys 0m0.196s
Fedora release 40 (Forty) on Windows 11 Pro
...........................................
CPU(s): 20
On-line CPU(s) list: 0-19
Vendor ID: GenuineIntel
Model name: 12th Gen Intel(R) Core(TM) i9-12900H
CPU family: 6
Model: 154
Thread(s) per core: 2
Core(s) per socket: 10
........................................
RAM:
total used free shared buff/cache available
Mem: 15Gi 574Mi 15Gi 2.4Mi 64Mi 14Gi
...................................
Speed Test (on WiFi) 186 Mbps download 152.6 Mbps upload Server: Chicago Google
I edited the manifest for Rancher 2.7.14 and took out 250+ lines and it still took 9 hours to download. The manifest for 2.7.14 is around 953 lines, correct?
@c-b-r It looks like something in that test failed out:
----- with product -----
hauler: with product without key
real 0m10.670s
user 0m0.038s
sys 0m0.076s
hauler: with product with key
real 0m10.683s
user 0m0.098s
sys 0m0.196s
Both of those ran for only 10 seconds, when it should have been at least 2 hours. Could you try to run some of those commands in the script manually to see what failed, fix that and run the test again?
Thanks!
Test Results from another system:
Platform (bare metal, vSphere, Proxmox, etc)
- Ubuntu 20.04.6 LTS running in Windows 11 WSL2
CPU/RAM (4CPU x 4G RAM, etc)
- 12 CPU x 6G RAM
Speedtest results
----- speed test -----
Ping: 13.551 ms
Download: 233.55 Mbit/s
Upload: 65.60 Mbit/s
Geographic area that you're pulling from (State / City would be amazing to help track down bad routes)
Houston, Texas (AT&T UVerse)
➜ ./test.sh
----- without product -----
hauler:
real 20m19.498s
user 0m57.803s
sys 0m34.921s
----- with product -----
hauler: with product without key
real 168m9.536s
user 9m53.409s
sys 3m30.358s
hauler: with product with key
real 199m53.807s
user 10m58.958s
sys 4m11.190s
Omaha NE. I was talking to you earlier in the day about minifying the manifest file. I'll try to run the script again tonight, I just ran it when I left without watching it.