fedora-copr/copr

Provide a few high-performance builders (for demanding software like chromium)

Closed this issue · 33 comments

Original issue: https://pagure.io/copr/copr/issue/2241
Opened: 2022-07-11 13:37:39
Opened by: daandemeyer

Trying to build Chromium for the fno-omit-frame-pointer proposal in copr fails after > 22h (see https://copr.fedorainfracloud.org/coprs/daandemeyer/fno-omit-frame-pointer/build/4621610/).

It'd be great to have beefier machines available for these kinds of larger builds.


praiskup commented at 2022-07-11 15:57:56:

We could configure a new pool of AWS VMs (with at least 1 machine preallocated) with more resources. If we wanted to "start a beefy VM on demand", an RFE would need to be implemented.

We would also need to allow users to specify custom build tags, e.g. an "expensive_build" tag, which would likely mean (a) longer waiting in the queue (the starting phase) but eventually (b) a faster overall build.

But I'm not sure what performance would be needed to finish the chromium build in
a reasonable time. Or what a reasonable time even is.


praiskup commented at 2022-07-11 16:10:29:

<Eighth_Doctor> being stuck at -j8 or -j10 will take forever
<praiskup> so do you claim 8-core machines will not help?
<Eighth_Doctor> it will not help
<Eighth_Doctor> you probably want at least 48 cores or more
<praiskup> what exactly will be achieved with that power?
<praiskup> build finished within 24 hours, or faster?
<Eighth_Doctor> yeah
<Eighth_Doctor> more cores makes it go way, way, way faster
<Eighth_Doctor> the only reason it's limited is because Koji's builders lock up with high parallelism
<Eighth_Doctor> but I know Tom removes the limit when building locally
<Eighth_Doctor> (on AWS)
<praiskup> https://pagure.io/copr/copr/issue/2241
<Eighth_Doctor> OBS uses a constraints declaration system to identify good builders for packages
<Eighth_Doctor> https://build.opensuse.org/package/view_file/openSUSE:Factory/chromium/_constraints?expand=1
<Eighth_Doctor> with their constraints, it took 2377 seconds to build for x86_64
<Eighth_Doctor> roughly 40 minutes
<praiskup> How many cores were used?
<Eighth_Doctor> -j11
<Eighth_Doctor> but it also looks like their build config is different from ours
<Eighth_Doctor> a few more system libraries than what we do

sergiomb commented at 2022-07-11 20:32:55:

Hi, I just hit the same problem, but only on Rawhide.

Please try building it on F36 first; it seems to me this is a Rawhide-specific problem.


praiskup commented at 2022-07-12 06:23:13:

Per an IRC chat with @sergiomb, the build is stuck on:

<sergiomb> now I see it stuck on python3 ../../tools/grit/grit.py -i ../../chrome/app/resources/locale_settings_linux.grd build -o gen/chrome --depdir . --depfile gen/chrome/app/resources/platform_locale_settings_grit.d --write-only-new=1 --depend-on-stamp -E root_gen_dir=gen -E root_src_dir=../../ -D SHARED_INTERMEDIATE_DIR=gen -D DEVTOOLS_GRD_PATH=gen/third_party/devtools-frontend/src/front_end/devtools_resources -D scale_factors=2x -D _chromium
 -E CHROMIUM_BUILD=chromium -D desktop_linux -D toolkit_views -D use_aura -D use_nss_certs -D use_ozone -D enable_arcore=false -D enable_background_mode=true -D enable_background_contents=true -D enable_extensions=true -D enable_hangout_services_extension=false -D enable_plugins=true -D enable_print_preview=true -D enable_printing=true -D enable_service_discovery=true -D enable_side_search=false -D enable_supervised_users=false -D
 enable_vr=false -D enable_webui_tab_strip=true -D safe_browsing_mode=1 -D optimize_webui=false -D enable_feed_v2=false -f gen/tools/gritsettings/default_resource_ids --assert-file-list obj/chrome/app/resources/platform_locale_settings_expected_outputs.txt

See also this Fedora Infra Koji issue.

This is now requested by the Tools team for the Red Hat Copr, too.
Last time we discussed this with @msuchy, there was a requirement to have reasonable
control over who can use which VMs. For example, once there's a beefy_x86_64 builder tag,
we should be able to administratively control who can use that tag.

Related PR for "on demand" machines with Resalloc: praiskup/resalloc#118

This has been merged in #2032. Resalloc 5.0 has been released recently. This is just waiting for a Copr release and the appropriate configuration.

I'm going to ping the folks who build chromium to find some beta testers; we should also have a doc page with all the pros/cons of this feature and how to request it.

no progress

@Conan-Kudo, also debating this in #2925: considering there will be another set of more powerful workers in Copr, what EC2 instance types do you suggest for x86_64 and aarch64?

Note that we will not start those machines in advance; they will be started only when there's actual demand.

For aarch64, I think c7g.12xlarge, and for x86_64, I think c7i.12xlarge would be tremendous improvements.

Those provide significant vCPU and memory boosts. The only remaining question might be whether you need dedicated NVMe IOPS, but I don't think most package builds have I/O demands heavy enough to require it.

It just occurred to me that Chromium and WebKitGTK are a class unto themselves, and c7g.16xlarge (AArch64) and c7i.16xlarge (x86_64) might be required to get them to be fast enough.

cc: @davdunc @davide125 @marcan

Since we apparently use Mock's tmpfs plugin, we probably want memory-optimized instances with large amounts of fast RAM. r7g.16xlarge (AArch64) and r7a.16xlarge (x86_64) are good choices for that.

cc: @davdunc @davide125 @marcan

Since apparently we use Mock's tmpfs plugin

We do, but we don't have to. Just saying, we are configuring a new thing.

I ran a few experiments, building packages in mock with the tmpfs plugin enabled and max_fs_size set to 500G.
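
For reference, mock configuration files are plain Python, and the tmpfs plugin is enabled there. A minimal sketch roughly matching this setup (the required_ram_mb value is an illustrative assumption, not quoted from the experiment):

    # /etc/mock/site-defaults.cfg -- mock configs are plain Python
    config_opts['plugin_conf']['tmpfs_enable'] = True
    config_opts['plugin_conf']['tmpfs_opts']['required_ram_mb'] = 2048  # assumed value
    config_opts['plugin_conf']['tmpfs_opts']['max_fs_size'] = '500g'    # as in the experiments below
    config_opts['plugin_conf']['tmpfs_opts']['mode'] = '0755'
    config_opts['plugin_conf']['tmpfs_opts']['keep_mounted'] = False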

kernel-6.5.5-200.fc38 on r7a.16xlarge (x86_64)

real    24m18.515s
user    406m58.063s
sys     111m12.999s

kernel-6.5.4-403.asahi.fc40 on r7g.16xlarge (aarch64)

real    60m34.282s
user    1094m37.196s
sys     265m24.761s

webkitgtk-2.42.1-1.fc38 on r7a.16xlarge (x86_64)

real    85m2.248s
user    1155m30.860s
sys     118m18.907s

chromium-117.0.5938.92-2.fc38 on r7g.16xlarge (aarch64)

real    81m46.966s
user    4237m34.593s
sys     389m28.201s

xsuchy commented

Checking https://aws.amazon.com/ec2/instance-types/#Memory_Optimized and I concur with Neal. That is:
r7g.16xlarge (AArch64) and r7a.16xlarge (x86_64)

marcan commented

Yup, ~1 hour builds sound reasonable from my POV.

Asahi aarch64 kernels build 4k+16k variants, so ~2x the cost of x86_64, hence ~2x the time, so that all looks normal. They take about ~1hr to build on an M1 Ultra Mac Studio too, so we can say r7g.16xlarge is ~equivalent performance to local builds on our biggest available machines, which is what you want (it sucks when local builds are faster than COPR, since it makes us want to avoid COPR somehow; if it's around the same, we're all good).

Just a quick cost check: r7g.16xlarge is $3.40/hour, so at ~80 minutes per build, one chroot costs ~$4.50. The old c7g.xlarge is $0.1450/hour, so even a 6-hour build costs <= $1. I know time matters, but building in the cloud has other benefits than just speed. And I'm curious whether e.g. @davdunc, as the sponsor, can comment on this.

Ah, I mixed up -> kernel is just 60 minutes, not 80. But still.
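
To make the per-build cost comparison explicit, a back-of-the-envelope sketch using the on-demand prices quoted above (real spot prices will differ):

    # Per-build cost = hourly price * build duration in hours.
    def build_cost(price_per_hour, build_minutes):
        return price_per_hour * build_minutes / 60.0

    print(build_cost(3.40, 60))     # r7g.16xlarge, 60-minute kernel build -> ~$3.40
    print(build_cost(0.145, 360))   # old c7g.xlarge, 6-hour build -> ~$0.87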

FTR, testing those instances on staging: Rawhide x86_64 and aarch64 builds got the powerful instances, Fedora 39 got the normal ones.

@praiskup I think that it's reasonable for a handful of workloads here. We might want to look at the statistics there. The 2xlarge suggested by @Conan-Kudo is a good alternative. We do need to be frugal right now because the costing is under scrutiny.
That's not our fault, but it's something that we have to be cautious about.
Increasing job costs by $50/month isn't significant, but if it heads north of $500/month, I'll be asked about it and we'll need to have a strong justification.
My suggestion would be to set a soft limit of ~20 of these types of jobs if we were to identify a much larger instance type.

Perhaps there's a balance in the middle between $3.40/hr and $0.1450/hr that gives better performance while keeping the total cost of a build under $1.

That would be the r7{a,g}.2xlarge instance type. It would double the CPU cores and substantially increase the amount of RAM for roughly 40-60 cents an hour. This would be a reasonable upgrade to the default builders and would get us much closer to the build performance of the Koji builders.

Btw., the testing build with the powerful x86 builder took
2023-10-09 12:35:33,148 => 2023-10-09 13:27:05,719, so something like ~52
minutes. Time spent on the builder itself was much shorter though:

[2023-10-09 12:37:53,442][  INFO][PID:1210202] Starting remote build: copr-rpmbuild --verbose --drop-resultdir --task-url https://copr.stg.fedoraproject.org/backend/get-build-task/2915627-fedora-rawhide-x86_64 --chroot fedora-rawhide-x86_64 --detached
[2023-10-09 12:37:53,717][  INFO][PID:1210202] Downloading the builder-live.log file, attempt 1
[2023-10-09 13:02:54,185][  INFO][PID:1210202] Downloading results from builder
[2023-10-09 13:05:38,078][  INFO][PID:1210202] Releasing VM back to pool

28 minutes. It takes about 2 minutes for the machine to start from scratch
on demand, but anyway. It is alarming that the rest of the build process
takes more time than the build itself, namely
2023-10-09 13:05:38,097 => 2023-10-09 13:27:04,388 (22 minutes spent on
signatures).
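
The deltas come straight from the log timestamps; e.g., the on-builder time can be double-checked like this:

    # Double-checking the on-builder time from the two log timestamps above.
    from datetime import datetime

    fmt = "%Y-%m-%d %H:%M:%S,%f"
    start = datetime.strptime("2023-10-09 12:37:53,442", fmt)
    end = datetime.strptime("2023-10-09 13:05:38,078", fmt)
    print(end - start)   # 0:27:44.636000 -> ~28 minutes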

The "signature" slowness is a separate problem, will be tracked here #2757.

The previous testing chroot builds were assigned to our on-premise VMs, which
had their own problems (#2869), so that exercise was entirely invalid.

Running yet another experiment. This time the task allocation goes to the correct EC2 instances (ppc64le is still on-premise but with a fixed disk layout; s390x is on IBM Cloud):

11415 - state=OPEN tags=arch_aarch64,copr_builder resource=aws_aarch64_normal_dev_02814096_20231012_131851
11417 - state=OPEN tags=arch_ppc64le,copr_builder resource=copr_hv_ppc64le_01_dev_02814086_20231012_093733
11418 - state=OPEN tags=arch_x86_64,copr_builder resource=aws_x86_64_spot_dev_02814092_20231012_131806
11421 - state=OPEN tags=arch_ppc64le,copr_builder resource=copr_hv_ppc64le_02_dev_02814108_20231012_132338
11420 - state=OPEN tags=arch_aarch64,on_demand_powerful,copr_builder resource=aws_aarch64_powerful_spot_dev_02814139_20231012_133510
11419 - state=OPEN tags=arch_x86_64,on_demand_powerful,copr_builder resource=aws_x86_64_powerful_spot_dev_02814143_20231012_133526
11416 - state=OPEN tags=arch_s390x,copr_builder resource=copr_ibm_cloud_s390x_tokyo_dev_02814137_20231012_133510

The experimental build failed for fedora-39-aarch64 because it had no swap. Let's do another one.

And to make the experiment complete, one x86 on-premise build.

I installed this testing configuration:
https://pagure.io/fedora-infra/ansible/c/424a9259fcb0b36abba09e9e25111951aebf6db9

Testing builds:
https://copr.stg.fedoraproject.org/coprs/g/copr/measure-hypervisor/build/2915642/
https://copr.stg.fedoraproject.org/coprs/g/copr/measure-aws/build/2915643/
https://copr.stg.fedoraproject.org/coprs/g/copr/measure-aws-powerful/build/2915644/
https://copr.stg.fedoraproject.org/coprs/g/copr/measure-ibm-cloud/build/2915645/
https://copr.stg.fedoraproject.org/coprs/g/copr/measure-hv-p08/build/2915646/

11452 - state=OPEN tags=aws,copr_builder,arch_aarch64 resource=aws_aarch64_normal_dev_02814527_20231012_162751
11453 - state=OPEN tags=aws,copr_builder,arch_x86_64 resource=aws_x86_64_spot_dev_02814883_20231014_121957
11454 - state=OPEN tags=copr_builder,arch_ppc64le,arch_power8 resource=copr_hv_ppc64le_01_dev_02814140_20231012_133512
11455 - state=OPEN tags=arch_power9,copr_builder,arch_ppc64le,hypervisor resource=copr_p09_01_dev_02814521_20231012_162546
11457 - state=OPEN tags=copr_builder,arch_x86_64,hypervisor resource=copr_hv_x86_64_02_dev_02814877_20231014_121947
11451 - state=OPEN tags=aws,copr_builder,arch_x86_64,on_demand_powerful resource=aws_x86_64_powerful_spot_dev_02814890_20231016_074146
11450 - state=OPEN tags=aws,copr_builder,arch_aarch64,on_demand_powerful resource=aws_aarch64_powerful_spot_dev_02814891_20231016_074146
11449 - state=OPEN tags=copr_builder,arch_s390x resource=copr_ibm_cloud_s390x_tokyo_dev_02814889_20231016_074121

Nah, the normal AWS builds timed out -> I resubmitted the build with a higher timeout (but it seems obvious that the VMs on our hypervisors are faster than the normal AWS machines).

Some stats from the experiment above.

How come the ppc64le and s390x kernel builds produce a limited set of built artifacts? I suppose that's also the reason why they finish much faster.

Ok, did a few more tests (see the stats document) -> and I think we'll go with:

             normal      on demand
    x86_64   c7i.large   c7i.8xlarge
    aarch64  c7g.large   c7g.8xlarge

The i4i.large was probably very sub-optimal for everything. The 400G disk
allocated there wasn't really needed (while being quite expensive), and the
CPU/memory were slower. As a result, a chromium build on i4i.large is at least
2x more expensive than on c7i.8xlarge, despite also being at least 16x slower.
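
A hypothetical illustration of why the slower-but-cheaper instance still loses on total build cost (the hourly prices and the c7i.8xlarge build time here are assumptions for the sake of the example, not numbers from this thread):

    # Illustration only: assumed on-demand prices and an assumed build time.
    i4i_large_price = 0.17      # USD/hour, assumed
    c7i_8xlarge_price = 1.43    # USD/hour, assumed
    c7i_build_hours = 1.5       # assumed chromium build time on c7i.8xlarge
    i4i_build_hours = c7i_build_hours * 16   # "at least 16x slower"

    print(i4i_large_price * i4i_build_hours)     # ~$4.1 on i4i.large
    print(c7i_8xlarge_price * c7i_build_hours)   # ~$2.1 on c7i.8xlarge, i.e. ~2x cheaper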

I am still waiting for the final kernel results, but I bet those will be quite
similar.

Note that we still have 80 VM builders "in-house", in the Fedora Infra lab.
These have 2 CPUs and 8G RAM (overall they are slightly faster than i4i.large
but slower than the future "normal" AWS boxes, c7i.large), and they are able to
finish kernel builds within 4.5 hours. While I think there's potential for
"three performance categories" of builders (e.g. normal, medium, powerful)
in the future, at this point I'd prefer to have just two, to keep the
system relatively easy to maintain and to learn the behavior patterns (we need
to keep the in-house VMs roughly similar to those in EC2, because we always
prefer the in-house VMs over EC2 if any are free).

Any objections? Any volunteers to try the powerful instances so I can
enable the feature for you? Filing a new ticket for this is preferred;
please specify:

  • the Copr project name,
  • the chroot(s), and
  • the package name.

Note that there will be a limited number of "powerful" machines (users who
request them will likely have to wait for each other from time to time), and
they will be started only on demand (no machines are preallocated, meaning you
will have to wait roughly 2 to 5 minutes until the machine comes up).

Request made: #2966

What would you consider the threshold for requesting one of these more powerful builders, e.g. a build time over X hours?

You cannot request one or more builders like that directly; we can only declare/configure for you that builds of "package foo in project bar" will always be done on a powerful builder. We currently allow up to 10 such builders for both x86_64 and aarch64 in parallel. The builders are shared by everyone, which of course has some consequences (standing in the queue from time to time). So it is mostly up to you, but I probably wouldn't use them for builds that take less than one hour in total.