fedora-copr/copr

Provide a few high-performance builders (for demanding software like chromium)

Closed this issue · 33 comments

Original issue: https://pagure.io/copr/copr/issue/2241
Opened: 2022-07-11 13:37:39
Opened by: daandemeyer

Trying to build Chromium for the fno-omit-frame-pointer proposal in copr fails after > 22h (see https://copr.fedorainfracloud.org/coprs/daandemeyer/fno-omit-frame-pointer/build/4621610/).

It'd be great to have beefier machines available for these kinds of larger builds.


praiskup commented at 2022-07-11 15:57:56:

We could configure a new pool of AWS VMs (with at least 1 machine preallocated) with more resources. If we wanted to "start a beefy VM on demand", an RFE would need to be implemented.

We would also need to allow users to specify custom build tags, e.g. an "expensive_build" tag, which would likely mean (a) longer waiting in the queue (the starting phase) but eventually (b) a faster overall build.

But I'm not sure what performance would be needed to finish the chromium build in
a reasonable time. Or what a reasonable time even is.


praiskup commented at 2022-07-11 16:10:29:

<Eighth_Doctor> being stuck at -j8 or -j10 will take forever
<praiskup> so do you claim 8-core machines will not help?
<Eighth_Doctor> it will not help
<Eighth_Doctor> you probably want at least 48 cores or more
<praiskup> what exactly will be achieved with that power?
<praiskup> build finished within 24 hours, or faster?
<Eighth_Doctor> yeah
<Eighth_Doctor> more cores makes it go way, way, way faster
<Eighth_Doctor> the only reason it's limited is because Koji's builders lock up with high parallelism
<Eighth_Doctor> but I know Tom removes the limit when building locally
<Eighth_Doctor> (on AWS)
<praiskup> https://pagure.io/copr/copr/issue/2241
<Eighth_Doctor> OBS uses a constraints declaration system to identify good builders for packages
<Eighth_Doctor> https://build.opensuse.org/package/view_file/openSUSE:Factory/chromium/_constraints?expand=1
<Eighth_Doctor> with their constraints, it took 2377 seconds to build for x86_64
<Eighth_Doctor> roughly 40 minutes
<praiskup> How many cores were used?
<Eighth_Doctor> -j11
<Eighth_Doctor> but it also looks like their build config is different from ours
<Eighth_Doctor> a few more system libraries than what we do

sergiomb commented at 2022-07-11 20:32:55:

Hi, I just hit the same problem, but only on Rawhide.

Please try building it on F36 first; it seems to me this is a Rawhide-specific problem.


praiskup commented at 2022-07-12 06:23:13:

Per an IRC chat with @sergiomb, the build is stuck on:

<sergiomb> now I see it stuck on python3 ../../tools/grit/grit.py -i ../../chrome/app/resources/locale_settings_linux.grd build -o gen/chrome --depdir . --depfile gen/chrome/app/resources/platform_locale_settings_grit.d --write-only-new=1 --depend-on-stamp -E root_gen_dir=gen -E root_src_dir=../../ -D SHARED_INTERMEDIATE_DIR=gen -D DEVTOOLS_GRD_PATH=gen/third_party/devtools-frontend/src/front_end/devtools_resources -D scale_factors=2x -D _chromium
 -E CHROMIUM_BUILD=chromium -D desktop_linux -D toolkit_views -D use_aura -D use_nss_certs -D use_ozone -D enable_arcore=false -D enable_background_mode=true -D enable_background_contents=true -D enable_extensions=true -D enable_hangout_services_extension=false -D enable_plugins=true -D enable_print_preview=true -D enable_printing=true -D enable_service_discovery=true -D enable_side_search=false -D enable_supervised_users=false -D
 enable_vr=false -D enable_webui_tab_strip=true -D safe_browsing_mode=1 -D optimize_webui=false -D enable_feed_v2=false -f gen/tools/gritsettings/default_resource_ids --assert-file-list obj/chrome/app/resources/platform_locale_settings_expected_outputs.txt

See also this Fedora Infra Koji issue.

This is now requested by the Tools team for the Red Hat Copr, too.
Last time we discussed this with @msuchy, there was a requirement to have reasonable
control over who can use which VMs. For example, once there's a beefy_x86_64 builder tag,
we should be able to administratively control who can use that tag.

Related PR for "on demand" machines with Resalloc: praiskup/resalloc#118

This has been merged in #2032. Resalloc 5.0 has been released recently. This is just waiting for a Copr release and the appropriate configuration.

I'm going to ping the folks who build chromium to find some beta testers; we should also have a doc page with all the pros/cons of this feature and how to request it.

no progress

@Conan-Kudo, also debating this in #2925: considering there will be another set of more powerful workers in Copr, what EC2 instance types do you suggest for x86_64 and aarch64?

Note that we will not start those machines in advance; they will be started only when there's actual demand.

For aarch64, I think c7g.12xlarge, and for x86_64, I think c7i.12xlarge would be tremendous improvements.

Those provide significant vCPU and memory boosts. The only remaining question might be whether you need dedicated NVMe IOPS, but I don't think most package builds have I/O demands heavy enough to require it.

It just occurred to me that Chromium and WebKitGTK are a class unto themselves, and c7g.16xlarge (AArch64) and c7i.16xlarge (x86_64) might be required to get them to be fast enough.

cc: @davdunc @davide125 @marcan

Since we apparently use Mock's tmpfs plugin, we probably want memory-optimized instances with large amounts of fast RAM. r7g.16xlarge (AArch64) and r7a.16xlarge (x86_64) are good choices for that.

cc: @davdunc @davide125 @marcan

Since apparently we use Mock's tmpfs plugin

We do, but we don't have to. Just saying, we are configuring a new thing.

I ran a few experiments, building packages in mock with the tmpfs plugin enabled and max_fs_size set to 500G.
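
For reference, mock configuration files are plain Python, and the tmpfs plugin is enabled there. A minimal sketch roughly matching this setup (the required_ram_mb value is an illustrative assumption, not quoted from the experiment):

    # /etc/mock/site-defaults.cfg -- mock configs are plain Python
    config_opts['plugin_conf']['tmpfs_enable'] = True
    config_opts['plugin_conf']['tmpfs_opts']['required_ram_mb'] = 2048  # assumed value
    config_opts['plugin_conf']['tmpfs_opts']['max_fs_size'] = '500g'    # as in the experiments below
    config_opts['plugin_conf']['tmpfs_opts']['mode'] = '0755'
    config_opts['plugin_conf']['tmpfs_opts']['keep_mounted'] = False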

kernel-6.5.5-200.fc38 on r7a.16xlarge (x86_64)

real    24m18.515s
user    406m58.063s
sys     111m12.999s

kernel-6.5.4-403.asahi.fc40 on r7g.16xlarge (aarch64)

real    60m34.282s
user    1094m37.196s
sys     265m24.761s

webkitgtk-2.42.1-1.fc38 on r7a.16xlarge (x86_64)

real    85m2.248s
user    1155m30.860s
sys     118m18.907s

chromium-117.0.5938.92-2.fc38 on r7g.16xlarge (aarch64)

real    81m46.966s
user    4237m34.593s
sys     389m28.201s

xsuchy commented

Checking https://aws.amazon.com/ec2/instance-types/#Memory_Optimized and I concur with Neal. That is:
r7g.16xlarge (AArch64) and r7a.16xlarge (x86_64)

marcan commented

Yup, ~1 hour builds sound reasonable from my POV.

Asahi aarch64 kernels build 4k+16k variants, so ~2x the cost of x86_64, hence ~2x the time, so that all looks normal. They take about ~1hr to build on an M1 Ultra Mac Studio too, so we can say r7g.16xlarge is ~equivalent performance to local builds on our biggest available machines, which is what you want (it sucks when local builds are faster than COPR, since it makes us want to avoid COPR somehow; if it's around the same, we're all good).

Just a quick cost check: r7g.16xlarge is $3.40/hour, so at ~80 minutes per build, one chroot costs ~$4.50. The old c7g.xlarge is $0.1450/hour, so even a 6-hour build costs <= $1. I know time matters, but building in the cloud has other benefits than just speed. And I'm curious whether e.g. @davdunc, as the sponsor, can comment on this.

Ah, I mixed up -> kernel is just 60 minutes, not 80. But still.
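
To make the per-build cost comparison explicit, a back-of-the-envelope sketch using the on-demand prices quoted above (real spot prices will differ):

    # Per-build cost = hourly price * build duration in hours.
    def build_cost(price_per_hour, build_minutes):
        return price_per_hour * build_minutes / 60.0

    print(build_cost(3.40, 60))     # r7g.16xlarge, 60-minute kernel build -> ~$3.40
    print(build_cost(0.145, 360))   # old c7g.xlarge, 6-hour build -> ~$0.87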

FTR, testing those instances on staging: Rawhide x86_64 and aarch64 builds got the powerful instances, Fedora 39 got the normal ones.

@praiskup I think that it's reasonable for a handful of workloads here. We might want to look at the statistics there. The 2xlarge suggested by @Conan-Kudo is a good alternative. We do need to be frugal right now because the costing is under scrutiny.
That's not our fault, but it's something that we have to be cautious about.
Increasing job costs by $50/month isn't significant, but if it heads north of $500/month, I'll be asked about it and we'll need to have a strong justification.
My suggestion would be to set a soft limit of ~20 of these types of jobs if we were to identify a much larger instance type.

Perhaps there's a balance in the middle between $3.40/hr and $0.1450/hr that gives better performance while keeping the total cost of a build under $1.

That would be the r7{a,g}.2xlarge instance type. It would double the CPU cores and substantially increase the amount of RAM for roughly 40-60 cents an hour. This would be a reasonable upgrade to the default builders and would get us much closer to the build performance of the Koji builders.

Btw., the testing build with the powerful x86 builder took
2023-10-09 12:35:33,148 => 2023-10-09 13:27:05,719, so something like ~52
minutes. Time spent on the builder itself was much shorter though:

[2023-10-09 12:37:53,442][  INFO][PID:1210202] Starting remote build: copr-rpmbuild --verbose --drop-resultdir --task-url https://copr.stg.fedoraproject.org/backend/get-build-task/2915627-fedora-rawhide-x86_64 --chroot fedora-rawhide-x86_64 --detached
[2023-10-09 12:37:53,717][  INFO][PID:1210202] Downloading the builder-live.log file, attempt 1
[2023-10-09 13:02:54,185][  INFO][PID:1210202] Downloading results from builder
[2023-10-09 13:05:38,078][  INFO][PID:1210202] Releasing VM back to pool

28 minutes. It takes about 2 minutes for the machine to start from scratch
on demand, but anyway. It is alarming that the rest of the build process
takes more time than the build itself, namely
2023-10-09 13:05:38,097 => 2023-10-09 13:27:04,388 (22 minutes spent on
signatures).
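
The deltas come straight from the log timestamps; e.g., the on-builder time can be double-checked like this:

    # Double-checking the on-builder time from the two log timestamps above.
    from datetime import datetime

    fmt = "%Y-%m-%d %H:%M:%S,%f"
    start = datetime.strptime("2023-10-09 12:37:53,442", fmt)
    end = datetime.strptime("2023-10-09 13:05:38,078", fmt)
    print(end - start)   # 0:27:44.636000 -> ~28 minutes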

The "signature" slowness is a separate problem, will be tracked here #2757.

The previous testing chroot builds were assigned to our on-premise VMs, which
had their own problems (#2869), so that exercise was entirely invalid.

Running yet another experiment. This time the task allocation goes to the correct EC2 instances (ppc64le is still on-premise but with a fixed disk layout; s390x is on IBM Cloud):

11415 - state=OPEN tags=arch_aarch64,copr_builder resource=aws_aarch64_normal_dev_02814096_20231012_131851
11417 - state=OPEN tags=arch_ppc64le,copr_builder resource=copr_hv_ppc64le_01_dev_02814086_20231012_093733
11418 - state=OPEN tags=arch_x86_64,copr_builder resource=aws_x86_64_spot_dev_02814092_20231012_131806
11421 - state=OPEN tags=arch_ppc64le,copr_builder resource=copr_hv_ppc64le_02_dev_02814108_20231012_132338
11420 - state=OPEN tags=arch_aarch64,on_demand_powerful,copr_builder resource=aws_aarch64_powerful_spot_dev_02814139_20231012_133510
11419 - state=OPEN tags=arch_x86_64,on_demand_powerful,copr_builder resource=aws_x86_64_powerful_spot_dev_02814143_20231012_133526
11416 - state=OPEN tags=arch_s390x,copr_builder resource=copr_ibm_cloud_s390x_tokyo_dev_02814137_20231012_133510

The experimental build failed for fedora-39-aarch64 because it had no swap. Let's do another one.

And to make the experiment complete, one x86 on-premise build.

I installed this testing configuration:
https://pagure.io/fedora-infra/ansible/c/424a9259fcb0b36abba09e9e25111951aebf6db9

Testing builds:
https://copr.stg.fedoraproject.org/coprs/g/copr/measure-hypervisor/build/2915642/
https://copr.stg.fedoraproject.org/coprs/g/copr/measure-aws/build/2915643/
https://copr.stg.fedoraproject.org/coprs/g/copr/measure-aws-powerful/build/2915644/
https://copr.stg.fedoraproject.org/coprs/g/copr/measure-ibm-cloud/build/2915645/
https://copr.stg.fedoraproject.org/coprs/g/copr/measure-hv-p08/build/2915646/

11452 - state=OPEN tags=aws,copr_builder,arch_aarch64 resource=aws_aarch64_normal_dev_02814527_20231012_162751
11453 - state=OPEN tags=aws,copr_builder,arch_x86_64 resource=aws_x86_64_spot_dev_02814883_20231014_121957
11454 - state=OPEN tags=copr_builder,arch_ppc64le,arch_power8 resource=copr_hv_ppc64le_01_dev_02814140_20231012_133512
11455 - state=OPEN tags=arch_power9,copr_builder,arch_ppc64le,hypervisor resource=copr_p09_01_dev_02814521_20231012_162546
11457 - state=OPEN tags=copr_builder,arch_x86_64,hypervisor resource=copr_hv_x86_64_02_dev_02814877_20231014_121947
11451 - state=OPEN tags=aws,copr_builder,arch_x86_64,on_demand_powerful resource=aws_x86_64_powerful_spot_dev_02814890_20231016_074146
11450 - state=OPEN tags=aws,copr_builder,arch_aarch64,on_demand_powerful resource=aws_aarch64_powerful_spot_dev_02814891_20231016_074146
11449 - state=OPEN tags=copr_builder,arch_s390x resource=copr_ibm_cloud_s390x_tokyo_dev_02814889_20231016_074121

Nah, the normal AWS builds timed out -> I resubmitted the build with a higher timeout (but it seems obvious that the VMs on our hypervisors are faster than the normal AWS machines).

Some stats from the experiment above.

How come the ppc64le and s390x kernel builds produce a limited set of built artifacts? I suppose that's also the reason why they finish much faster.

Ok, did a few more tests (see the stats document) -> and I think we'll go with:

             normal      on demand
    x86_64   c7i.large   c7i.8xlarge
    aarch64  c7g.large   c7g.8xlarge

The i4i.large was probably very sub-optimal for everything. The 400G disk
allocated there wasn't really needed (while being quite expensive), and the
CPU/memory were slower. As a result, a chromium build on i4i.large is at least
2x more expensive than on c7i.8xlarge, despite also being at least 16x slower.
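
A hypothetical illustration of why the slower-but-cheaper instance still loses on total build cost (the hourly prices and the c7i.8xlarge build time here are assumptions for the sake of the example, not numbers from this thread):

    # Illustration only: assumed on-demand prices and an assumed build time.
    i4i_large_price = 0.17      # USD/hour, assumed
    c7i_8xlarge_price = 1.43    # USD/hour, assumed
    c7i_build_hours = 1.5       # assumed chromium build time on c7i.8xlarge
    i4i_build_hours = c7i_build_hours * 16   # "at least 16x slower"

    print(i4i_large_price * i4i_build_hours)     # ~$4.1 on i4i.large
    print(c7i_8xlarge_price * c7i_build_hours)   # ~$2.1 on c7i.8xlarge, i.e. ~2x cheaper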

I am still waiting for the final kernel results, but I bet those will be quite
similar.

Note that we still have 80 VM builders "in-house", in the Fedora Infra lab.
These have 2 CPUs and 8G RAM (overall they are slightly faster than i4i.large
but slower than the future "normal" AWS boxes, c7i.large), and they are able to
finish kernel builds within 4.5 hours. While I think there's potential for
"three performance categories" of builders (e.g. normal, medium, powerful)
in the future, at this point I'd prefer to have just two, to keep the
system relatively easy to maintain and to learn the behavior patterns (we need
to keep the in-house VMs roughly similar to those in EC2, because we always
prefer the in-house VMs over EC2 if any are free).

Any objections? Any volunteers to try the powerful instances so I can
enable the feature for you? Filing a new ticket for this is preferred;
please specify:

  • the Copr project name,
  • the chroot(s), and
  • the package name.

Note that there will be a limited number of "powerful" machines (users who
request them will likely have to wait for each other from time to time), and
they will be started only on demand (no machines are preallocated, meaning you
will have to wait roughly 2 to 5 minutes until the machine comes up).

Request made: #2966

What would you consider the threshold for requesting one of these more powerful builders, e.g. a build time over X hours?

You cannot request one or more builders like that directly; we can only declare/configure for you that builds of "package foo in project bar" will always be done on a powerful builder. We currently allow up to 10 such builders for both x86_64 and aarch64 in parallel. The builders are shared by everyone, which of course has some consequences (standing in the queue from time to time). So it is mostly up to you, but I probably wouldn't use them for builds that take less than one hour in total.