uyuni-project/uyuni

Slow build/promotion of channel - DB hangs on "INSERT INTO rhnServerNeededCache"

mateusz-tyborski opened this issue · 6 comments

Problem description

After upgrading from Uyuni 2023.09 to 2024.05 we noticed a performance problem when building or promoting channels that have systems registered to them.
During investigation we traced the problem to this query:

INSERT INTO rhnServerNeededCache (server_id, package_id, channel_id)
                  ( SELECT DISTINCT
                     S.id as server_id,
                     P.id as package_id,
                     320 as channel_id
                        FROM
                             susePackageExcludingPartOfPtf P,
                             rhnServerPackageArchCompat SPAC,
                             rhnPackageEVR P_EVR,
                             rhnPackageEVR SP_EVR,
                             rhnServerPackage SP,
                             rhnServerChannel SC,
                             rhnServer S

                        WHERE
                                 SC.channel_id = 320
                          AND    SC.server_id = S.id
                          AND    p.id in (322676,322661,322721,312369,312855,322678,312872,322718,312364,322647,312861,308684,312356,307030,306979,344938,306984,308171,307020,307011,308693,312862,308681,308463,322675,308562,308683,312869,322681,322712,312858,308566,344935,322694,322697,322726,308447,312377,308689,306987,308466,322731,322688,307001,312863,306990,307000,307017,306991,312877,322707,344798,322720,312873,312361,308555,322696,308691,322695,308687,322660,308485,322691,307013,308169,322677,308690,308686,312371,322662,312367,308481,308557,306994,312351,308561,312876,312349,312859,306977,344804,308680,312354,312860,306980,306992,312359,306985,312368,344800,307009,322727,307021,322685,322679,322687,307024,322724,322698,308168,306996,312882,322669,322709,306997,322700,306993,344801,306988,322666,322664,312366,322711,322684,308461,322665,308170,308563,312880,322674,307007,322667,312879,312857,322730,312352,344805,322655,322651,311824,307033,306981,312353,344797,307002,307012,308692,322663,311820,312866,307025,312883,312864,322732,344939,308564,312878,308465,308484,312868,344934,312365,306998,312870,322650,312360,312865,306974,322659,308482,312362,308552,308682,307029,306995,306975,322668,322683,307032,308553,322652,307018,312867,306989,308688,307004,308565,308558,312348,311823,322672,344941,344802,322682,307008,322689,312350,306986,322658,307028,322690,322699,322706,322716,307015,312376,312357,322722,322717,322656,307016,322693,312373,307006,322733,322686,312358,308462,322714,344803,322705,311821,312363,322713,322701,322649,322723,344796,312372,322670,322657,322702,312875,307023,307031,322671,344937,322708,344799,344806,308554,344940,311822,312874,312370,312856,312881,312355,308464,322715,306973,322692,312374,308560,311819,322703,312375,308480,306999,307026,307005,307014,322704,312346,312378,322729,307010,322719,312347,308685,307027,306978,308559,322648,322680,307022,322653,322728,308556,322654,312871,306983,308483,308460,306982,344936,322710,322673,322725)
                          AND    NOT EXISTS (SELECT 1 FROM suseServerChannelsRetractedPackagesView WHERE sid = S.id AND pid = P.id)
                          AND    NOT EXISTS (SELECT 1 FROM suseServerAppStreamHiddenPackagesView   WHERE sid = S.id AND pid = P.id)
                          AND    p.package_arch_id = spac.package_arch_id
                          AND    spac.server_arch_id = s.server_arch_id
                          AND    SP_EVR.id = SP.evr_id
                          AND    P_EVR.id = P.evr_id
                          AND    SP.server_id = S.id
                          AND    SP.name_id = P.name_id
                          AND    SP.evr_id != P.evr_id
                          AND    (SP_EVR.evr).type = (P_EVR.evr).type
                          AND    SP_EVR.evr <  P_EVR.evr
                          AND    SP_EVR.evr = (SELECT MAX(PE.evr) FROM
                                          rhnServerPackage SP2, rhnPackageEvr PE WHERE PE.id = SP2.evr_id
                                          AND SP2.server_id = SP.server_id AND SP2.name_id = SP.name_id)
                          AND NOT EXISTS (SELECT 1 -- prevent inserting in case cache already contains the row (possibly with non-null errata_id)
                                          FROM rhnServerNeededCache EXISTING
                                          WHERE EXISTING.server_id = S.id
                                          AND EXISTING.package_id = p.id
                                          AND EXISTING.channel_id = 320))

It took PostgreSQL around 2 hours to produce a single result row: 1000010070 311819 320.
311819 is the id of the package ansible-9.2.0-1.el8.noarch.rpm from the EPEL repo.
The interesting part is that if the query is run with only this one package id (311819), it finishes immediately.
If it is run without this package (but with all the others, which are not installed on the system), it finishes in ~3s.
Maybe there should be a loop over each package, to avoid millions of loop iterations inside a single query (an EXPLAIN of the query run is in the attachment
data-1718374537060.csv).
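The per-package loop suggested above could be sketched in PL/pgSQL. This is a hypothetical illustration, not the actual Taskomatic code; the package ids are examples, and the elided joins/filters would be copied unchanged from the original query:

```sql
-- Hypothetical sketch: run the cache INSERT once per package id
-- instead of a single query with a large IN list.
DO $$
DECLARE
    pkg_id bigint;
BEGIN
    FOREACH pkg_id IN ARRAY ARRAY[322676, 322661, 311819]  -- example ids
    LOOP
        INSERT INTO rhnServerNeededCache (server_id, package_id, channel_id)
        SELECT DISTINCT S.id, P.id, 320
        FROM susePackageExcludingPartOfPtf P
             -- ... same joins and filters as the original query,
             -- but with a single-id predicate instead of "p.id IN (...)":
        WHERE P.id = pkg_id;
    END LOOP;
END $$;
```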

We noticed this issue for CentOS 7, Rocky 8, SUSE 12 and SUSE 15 channels, so it does not depend on the OS version.
A second build/promotion does not have the issue, but that is probably because there are no new packages to check.
As a workaround we temporarily switch systems to other channels for the duration of the build/promotion and switch them back afterwards.

We also noticed that the query executes fast without the condition AND NOT EXISTS (SELECT 1 FROM suseServerAppStreamHiddenPackagesView WHERE sid = S.id AND pid = P.id), so maybe there is a problem with the view suseServerAppStreamHiddenPackagesView.
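To narrow this down further, one could time the suspect view in isolation for a server/package pair the slow query touches. A sketch, with the server and package ids taken from the slow result row mentioned earlier:

```sql
-- Hypothetical check: does probing the view alone already account
-- for the runtime, or only its combination with the outer query?
EXPLAIN (ANALYZE, BUFFERS)
SELECT 1
FROM suseServerAppStreamHiddenPackagesView
WHERE sid = 1000010070   -- server id from the slow result row
  AND pid = 311819;      -- package id from the slow result row
```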

Steps to reproduce

  1. Build or promote channels with a system attached (the previous version needs to be old enough to have updates for the attached system)
  2. Check whether there is a hanging (15 minutes or more) query in SQL containing "INSERT INTO rhnServerNeededCache"
  3. There should be no new Taskomatic log entries regarding channel metadata generation for this build/promotion

or

  1. Get at least 2 package ids of 1 system from rhnServerNeededCache
  2. Put them in the query's WHERE clause: "AND p.id IN ( id1, id2, ... )"
  3. Run query
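The steps above amount to picking a few ids from the cache table and substituting them into the problematic query. A sketch (the ids shown are hypothetical):

```sql
-- Step 1: pick two package ids for one server from the cache table.
SELECT server_id, package_id
FROM rhnServerNeededCache
LIMIT 2;

-- Step 2: substitute them into the problematic query's WHERE clause, e.g.:
--   AND p.id IN (311819, 312369)
```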

Uyuni version

2024.05-230900.217.1.uyuni3

Uyuni proxy version (if used)

No response

Useful logs

No response

Additional information

The issue is on both our independent servers.
DB consume only CPU, so it is not storage bottleneck.

@cbbayburt AFAIK you have already looked into performance issues. Did you have this query on your radar?

Hi @mateusz-tyborski, thanks for the report. We're aware of the performance issues with this specific query and are currently working on improving it.

It'd help us further if you could provide the following information about your environment:

  • How many rows exist in the following tables?
    • rhnServerNeededCache
    • rhnPackage
    • rhnServerPackage
  • How many systems are registered in total?
  • How many systems are assigned to the channel you're trying to promote?
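The requested counts can be gathered with plain count(*) queries; this sketch assumes registered systems correspond to rows in rhnServer:

```sql
SELECT 'rhnServerNeededCache' AS tbl, count(*) FROM rhnServerNeededCache
UNION ALL
SELECT 'rhnPackage', count(*) FROM rhnPackage
UNION ALL
SELECT 'rhnServerPackage', count(*) FROM rhnServerPackage
UNION ALL
SELECT 'rhnServer (systems)', count(*) FROM rhnServer;
```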

Instance 1:
rhnserverneededcache: 851
rhnPackage: 262447
rhnServerPackage: 4892
system number: 7

Instance 2:
rhnserverneededcache: 30093
rhnPackage: 416531
rhnServerPackage: 249449
system number: 204

How many systems are assigned to the channel you're trying to promote?

There are 1-7 systems per channel. The issue occurs even if there is only 1 system on a specific channel. From my observations, the issue appears when there are new packages in the channel; if there are no new packages, the promotion finishes in several minutes. Probably if there were only one package (so only one package id in the "AND p.id in (<some_number>)" clause of the problematic query), it would also execute fast.

I suggest optimizing the problematic query so that it performs fast in both cases: one package and many packages in the WHERE clause.
In our environments it is possible to reproduce the issue using the steps I've written in the "Steps to reproduce" paragraph.

I am also seeing this in my environments.
Is this getting fixed in the next version?

We are currently working on offloading some of the work away from this query to improve its performance. I can't tell yet if it'll be in the next update, but I'll update here as we progress.

ZsBT commented

I have the same issue; building custom channels hangs forever. Peeking into the database, the same statement keeps running.
EXPLAIN shows extreme costs:
Unique  (cost=6082629.82..6082629.89 rows=1 width=19)
  ->  Merge Anti Join  (cost=6082629.82..6082629.89 rows=1 width=19)
        Merge Cond: ((s.id = existing.server_id) AND (pkg.id = existing.package_id))
        ->  Sort  (cost=6082620.06..6082620.07 rows=1 width=15)
              Sort Key: s.id, pkg.id
              ->  Nested Loop Anti Join  (cost=6081718.60..6082620.05 rows=1 width=15)
                    Join Filter: ((sc_2.server_id = s.id) AND (sasp.package_id = pkg.id))
                    ->  Nested Loop  (cost=5881.66..6259.27 rows=1 width=15)
                          Join Filter: ((((sp_evr.evr).type)::text = ((p_evr.evr).type)::text) AND (sp_evr.evr < p_evr.evr))
                          ->  Nested Loop  (cost=5881.37..6252.15 rows=1 width=62)
                                ->  Merge Anti Join  (cost=5881.08..5882.39 rows=21 width=41)
                                      Merge Cond: ((pkg.id = p.id) AND (s.id = s_1.id))
                                      ->  Sort  (cost=1746.20..1746.25 rows=21 width=41)
                                            Sort Key: pkg.id, sp.server_id
                                            ->  Nested Loop  (cost=89.57..1745.74 rows=21 width=41)
                                                  ->  Hash Join  (cost=89.41..1742.07 rows=57 width=51)
                                                        Hash Cond: (pkg.name_id = sp.name_id)
                                                        Join Filter: (sp.evr_id <> pkg.evr_id)
                                                        ->  Nested Loop Anti Join  (cost=0.85..1646.17 rows=279 width=3389)
                                                              Join Filter: (pkg.id = pp.package_id)
                                                              ->  Index Scan using rhn_package_id_pk on rhnpackage pkg  (cost=0.42..1482.50 rows=279 width=23)
                                                                    Index Cond: (id = ANY

Table row count:
rhnserverneededcache: 83201
rhnPackage: 469922
rhnServerPackage: 334833
system number: 501

Any idea for a temporary workaround?
Currently it blocks package installs, which is quite painful.