citusdata/citus

Flaky single_node_enterprise test

Green-Chan opened this issue · 2 comments

I have Ubuntu 24.04, 13.2.0, PostgreSQL REL_16_3 configured with options CFLAGS=" -Og" --enable-tap-tests --enable-debug --with-openssl --with-libxml --enable-cassert --with-icu --with-lz4, Citus main (9e1852e).

First of all, when running make check-enterprise, I get an assertion failure (see #7591). So I commented out that assertion:

--- a/src/backend/distributed/deparser/ruleutils_16.c
+++ b/src/backend/distributed/deparser/ruleutils_16.c
@@ -1589,7 +1589,7 @@ set_join_column_names(deparse_namespace *dpns, RangeTblEntry *rte,
                if (colinfo->is_new_col[col_index])
                        i++;
        }
-       Assert(i == colinfo->num_cols);
+       //Assert(i == colinfo->num_cols);
        Assert(j == nnewcolumns);
 #endif

Then I changed enterprise_schedule so that it runs the single_node_enterprise test ten times in a row:

test: single_node_enterprise
test: single_node_enterprise
test: single_node_enterprise
test: single_node_enterprise
test: single_node_enterprise
test: single_node_enterprise
test: single_node_enterprise
test: single_node_enterprise
test: single_node_enterprise
test: single_node_enterprise
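For anyone who wants a different repeat count, the lines above can be generated rather than pasted by hand. This is only a sketch: `repeat_test_line` is a hypothetical helper name, not something that exists in the Citus tree, and it simply prints to stdout.

```shell
# Hypothetical helper (not part of Citus): print COUNT schedule lines
# of the form "test: NAME", one per line.
repeat_test_line() {
    name="$1"
    count="$2"
    i=0
    while [ "$i" -lt "$count" ]; do
        printf 'test: %s\n' "$name"
        i=$((i + 1))
    done
}

# Print ten lines for single_node_enterprise, matching the schedule above.
repeat_test_line single_node_enterprise 10
```

Redirect the output into your local copy of enterprise_schedule once you have checked it; the more repetitions you add, the more likely a single run is to hit the flaky diff.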

Then I ran make check-enterprise, and some of these tests failed with the following diff:

--- /home/test/citus/src/test/regress/expected/single_node_enterprise.out.modified      2024-08-15 07:12:28.263667388 +0000
+++ /home/test/citus/src/test/regress/results/single_node_enterprise.out.modified       2024-08-15 07:12:28.275667634 +0000
@@ -465,28 +465,30 @@
NOTICE:  issuing /*{"cId":10,"tId":"101"}*/INSERT INTO single_node_ent.test_90730501 (x, y) VALUES (101, 100)
       INSERT INTO test(x,y) VALUES (102,100);
NOTICE:  issuing /*{"cId":10,"tId":"102"}*/INSERT INTO single_node_ent.test_90730502 (x, y) VALUES (102, 100)
       -- followed by a multi-shard command
       SELECT count(*) FROM test;
NOTICE:  issuing SELECT count(*) AS count FROM single_node_ent.test_90730501 test WHERE true
NOTICE:  issuing SELECT count(*) AS count FROM single_node_ent.test_90730502 test WHERE true
NOTICE:  issuing SELECT count(*) AS count FROM single_node_ent.test_90730503 test WHERE true
NOTICE:  issuing SELECT count(*) AS count FROM single_node_ent.test_90731504 test WHERE true
NOTICE:  issuing SELECT count(*) AS count FROM single_node_ent.test_90731505 test WHERE true
+NOTICE:  issuing BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;SELECT assign_distributed_transaction_id(0, 251, '2024-08-15 00:12:28.099834-07');
NOTICE:  issuing SELECT count(*) AS count FROM single_node_ent.test_90731506 test WHERE true
 count
-------
    53
(1 row)

ROLLBACK;
NOTICE:  issuing ROLLBACK
+NOTICE:  issuing ROLLBACK
-- should fail as only read access is allowed
SET ROLE read_access_single_node;
INSERT INTO test VALUES (1, 1, (95, 'citus9.5')::new_type);
ERROR:  permission denied for table test
SET ROLE postgres;
\c
SET search_path TO single_node_ent;
-- Cleanup
RESET citus.log_remote_commands;
SET client_min_messages TO WARNING;

I was able to reproduce the issue in the devcontainer environment.

When querying a view that was created before altering the schema of the underlying tables (specifically, after adding a new column), the server crashes with an assertion failure in the set_join_column_names function in ruleutils_16.c.

Here’s the relevant statement from the test case:

SELECT * FROM view_created_before_shard_moves;

These are the columns related to the view:

SELECT * FROM (test JOIN colocated_table USING (x)) foo(x, y, z)
LEFT JOIN ref ON foo.x = ref.a;

 x  |  y  |      z       |  y  |    z     | a  |  b
---------------------------------------------------------------------

Here’s the relevant part of the server log:

LOG:  colinfo->num_cols: 6, i: 7, j: 7, nnewcolumns: 7
TRAP: failed Assert("i == colinfo->num_cols"), File: "deparser/ruleutils_16.c", Line: 1614
  • colinfo->num_cols represents the number of columns in the join at the time the view was created (which is 6).
  • After adding a new column z to colocated_table, the actual number of columns becomes 7.
  • When the deparser tries to reconstruct the view, it expects i (the index of processed columns) to match colinfo->num_cols.
  • Since i increments to 7 due to the new column, but colinfo->num_cols remains at 6, the assertion Assert(i == colinfo->num_cols) fails, causing the crash.

The deparser doesn't seem to handle schema changes in underlying tables that affect views.

After recreating the view, the assertion no longer fails.

CREATE OR REPLACE VIEW view_created_before_shard_moves AS
SELECT count(*) AS count
FROM (test JOIN colocated_table USING (x)) AS foo
LEFT JOIN ref ON (foo.x = ref.a);

@m3hm3t thank you for the investigation! Please note that this issue is about the test still failing after the assertion failure has been worked around. I created a separate issue, #7591, for the assertion failure itself. Consider placing your comment there.