greenplum-db/postgres

concurrently update will generate huge buffer_content locks

digoal opened this issue · 3 comments

HI,
there is some testing for zedstore, when i use concurrent insert into on conflict test, here is some buffer_content locks
and qps=0;

postgres=# do language plpgsql $$
declare
  sql text;
begin
  sql := 'create table t_heap (id int8,';    
  for i in 1..10 loop
    sql := sql||'int'||i||' int,';
  end loop;
  for i in 1..10 loop
    sql := sql||'text'||i||' text,';
  end loop;
  sql := sql||'crt_time timestamp) using heap';    
  execute sql;
end;
$$;

postgres=# do language plpgsql $$
declare
  sql text;
begin
  sql := 'create table t_zedstore (id int8,';
  for i in 1..10 loop
    sql := sql||'int'||i||' int,';
  end loop;
  for i in 1..10 loop
    sql := sql||'text'||i||' text,';
  end loop;
  sql := sql||'crt_time timestamp) using zedstore';
  execute sql;
end;
$$;

postgres=# \d+ t_heap 
                                              Table "public.t_heap"
  Column  |            Type             | Collation | Nullable | Default | Storage  | Stats target | Description 
----------+-----------------------------+-----------+----------+---------+----------+--------------+-------------
 id       | bigint                      |           |          |         | plain    |              | 
 int1     | integer                     |           |          |         | plain    |              | 
 int2     | integer                     |           |          |         | plain    |              | 
 int3     | integer                     |           |          |         | plain    |              | 
 int4     | integer                     |           |          |         | plain    |              | 
 int5     | integer                     |           |          |         | plain    |              | 
 int6     | integer                     |           |          |         | plain    |              | 
 int7     | integer                     |           |          |         | plain    |              | 
 int8     | integer                     |           |          |         | plain    |              | 
 int9     | integer                     |           |          |         | plain    |              | 
 int10    | integer                     |           |          |         | plain    |              | 
 text1    | text                        |           |          |         | extended |              | 
 text2    | text                        |           |          |         | extended |              | 
 text3    | text                        |           |          |         | extended |              | 
 text4    | text                        |           |          |         | extended |              | 
 text5    | text                        |           |          |         | extended |              | 
 text6    | text                        |           |          |         | extended |              | 
 text7    | text                        |           |          |         | extended |              | 
 text8    | text                        |           |          |         | extended |              | 
 text9    | text                        |           |          |         | extended |              | 
 text10   | text                        |           |          |         | extended |              | 
 crt_time | timestamp without time zone |           |          |         | plain    |              | 
Access method: heap

postgres=# \d+ t_zedstore 
                                            Table "public.t_zedstore"
  Column  |            Type             | Collation | Nullable | Default | Storage  | Stats target | Description 
----------+-----------------------------+-----------+----------+---------+----------+--------------+-------------
 id       | bigint                      |           |          |         | plain    |              | 
 int1     | integer                     |           |          |         | plain    |              | 
 int2     | integer                     |           |          |         | plain    |              | 
 int3     | integer                     |           |          |         | plain    |              | 
 int4     | integer                     |           |          |         | plain    |              | 
 int5     | integer                     |           |          |         | plain    |              | 
 int6     | integer                     |           |          |         | plain    |              | 
 int7     | integer                     |           |          |         | plain    |              | 
 int8     | integer                     |           |          |         | plain    |              | 
 int9     | integer                     |           |          |         | plain    |              | 
 int10    | integer                     |           |          |         | plain    |              | 
 text1    | text                        |           |          |         | extended |              | 
 text2    | text                        |           |          |         | extended |              | 
 text3    | text                        |           |          |         | extended |              | 
 text4    | text                        |           |          |         | extended |              | 
 text5    | text                        |           |          |         | extended |              | 
 text6    | text                        |           |          |         | extended |              | 
 text7    | text                        |           |          |         | extended |              | 
 text8    | text                        |           |          |         | extended |              | 
 text9    | text                        |           |          |         | extended |              | 
 text10   | text                        |           |          |         | extended |              | 
 crt_time | timestamp without time zone |           |          |         | plain    |              | 
Access method: zedstore

\timing

insert into t_heap
select
id,
random()*10,
random()*100,
random()*1000,
random()*10000,
random()*100000,
random()*10,
random()*100,
random()*1000,
random()*10000,
random()*100000,
md5((random()*10)::int::text),
md5((random()*100)::int::text),
md5((random()*1000)::int::text),
md5((random()*10000)::int::text),
md5((random()*10)::int::text),
md5((random()*100)::int::text),
md5((random()*1000)::int::text),
md5((random()*10000)::int::text),
md5((random()*100)::int::text),
md5((random()*100)::int::text),
clock_timestamp()
from 
generate_series(1,10000000) t(id);

INSERT 0 10000000
Time: 55137.382 ms (00:55.137)

checkpoint;

insert into t_zedstore
select
id,
random()*10,
random()*100,
random()*1000,
random()*10000,
random()*100000,
random()*10,
random()*100,
random()*1000,
random()*10000,
random()*100000,
md5((random()*10)::int::text),
md5((random()*100)::int::text),
md5((random()*1000)::int::text),
md5((random()*10000)::int::text),
md5((random()*10)::int::text),
md5((random()*100)::int::text),
md5((random()*1000)::int::text),
md5((random()*10000)::int::text),
md5((random()*100)::int::text),
md5((random()*100)::int::text),
clock_timestamp()
from 
generate_series(1,10000000) t(id);

INSERT 0 10000000
Time: 156857.767 ms (02:36.858)


postgres=# \dt+
                              List of relations
 Schema |    Name    | Type  |  Owner   | Persistence |  Size   | Description 
--------+------------+-------+----------+-------------+---------+-------------
 public | t_heap     | table | postgres | permanent   | 4113 MB | 
 public | t_zedstore | table | postgres | permanent   | 1941 MB | 
(2 rows)


统计分析测试:

postgres=# explain select int1,count(*) from t_heap group by 1;
                              QUERY PLAN                               
-----------------------------------------------------------------------
 HashAggregate  (cost=676315.01..676315.12 rows=11 width=12)
   Group Key: int1
   ->  Seq Scan on t_heap  (cost=0.00..626315.34 rows=9999934 width=4)
(3 rows)

Time: 0.405 ms
postgres=# explain select int1,count(*) from t_zedstore group by 1;
                                QUERY PLAN                                 
---------------------------------------------------------------------------
 HashAggregate  (cost=151860.92..151861.03 rows=11 width=12)
   Group Key: int1
   ->  Seq Scan on t_zedstore  (cost=0.00..101861.95 rows=9999795 width=4)
(3 rows)

Time: 0.155 ms
postgres=# select int1,count(*) from t_zedstore group by 1;
 int1 |  count  
------+---------
    5 |  999956
    4 | 1000395
   10 |  501530
    6 |  998901
    2 |  999192
    7 | 1000936
    1 |  999489
    8 |  998940
    9 | 1000325
    3 | 1000801
    0 |  499535
(11 rows)

Time: 1343.633 ms (00:01.344)

postgres=# select text1,count(*) from t_zedstore group by 1;
              text1               |  count  
----------------------------------+---------
 c4ca4238a0b923820dcc509a6f75849b | 1001059
 8f14e45fceea167a5a36dedd4bea2543 |  999705
 d3d9446802a44259755d38e6d163e820 |  499813
 1679091c5a880faf6fb5e6087eb1b2dc | 1001648
 e4da3b7fbbce2345d7772b0674a318d5 |  999392
 eccbc87e4b5ce2fe28308fd9f2a7baf3 |  999374
 cfcd208495d565ef66e7dff9f98764da |  501183
 c9f0f895fb98ab9159f51fd0297e236d | 1001479
 45c48cce2e2d7fbdea1afc51c7c6ad26 |  998835
 a87ff679a2f3e71d9181a67b7542122c |  999037
 c81e728d9d4c2f636f067f89cc14862c |  998475
(11 rows)

Time: 1863.111 ms (00:01.863)

postgres=# select int1,count(*) from t_heap group by 1;
 int1 |  count  
------+---------
    5 | 1000662
    4 |  999872
   10 |  499901
    6 | 1000942
    2 | 1000378
    7 |  999201
    1 | 1001494
    8 |  998922
    9 |  998764
    3 |  999863
    0 |  500001
(11 rows)

Time: 2655.144 ms (00:02.655)

postgres=# select text1,count(*) from t_heap group by 1;
              text1               |  count  
----------------------------------+---------
 c4ca4238a0b923820dcc509a6f75849b |  999925
 8f14e45fceea167a5a36dedd4bea2543 | 1000277
 d3d9446802a44259755d38e6d163e820 |  497869
 1679091c5a880faf6fb5e6087eb1b2dc | 1000732
 e4da3b7fbbce2345d7772b0674a318d5 |  999516
 eccbc87e4b5ce2fe28308fd9f2a7baf3 | 1002371
 cfcd208495d565ef66e7dff9f98764da |  500511
 c9f0f895fb98ab9159f51fd0297e236d |  996678
 45c48cce2e2d7fbdea1afc51c7c6ad26 | 1000367
 a87ff679a2f3e71d9181a67b7542122c | 1001158
 c81e728d9d4c2f636f067f89cc14862c | 1000596
(11 rows)

Time: 3062.112 ms (00:03.062)


alter table t_heap add constraint pk primary key(id);
alter table t_zedstore add constraint pk1 primary key(id);

postgres=# \di+
                                 List of relations
 Schema | Name | Type  |  Owner   |   Table    | Persistence |  Size  | Description 
--------+------+-------+----------+------------+-------------+--------+-------------
 public | pk   | index | postgres | t_heap     | permanent   | 214 MB | 
 public | pk1  | index | postgres | t_zedstore | permanent   | 214 MB | 
(2 rows)


大宽表明细KV查询测试:

vi test1.sql

\set id random(1,10000000)
select * from t_heap where id=:id;
select * from t_zedstore where id=:id;

pgbench -M prepared -n -r -P 1 -f ./test1.sql -c 52 -j 52 -T 120

transaction type: ./test1.sql
scaling factor: 1
query mode: prepared
number of clients: 52
number of threads: 52
duration: 120 s
number of transactions actually processed: 6932745
latency average = 0.900 ms
latency stddev = 0.176 ms
tps = 57772.190636 (including connections establishing)
tps = 57774.347762 (excluding connections establishing)
statement latencies in milliseconds:
         0.001  \set id random(1,10000000)
         0.061  select * from t_heap where id=:id;
         0.838  select * from t_zedstore where id=:id;

         
高并发更新测试:

vi test2.sql

\set id random(1,100000000)
insert into t_heap values (:id,
random()*10,
random()*100,
random()*1000,
random()*10000,
random()*100000,
random()*10,
random()*100,
random()*1000,
random()*10000,
random()*100000,
md5((random()*10)::int::text),
md5((random()*100)::int::text),
md5((random()*1000)::int::text),
md5((random()*10000)::int::text),
md5((random()*10)::int::text),
md5((random()*100)::int::text),
md5((random()*1000)::int::text),
md5((random()*10000)::int::text),
md5((random()*100)::int::text),
md5((random()*100)::int::text),
clock_timestamp()
) on conflict (id) 
do update set 
int1=excluded.int1,
int2=excluded.int2,
int3=excluded.int3,
int4=excluded.int4,
int5=excluded.int5,
int6=excluded.int6,
int7=excluded.int7,
int8=excluded.int8,
int9=excluded.int9,
int10=excluded.int10,
text1=excluded.text1,
text2=excluded.text2,
text3=excluded.text3,
text4=excluded.text4,
text5=excluded.text5,
text6=excluded.text6,
text7=excluded.text7,
text8=excluded.text8,
text9=excluded.text9,
text10=excluded.text10,
crt_time=excluded.crt_time;
insert into t_zedstore values (:id,
random()*10,
random()*100,
random()*1000,
random()*10000,
random()*100000,
random()*10,
random()*100,
random()*1000,
random()*10000,
random()*100000,
md5((random()*10)::int::text),
md5((random()*100)::int::text),
md5((random()*1000)::int::text),
md5((random()*10000)::int::text),
md5((random()*10)::int::text),
md5((random()*100)::int::text),
md5((random()*1000)::int::text),
md5((random()*10000)::int::text),
md5((random()*100)::int::text),
md5((random()*100)::int::text),
clock_timestamp()
) on conflict (id) 
do update set 
int1=excluded.int1,
int2=excluded.int2,
int3=excluded.int3,
int4=excluded.int4,
int5=excluded.int5,
int6=excluded.int6,
int7=excluded.int7,
int8=excluded.int8,
int9=excluded.int9,
int10=excluded.int10,
text1=excluded.text1,
text2=excluded.text2,
text3=excluded.text3,
text4=excluded.text4,
text5=excluded.text5,
text6=excluded.text6,
text7=excluded.text7,
text8=excluded.text8,
text9=excluded.text9,
text10=excluded.text10,
crt_time=excluded.crt_time;

pgbench -M prepared -n -r -P 1 -f ./test2.sql -c 52 -j 52 -T 120
postgres=# select wait_event from pg_stat_activity ;
     wait_event      
---------------------
 AutoVacuumMain
 LogicalLauncherMain
 buffer_content
 buffer_content
 buffer_content
 buffer_content
 buffer_content
 buffer_content
 buffer_content
 buffer_content
 page
 buffer_content
 buffer_content
 page
 page
 buffer_content
 buffer_content
 page
 buffer_content
 buffer_content
 page
 buffer_content
 buffer_content
 buffer_content
 buffer_content
 buffer_content
 buffer_content
 buffer_content
 buffer_content
 buffer_content
 buffer_content
 buffer_content
 buffer_content
 buffer_content
 page
 buffer_content
 page
 buffer_content
 buffer_content
 buffer_content
 buffer_content
 buffer_content
 buffer_content
 buffer_content
 page
 buffer_content
 buffer_content
 buffer_content
 buffer_content
 buffer_content
 buffer_content
 buffer_content
 buffer_content
 buffer_content
 
 BgWriterHibernate
 CheckpointerMain
 WalWriterMain

br, digoal

[root@iZbp135pwcjjoxqgfpw9k1Z ~]# pstack 41065
#0  0x00007f5441b9ae43 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1  0x000000000075888e in WaitEventSetWait ()
#2  0x0000000000758ce9 in WaitLatchOrSocket ()
#3  0x00000000007701df in ProcSleep ()
#4  0x0000000000764cef in WaitOnLock ()
#5  0x0000000000766076 in LockAcquireExtended ()
#6  0x0000000000763f46 in LockPage ()
#7  0x000000000052d0e3 in zsundo_trim ()
#8  0x000000000052e32e in zsundo_get_oldest_undo_ptr ()
#9  0x0000000000520b95 in zsbt_tid_begin_scan ()
#10 0x000000000052a54f in zedstoream_fetch_row ()
#11 0x000000000052a9c1 in zedstoream_index_fetch_tuple ()
#12 0x00000000004d29b2 in index_fetch_heap ()
#13 0x00000000004d2a1b in index_getnext_slot ()
#14 0x000000000062cd2b in check_exclusion_or_unique_constraint ()
#15 0x000000000062d73e in ExecCheckIndexConstraints ()
#16 0x0000000000653de6 in ExecInsert ()
#17 0x0000000000655219 in ExecModifyTable ()
#18 0x000000000062e092 in standard_ExecutorRun ()
#19 0x000000000077c02a in ProcessQuery ()
#20 0x000000000077c258 in PortalRunMulti ()
#21 0x000000000077cc6d in PortalRun ()
#22 0x000000000077a902 in PostgresMain ()
#23 0x0000000000482278 in ServerLoop ()
#24 0x0000000000709d63 in PostmasterMain ()
#25 0x0000000000482ebe in main ()
[root@iZbp135pwcjjoxqgfpw9k1Z ~]# pstack 41063
#0  0x00007f544279aafb in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007f544279ab8f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f544279ac2b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00000000006f7f52 in PGSemaphoreLock ()
#4  0x0000000000769a5c in LWLockAcquire ()
#5  0x000000000051c8c3 in zsbt_descend ()
#6  0x0000000000521143 in zsbt_tid_multi_insert ()
#7  0x0000000000528010 in zedstoream_insert_speculative ()
#8  0x000000000065417f in ExecInsert ()
#9  0x0000000000655219 in ExecModifyTable ()
#10 0x000000000062e092 in standard_ExecutorRun ()
#11 0x000000000077c02a in ProcessQuery ()
#12 0x000000000077c258 in PortalRunMulti ()
#13 0x000000000077cc6d in PortalRun ()
#14 0x000000000077a902 in PostgresMain ()
#15 0x0000000000482278 in ServerLoop ()
#16 0x0000000000709d63 in PostmasterMain ()
#17 0x0000000000482ebe in main ()

Thank you for the report and for trying out Zedstore! I haven't had the chance to try out your reproduction yet. Do you see process 41065 or 41063 hanging indefinitely?

Also, what commit of Zedstore did you try this on?

From the stack trace it seems that they are both waiting on the meta page. The second stack trace (41063) seems to have a frame missing perhaps due to compiler optimizations? (there is no direct call to LWLockAcquire() inside zsbt_descend()), I would suspect that the missing frame is zsmeta_get_root_for_attribute() where there is an attempt to lock the meta page, which the other process is trying to lock (41065).
Having encountered scenarios in the code where we are trying to acquire a lock previously held, it is highly possible that one of the two backends is trying to acquire a lock that it already has on the meta page. A LWLockHeldByMe() call can help see that.

thank you,
this commit i used: 7b24bea
environment is aliyun's ECS CentOS 7.7 x64