postgresql crash when call ```pg_query_state($pid,true,true,true,true,true)``` very freq.

Question

postgresql crash when call ```pg_query_state($pid,true,true,true,true,true)``` very freq.

digoal opened this issue 9 years ago · 8 comments

HI

when i set

pg_query_state.enable = on
pg_query_state.enable_buffers = off
pg_query_state.enable_timing = off

and test this case :

session a

do language plpgsql                                          
$$
declare
begin
for i in 1..100 loop
  perform 1 from pg_class;
  perform pg_sleep(10);
  perform t1.relname from pg_class t1 join pg_class t2 on (t1.oid=t2.oid) where t1.oid::int8 <100000;
end loop;
end;
$$;

session b
repeat run this query in psql:
select * from pg_query_state(22617,true,true,true,true,true);
\watch 0.000001

then database crashed.

when i set

pg_query_state.enable = on
pg_query_state.enable_buffers = on
pg_query_state.enable_timing = on

and test the same case, there is also has problem when call pg_query_state() function very freq.

                                                            Thu 08 Dec 2016 11:08:38 PM CST (every 1e-06s)

  pid  | frame_number |     query_text      |                                                    plan                                                     | leader_pid 
-------+--------------+---------------------+-------------------------------------------------------------------------------------------------------------+------------
 98079 |            0 | SELECT pg_sleep(10) | Result  (cost=0.00..0.01 rows=1 width=4) (Current loop: running time=3262.896 actual rows=0, loop number=1)+|           
       |              |                     |   Output: pg_sleep('10'::double precision)                                                                  | 
(1 row)

                                                            Thu 08 Dec 2016 11:08:38 PM CST (every 1e-06s)

  pid  | frame_number |     query_text      |                                                    plan                                                     | leader_pid 
-------+--------------+---------------------+-------------------------------------------------------------------------------------------------------------+------------
 98079 |            0 | SELECT pg_sleep(10) | Result  (cost=0.00..0.01 rows=1 width=4) (Current loop: running time=3263.361 actual rows=0, loop number=1)+|           
       |              |                     |   Output: pg_sleep('10'::double precision)                                                                  | 
(1 row)

......
then will crash

best regards,
digoal

Answer 1 · 2016-12-08T10:59:58.000Z

Hi, @digoal !

I have tested your case on stable version PostgreSQL 9.6 with PGPRO9_6 branch of pg_query_state and stable PostgreSQL 9.5 with PG9_5 branch and I could not reproduce failure. In my case postgres prints warnings: WARNING: timing statistics disabled and WARNING: buffers statistics disabled as it was intended by design.

Could you show last server logs before crash previously configuring and compiling postgres with --enable_cassert flag.

Answer 2 · 2016-12-08T13:49:17.000Z

i will compile with enable_cassert and test it again .

Answer 3 · 2016-12-08T14:21:24.000Z

On PostgreSQL 9.5 you can run python tests. The case that tests your problem exists there.

Unfortunately I haven't migrated those tests on version 9.6 yet. But in the future I going to do this.

Answer 4 · 2016-12-08T14:58:47.000Z

this the gdb's result when crashed.

#0  pg_query_state (fcinfo=Unhandled dwarf expression opcode 0xf3
) at pg_query_state.c:541
#1  0x00000000005ef39c in ExecMakeTableFunctionResult (funcexpr=0x1bbf220, econtext=0x1bbebe8, argContext=Unhandled dwarf expression opcode 0xf3
) at execQual.c:2211
#2  0x0000000000607342 in FunctionNext (node=0x1bbead0) at nodeFunctionscan.c:94
#3  0x00000000005f1569 in ExecScanFetch (node=0x1bbead0, accessMtd=0x607090 <FunctionNext>, recheckMtd=0x607080 <FunctionRecheck>) at execScan.c:95
#4  ExecScan (node=0x1bbead0, accessMtd=0x607090 <FunctionNext>, recheckMtd=0x607080 <FunctionRecheck>) at execScan.c:145
#5  0x00000000005e9e18 in ExecProcNode (node=0x1bbead0) at execProcnode.c:449
#6  0x00000000005e5cb6 in ExecutePlan (queryDesc=0x1bc1c38, direction=Unhandled dwarf expression opcode 0xf3
) at execMain.c:1567
#7  standard_ExecutorRun (queryDesc=0x1bc1c38, direction=Unhandled dwarf expression opcode 0xf3
) at execMain.c:338
#8  0x00007fb87485f8a5 in qs_ExecutorRun (queryDesc=0x1bc1c38, direction=ForwardScanDirection, count=0) at pg_query_state.c:312
#9  0x000000000070e7d8 in PortalRunSelect (portal=0x1bd6988, forward=Unhandled dwarf expression opcode 0xf3
) at pquery.c:948
#10 0x000000000070fe02 in PortalRun (portal=0x1bd6988, count=9223372036854775807, isTopLevel=Unhandled dwarf expression opcode 0xf3
) at pquery.c:789
#11 0x000000000070d74b in exec_simple_query (argc=Unhandled dwarf expression opcode 0xf3
) at postgres.c:1094
#12 PostgresMain (argc=Unhandled dwarf expression opcode 0xf3
) at postgres.c:4072
#13 0x000000000046b959 in BackendRun () at postmaster.c:4270
#14 BackendStartup () at postmaster.c:3944
#15 ServerLoop () at postmaster.c:1701
#16 0x00000000006a4f5d in PostmasterMain (argc=1, argv=Unhandled dwarf expression opcode 0xf3
) at postmaster.c:1309
#17 0x000000000046c83c in main (argc=1, argv=0x1ad4b50) at main.c:228

best regards,
digoal

Answer 5 · 2016-12-08T15:22:15.000Z

Does this problem pop up when you continuously invoke pg_query_state?

If so then this is familiar bug (race condition, more accurately). I have fixed it on PGPro 9.6 release but didn't not migrate fixes back to github. I will refresh that place by tomorrow.

Answer 6 · 2016-12-09T01:22:05.000Z

thank you,
Dose extensions in PGPro 9.6 updated in future will synchro in these stand alone repositions?

Answer 7 · 2016-12-09T17:05:56.000Z

I have pushed fixed version of pg_query_state. I have tested your problem case and haven't found any faults.

Thanks for your report. I'll add your test case in regression test suite.

Dose extensions in PGPro 9.6 updated in future will synchro in these stand alone repositions?

Yes, but these changes will be back integrated later when I will have a free time.

Answer 8 · 2016-12-10T02:42:28.000Z

thank you , I had test it, it's ok now.
best regards,
digoal