powa-team/pg_stat_kcache

Segmentation fault

Closed this issue · 34 comments

tbe commented

On one of our systems, pg_stat_kcache triggers a segfault:

PG: 11.8, pg_stat_kcache: 2.1.2

(gdb) bt
#0  0x000000000085c3d7 in get_hash_entry (freelist_idx=0, hashp=0xd74d50) at dynahash.c:1322
#1  hash_search_with_hash_value (hashp=0xd74d50, keyPtr=keyPtr@entry=0x7fffffffd360, hashvalue=492298745, action=action@entry=HASH_ENTER, foundPtr=foundPtr@entry=0x7fffffffd270) at dynahash.c:1055
#2  0x000000000085c46d in hash_search (hashp=<optimized out>, keyPtr=keyPtr@entry=0x7fffffffd360, action=action@entry=HASH_ENTER, foundPtr=foundPtr@entry=0x7fffffffd270) at dynahash.c:911
#3  0x00007fffe8834c85 in pgsk_entry_alloc (key=key@entry=0x7fffffffd360) at pg_stat_kcache.c:634
#4  0x00007fffe8835640 in pgsk_entry_store (queryId=<optimized out>, counters=...) at pg_stat_kcache.c:587
#5  pgsk_ExecutorEnd (queryDesc=0xddfd40) at pg_stat_kcache.c:806
#6  0x00000000005bdb24 in PortalCleanup (portal=<optimized out>) at portalcmds.c:301
#7  0x00000000008797aa in PortalDrop (portal=0xe19970, isTopCommit=<optimized out>) at portalmem.c:499
#8  0x0000000000746008 in exec_simple_query (query_string=0xd783e0 "SELECT datname FROM pg_database WHERE datistemplate = false;") at postgres.c:1155
#9  0x00000000007472ca in PostgresMain (argc=<optimized out>, argv=argv@entry=0xdc5a90, dbname=0xdc59b8 "postgres", username=<optimized out>) at postgres.c:4193
#10 0x00000000004807f6 in BackendRun (port=0xdc3420) at postmaster.c:4364
#11 BackendStartup (port=0xdc3420) at postmaster.c:4036
#12 ServerLoop () at postmaster.c:1707
#13 0x00000000006d9fd9 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0xd72d20) at postmaster.c:1380
#14 0x000000000048167f in main (argc=5, argv=0xd72d20) at main.c:228

This happens on a hot-standby system.

rjuju commented

Hello, and thanks for the report. I see that there's an issue in the shared memory size computation :(
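For context, the amount of shared memory the module asks for at load time is estimated roughly as in the sketch below (simplified; the struct fields and names are illustrative rather than the literal pg_stat_kcache source). If that estimate is too small, the hash table later grows into memory that belongs to other structures.

```c
#include "postgres.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"
#include "storage/spin.h"
#include "utils/hsearch.h"

/* Illustrative stand-ins for the real pg_stat_kcache types. */
typedef struct pgskHashKey
{
	Oid			userid;
	Oid			dbid;
	uint64		queryid;
} pgskHashKey;

typedef struct pgskEntry
{
	pgskHashKey key;
	double		utime;			/* the real code stores many more counters */
	double		stime;
	slock_t		mutex;
} pgskEntry;

typedef struct pgskSharedState
{
	LWLock	   *lock;
} pgskSharedState;

static int	pgsk_max;			/* max # of entries to track */

/*
 * Estimate the shared memory needed: the fixed state struct plus a hash
 * table sized for pgsk_max entries.  If the entry size (or entry count)
 * passed to hash_estimate_size() is smaller than what the hash table is
 * later created with, the table ends up stealing shared memory that
 * belongs to other structures once it grows.
 */
static Size
pgsk_memsize(void)
{
	Size		size;

	size = MAXALIGN(sizeof(pgskSharedState));
	size = add_size(size, hash_estimate_size(pgsk_max, sizeof(pgskEntry)));

	return size;
}
```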

tbe commented

Do you need any help, additional info, or configuration settings? The strange thing is that we only have an issue on this one cluster; all other clusters with the same versions are running stably.

rjuju commented

Commit 19e1d44 should fix the issue.

I tried to reproduce the problem locally, but I've been unable to do so so far. If you have a test environment where you already hit the issue, could you check whether it's now OK?

rjuju commented

The strange thing is that we only have an issue on this one cluster; all other clusters with the same versions are running stably.

This isn't especially surprising. You can probably only hit this issue if you have almost all possible connections active and if all other hash tables in shared memory have already expanded to their full size (by default the lock table and some other hash tables are only initialized to half of their total size, so you need to acquire enough locks at the same time to grow them).
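For illustration, this is roughly how such a shared hash table is created with an initial size below its maximum (a generic sketch, not code from pg_stat_kcache or the lock manager):

```c
#include "postgres.h"
#include "storage/shmem.h"
#include "utils/hsearch.h"

/*
 * A shared hash table created with init_size < max_size: only half of the
 * entries are allocated up front, and the rest is carved out of the spare
 * shared memory on demand as the table grows.  Once all such tables have
 * grown to their maximum, any module that under-estimated its own shared
 * memory request has nothing left to grow into, and get_hash_entry() ends
 * up failing or touching memory it doesn't own.
 */
static HTAB *
create_demo_shared_hash(long max_size)
{
	HASHCTL		info;

	memset(&info, 0, sizeof(info));
	info.keysize = sizeof(uint32);
	info.entrysize = sizeof(uint32) * 4;

	return ShmemInitHash("demo hash",	/* table name in shared memory */
						 max_size / 2,	/* init_size: half allocated up front */
						 max_size,		/* max_size: hard cap */
						 &info,
						 HASH_ELEM | HASH_BLOBS);
}
```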

tbe commented

The change does not solve the issue here.

In the meantime, we have discovered that the master also has an issue: it tries to allocate HUGE amounts of memory:

2020-07-15 11:43:24 CEST [17884]: [6-1] user=postgres,db=postgres_backup,xid=0,vxid=32/3,app=[unknown] STATEMENT:  SELECT "pg_stat_io_user_tables"."schemaname", "pg_stat_io_user_tables"."relname", "pg_stat_io_user_tables"."heap_blks_read
", "pg_stat_io_user_tables"."heap_blks_hit", "pg_stat_io_user_tables"."idx_blks_read", "pg_stat_io_user_tables"."idx_blks_hit", "pg_stat_io_user_tables"."toast_blks_read", "pg_stat_io_user_tables"."toast_blks_hit", "pg_stat_io_user_table
s"."tidx_blks_read", "pg_stat_io_user_tables"."tidx_blks_hit" FROM pg_statio_user_tables AS "pg_stat_io_user_tables" WHERE (schemaname NOT LIKE 'pg_temp_%')
2020-07-15 11:43:24 CEST [17879]: [3-1] user=postgres,db=usermgmt,xid=0,vxid=24/2,app=[unknown] ERROR:  invalid memory alloc request size 16913468333384398648

We cannot test the fix on this side at this time, as we would have to restart the master for that.

You can probably only hit this issue if you have almost all possible connections active and if all other hash tables in shared memory have already expanded to their full size (by default the lock table and some other hash tables are only initialized to half of their total size, so you need to acquire enough locks at the same time to grow them).

The segmentation fault happens on the standby, where nearly no connections are open, except for our Prometheus exporter and monitoring.

What we can see in the backtrace is that the nentries member of the hctl seems corrupt:

{freeList = {{mutex = 56 '8', nentries = 8524192900200068020, freeList = 0xb2bd2eb1a7292638}, {mutex = 56 '8', nentries = -5567242223811484104, freeList = 0x764c04479554f3b4}, {mutex = 56 '8', nentries = 8524192900200068020, 
      freeList = 0xb2bd2eb1a7292638}, {mutex = 56 '8', nentries = 8524192900200068020, freeList = 0xb2bd2eb1a7292638}, {mutex = 56 '8', nentries = -5567242223811484104, freeList = 0x764c04479554f3b4}, {mutex = 180 '\264', 
      nentries = 8524192900200068020, freeList = 0x764c04479554f3b4}, {mutex = 56 '8', nentries = 8524192900200068020, freeList = 0x764c04479554f3b4}, {mutex = 180 '\264', nentries = 8524192900200068020, freeList = 0x764c04479554f3b4}, 
    {mutex = 180 '\264', nentries = 8524192900200068020, freeList = 0x764c04479554f3b4}, {mutex = 180 '\264', nentries = 0, freeList = 0x0}, {mutex = 0 '\000', nentries = 0, freeList = 0x0} <repeats 22 times>}, dsize = 256, nsegs = 32, 
  max_bucket = 8191, high_mask = 16383, low_mask = 8191, keysize = 16, entrysize = 128, num_partitions = 0, ffactor = 1, max_dsize = 256, ssize = 256, sshift = 8, nelem_alloc = 56}
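For reference, the fields dumped above correspond roughly to this structure from src/backend/utils/hash/dynahash.c (PostgreSQL 11, abbreviated):

```c
/* Abbreviated from src/backend/utils/hash/dynahash.c (PostgreSQL 11) */
#define NUM_FREELISTS 32

typedef struct
{
	slock_t		mutex;			/* spinlock for this freelist */
	long		nentries;		/* number of entries in associated buckets */
	HASHELEMENT *freeList;		/* chain of free elements */
} FreeListData;

struct HASHHDR
{
	FreeListData freeList[NUM_FREELISTS];	/* per-partition freelists */

	/* These fields can change, but not in a partitioned table */
	long		dsize;			/* directory size */
	long		nsegs;			/* number of allocated segments (<= dsize) */
	uint32		max_bucket;		/* ID of maximum bucket in use */
	uint32		high_mask;		/* mask to modulo into entire table */
	uint32		low_mask;		/* mask to modulo into lower half of table */

	/* These fields are fixed at hashtable creation */
	Size		keysize;		/* hash key length in bytes */
	Size		entrysize;		/* total user element size in bytes */
	long		num_partitions; /* # partitions (must be power of 2), or 0 */
	long		ffactor;		/* target fill factor */
	long		max_dsize;		/* 'dsize' limit if directory is fixed size */
	long		ssize;			/* segment size --- must be power of 2 */
	int			sshift;			/* segment shift = log2(ssize) */
	int			nelem_alloc;	/* number of entries to allocate at once */
};
```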

tbe commented

We cannot test the fix on this side at this time, as we would have to restart the master for that.

We decided to give it a try and tested this as a hotfix on the master. The above memory allocation errors are gone; we will monitor it a little bit longer. So that fixed half of our issue.

We are analyzing this issue with the support of @anse1 on the affected system.

rjuju commented

The change does not solve the issue here.

Just to be sure, did you restart postgres after compiling the new binary?

In the meantime, we have discovered that the master also has an issue: it tries to allocate HUGE amounts of memory:

This looks like a corrupted field somewhere. Is the error reproducible, and if so, can you get the error message after first running SET log_error_verbosity = verbose;?

What we can see in the backtrace is that the nentries member of the hctl seems corrupt:

Indeed. The fact that the allocated shared memory size was previously too low could lead pg_stat_kcache to overwrite memory it doesn't own.

tbe commented

First:

We decided to give it a try and tested this as a hotfix on the master. The above memory allocation errors are gone; we will monitor it a little bit longer.

The error happens again. log_error_verbosity exposes just:

2020-07-15 14:39:06 CEST [11695]: [35-1] user=app_vault,db=vault,xid=754720649,vxid=10/302,app=[unknown] ERROR:  XX000: invalid memory alloc request size 12922605294342592712
2020-07-15 14:39:06 CEST [11695]: [36-1] user=app_vault,db=vault,xid=754720649,vxid=10/302,app=[unknown] LOCATION:  palloc, mcxt.c:934

Just to be sure, did you restart postgres after compiling the new binary?

Yes, we start the instance in the foreground for each test.

Indeed. The fact that the allocated shared memory size was previously too low could lead pg_stat_kcache to overwrite memory it doesn't own.

~~Yeah, but the other members of the struct are most likely valid.~~ EDIT: No, they are corrupt too; the outer struct is valid, but everything in hctl seems random.

Due to a hypervisor crash last week, it is possible that we have corruption in the catalog (I don't trust VMware ...). But the crashes and bad allocations are only triggered if pg_stat_kcache is loaded.

tbe commented

The memory allocation errors are most likely triggered here: https://github.com/powa-team/pg_stat_kcache/blob/master/pg_stat_kcache.c#L686

If the nentries field is corrupted, this will try to allocate an insane amount of memory.
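If that spot is the usual entry-eviction path (the same pattern as pg_stat_statements' entry_dealloc()), the allocation would look roughly like this (illustrative sketch, not the exact pg_stat_kcache code):

```c
#include "postgres.h"
#include "utils/hsearch.h"

/* Illustrative stand-ins for the real pg_stat_kcache types. */
typedef struct pgskHashKey
{
	Oid			userid;
	Oid			dbid;
	uint64		queryid;
} pgskHashKey;

typedef struct pgskEntry
{
	pgskHashKey key;
	int64		usage;			/* counters omitted */
} pgskEntry;

static HTAB *pgsk_hash;			/* shared hash table, set up elsewhere */

/*
 * Eviction sketch modeled on pg_stat_statements' entry_dealloc().  The
 * palloc() size comes straight from the nentries counter kept in shared
 * memory, so a corrupted counter becomes an "invalid memory alloc request
 * size ..." error like the ones in the log above.
 */
static void
pgsk_entry_dealloc(void)
{
	HASH_SEQ_STATUS hash_seq;
	pgskEntry **entries;
	pgskEntry  *entry;
	int			nvictims;
	int			i = 0;

	entries = palloc(hash_get_num_entries(pgsk_hash) * sizeof(pgskEntry *));

	hash_seq_init(&hash_seq, pgsk_hash);
	while ((entry = hash_seq_search(&hash_seq)) != NULL)
		entries[i++] = entry;

	/* the real code sorts by usage first; here we simply drop ~5% of entries */
	nvictims = Max(10, i / 20);
	nvictims = Min(nvictims, i);

	for (i = 0; i < nvictims; i++)
		hash_search(pgsk_hash, &entries[i]->key, HASH_REMOVE, NULL);

	pfree(entries);
}
```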

I have verified the pg_stat_kcache stat file to make sure there is no corruption there:

$ hexdump pg_stat/pg_stat_kcache.stat 
0000000 6e10 0d75 0000 0000                    
0000008

BTW:
The while loop in https://github.com/powa-team/pg_stat_kcache/blob/master/pg_stat_kcache.c#L630 compares pgsk_max (int) with the result of hash_get_num_entries (long).

rjuju commented

I just found another issue which would explain the symptoms you have, fixed with 9f36117. I went through the rest of the code and didn't spot any other issue.

tbe commented

I just found another issue which would explain the symptoms you have, fixed with 9f36117. I went through the rest of the code and didn't spot any other issue.

We'll give it a try.

rjuju commented

BTW:
The while loop in https://github.com/powa-team/pg_stat_kcache/blob/master/pg_stat_kcache.c#L630 compares pgsk_max (int) with the result of hash_get_num_entries (long).

Indeed, but this shouldn't be an issue, as pgsk_max can't exceed INT_MAX. FTR, this is using the same code as upstream pg_stat_statements:

static int	pgss_max;			/* max # statements to track */
[...]
static pgssEntry *
entry_alloc(pgssHashKey *key, Size query_offset, int query_len, int encoding,
			bool sticky)
{
	pgssEntry  *entry;
	bool		found;

	/* Make space if needed */
	while (hash_get_num_entries(pgss_hash) >= pgss_max)
		entry_dealloc();
[...]
tbe commented

I just found another issue which would explain the symptoms you have, fixed with 9f36117. I went through the rest of the code and didn't spot any other issue.

Looks good on the standby at the moment. We'll keep monitoring it a little bit longer, but as the original issue happened within a few seconds of startup, it looks very promising.

tbe commented

Still looks good. Thank you for the fast solution. Closing this issue for now; we will test on the master tomorrow and reopen if needed.

When will this be pushed to the master branch?

rjuju commented

This is already pushed to the master branch, so you can use the fix right now. I'll wait until tomorrow for @tbe to confirm that there's still no issue, and then push a new release.

rjuju commented

Closing this issue for now; we will test on the master tomorrow and reopen if needed.

Thanks a lot @tbe !

NikolayS commented

Also got it on CentOS 7 and Postgres 9.6 (it doesn't matter which minor version; saw it on 9.6.3, 9.6.12 and 9.6.18). Once the package was updated (to pg_stat_kcache96-2.1.2-1.rhel7.x86_64; the SQL in the databases remained on 2.1.1), we started to get a bunch of various errors:

  1. ERROR: invalid memory alloc request size 16032070410878933784 (various values here)
  2. sometimes PANIC: hash table "pg_stat_kcache hash" corrupted
  3. finally, not that often, but sometimes segfaults. So far I've noticed 3 slightly different cases, all failing in dynahash.c (unfortunately, I don't have full backtraces right now):
Cannot access memory at address 0x7f1a72ea1128
Cannot access memory at address 0x7f1a72ea1120
Failed to read a valid object file image from memory.
Core was generated by `postgres: ***(55554)           '.
Program terminated with signal 11, Segmentation fault.
#0  hash_search_with_hash_value (hashp=0xca3338, hashp@entry=<error reading variable: Cannot access memory at address 0x7ffd021ac418>,
    keyPtr=<error reading variable: Cannot access memory at address 0x7ffd021ac398>, keyPtr@entry=<error reading variable: Cannot access memory at address 0x7ffd021ac418>, hashvalue=313889387,
    hashvalue@entry=<error reading variable: Cannot access memory at address 0x7ffd021ac418>, action=<error reading variable: Cannot access memory at address 0x7ffd021ac3a0>,
    action@entry=<error reading variable: Cannot access memory at address 0x7ffd021ac418>, foundPtr=<error reading variable: Cannot access memory at address 0x7ffd021ac3a8>,
    foundPtr@entry=<error reading variable: Cannot access memory at address 0x7ffd021ac418>) at dynahash.c:948

and

#0  hash_search_with_hash_value (hashp=0xc70dc8, hashp@entry=<error reading variable: Cannot access memory at address 0x7ffd021ac318>,
    keyPtr=<error reading variable: Cannot access memory at address 0x7ffd021ac298>, keyPtr@entry=<error reading variable: Cannot access memory at address 0x7ffd021ac318>, hashvalue=4209549525,
    hashvalue@entry=<error reading variable: Cannot access memory at address 0x7ffd021ac318>, action=<error reading variable: Cannot access memory at address 0x7ffd021ac2a0>,
    action@entry=<error reading variable: Cannot access memory at address 0x7ffd021ac318>, foundPtr=<error reading variable: Cannot access memory at address 0x7ffd021ac2a8>,
    foundPtr@entry=<error reading variable: Cannot access memory at address 0x7ffd021ac318>) at dynahash.c:954
954	dynahash.c: No such file or directory.

and

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: ***(22976)'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000007d88bd in get_hash_entry (freelist_idx=0, hashp=0xca3848) at    
1283    dynahash.c: No such file or directory.

Interestingly, dropping the extension didn't help -- only complete removal (from shared_preload_libraries).

As already discussed with @rjuju on Twitter, I was also able to reliably reproduce the problem using the same approach as described in powa-team/pg_qualstats#20 (running pgbench with -c200).

@tbe I checked the open issues in the project this morning and didn't see any. This bug hurts a lot, and it would be good if the issue remained open for a while for better visibility. As it happened, it was in the "Open" state for only 4 hours before being closed, even though the bugfix has not yet been released and, as I understand it, is not yet fully confirmed.

@rjuju -- thank you for sending me the direct link to this. I'll try to find time for testing against the master version.

rjuju commented

@NikolayS agreed, I just reopened the issue until the release is done.

Interestingly, dropping the extension didn't help -- only complete removal (from shared_preload_libraries).

Yes, that's unfortunately intended. Creating/dropping the extension only handles the SQL wrappers used to access the shared memory content. Having it in shared_preload_libraries is what makes the rest of the code (the hooks and all the processing) run.
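Roughly, the module's _PG_init() looks like the sketch below (simplified, with the bodies stubbed out and some names illustrative): everything in it runs only when the library is loaded via shared_preload_libraries, and none of it is undone by DROP EXTENSION.

```c
#include "postgres.h"
#include "fmgr.h"
#include "miscadmin.h"
#include "executor/executor.h"
#include "storage/ipc.h"
#include "storage/lwlock.h"

PG_MODULE_MAGIC;

static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
static ExecutorEnd_hook_type prev_ExecutorEnd = NULL;

static void
pgsk_shmem_startup(void)
{
	if (prev_shmem_startup_hook)
		prev_shmem_startup_hook();
	/* the real code creates/attaches the shared hash table here */
}

static void
pgsk_ExecutorEnd(QueryDesc *queryDesc)
{
	/* the real code accumulates getrusage() counters here */
	if (prev_ExecutorEnd)
		prev_ExecutorEnd(queryDesc);
	else
		standard_ExecutorEnd(queryDesc);
}

static Size
pgsk_memsize(void)
{
	/* placeholder; the real estimate uses hash_estimate_size(), see earlier */
	return (Size) 8192;
}

/*
 * Simplified sketch of _PG_init().  This only runs when the library is in
 * shared_preload_libraries; CREATE/DROP EXTENSION merely adds or removes
 * the SQL-visible wrapper functions and does not touch the shared memory
 * or the installed hooks.
 */
void
_PG_init(void)
{
	if (!process_shared_preload_libraries_in_progress)
		return;

	/* reserve shared memory and a named LWLock tranche */
	RequestAddinShmemSpace(pgsk_memsize());
	RequestNamedLWLockTranche("pg_stat_kcache", 1);

	/* install hooks: this is what keeps the module active at run time */
	prev_shmem_startup_hook = shmem_startup_hook;
	shmem_startup_hook = pgsk_shmem_startup;
	prev_ExecutorEnd = ExecutorEnd_hook;
	ExecutorEnd_hook = pgsk_ExecutorEnd;
}
```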

tbe commented

@rjuju the master branch has now been running on the master for some time without any issue. Looks good.

rjuju commented

Thanks for confirming @tbe! I'll wait for confirmation from @NikolayS too and push a release!

anse1 commented
rjuju commented

Thanks a lot @anse1 !

NikolayS commented

@rjuju I've also checked 9f36117: no errors.

rjuju commented

Thanks a lot @NikolayS and everyone!

I just pushed v2.1.3!

@rjuju any info on the ETA of the 2.1.3 RPM?

rjuju commented

I think I already took care of everything needed on the repository side.

@devrimgunduz do you need anything else on the repo for the new RPM?

I am unable to work because of surgery. I will try pushing the RPMs on Friday.

MichaelDBA commented

I think I just got the same problem, using the latest pg_stat_kcache version, 2.1.2, with PG 12.3 on CentOS 7.

PG LOG error:

2020-07-22 18:00:45.713 UTC [23573] ERROR:  invalid memory alloc request size 4778771801457815560
2020-07-22 18:00:45.713 UTC [23582] ERROR:  invalid memory alloc request size 4778771801457815560
2020-07-22 18:01:02.340 UTC [11791] LOG:  server process (PID 23572) was terminated by signal 11: Segmentation fault
2020-07-22 18:01:02.340 UTC [11791] DETAIL:  Failed process was running: select count(transactio0_.trans_manager_id) as col_0_0_ from ...[edited, removed for security]
2020-07-22 18:01:02.340 UTC [11791] LOG:  terminating any other active server processes

dmesg -T --> [Wed Jul 22 15:18:39 2020] Pid 20578(postgres) over core_pipe_limit

I restarted the PG cluster while removing pg_stat_kcache from shared_preload_libraries and it has been running fine for almost an hour. Previously, it was crashing every 5 minutes.

I plan to re-add pg_stat_kcache to see if it causes the frequent crashes again. Will report back here today or early tomorrow.

I re-added the pg_stat_kcache extension and restarted the server. Within a minute we got the segmentation faults again. So it is most definitely the extension causing PG to crash.

At this point, I will wait until an official version comes out rather than use the recent patch that was applied. We want to be sure this problem won't reoccur in our production environment.

Btw, in our configuration we have tons of memory: shared_buffers was set to 128GB, with high values for the other memory parameters, in addition to increased worker counts (parallel workers in particular). There was a lot of app connectivity going on through PgBouncer, but very little load on the PG instance.

rjuju commented

Take care @devrimgunduz-edb !

rjuju commented

@MichaelDBA if you have the RPM v2.1.2, it's clear that the bug will reoccur. Waiting for the new version is recommended, but you could also downgrade to 2.1.1.

Pushed packages for RHEL 7.

Thanks a lot @devrimgunduz