powa-team/pg_stat_kcache

Segmentation fault

Closed this issue · 34 comments

tbe commented

On one of our systems, pg_stat_kcache triggers a segfault:

PG: 11.8, pg_stat_kcache: 2.1.2

(gdb) bt
#0  0x000000000085c3d7 in get_hash_entry (freelist_idx=0, hashp=0xd74d50) at dynahash.c:1322
#1  hash_search_with_hash_value (hashp=0xd74d50, keyPtr=keyPtr@entry=0x7fffffffd360, hashvalue=492298745, action=action@entry=HASH_ENTER, foundPtr=foundPtr@entry=0x7fffffffd270) at dynahash.c:1055
#2  0x000000000085c46d in hash_search (hashp=<optimized out>, keyPtr=keyPtr@entry=0x7fffffffd360, action=action@entry=HASH_ENTER, foundPtr=foundPtr@entry=0x7fffffffd270) at dynahash.c:911
#3  0x00007fffe8834c85 in pgsk_entry_alloc (key=key@entry=0x7fffffffd360) at pg_stat_kcache.c:634
#4  0x00007fffe8835640 in pgsk_entry_store (queryId=<optimized out>, counters=...) at pg_stat_kcache.c:587
#5  pgsk_ExecutorEnd (queryDesc=0xddfd40) at pg_stat_kcache.c:806
#6  0x00000000005bdb24 in PortalCleanup (portal=<optimized out>) at portalcmds.c:301
#7  0x00000000008797aa in PortalDrop (portal=0xe19970, isTopCommit=<optimized out>) at portalmem.c:499
#8  0x0000000000746008 in exec_simple_query (query_string=0xd783e0 "SELECT datname FROM pg_database WHERE datistemplate = false;") at postgres.c:1155
#9  0x00000000007472ca in PostgresMain (argc=<optimized out>, argv=argv@entry=0xdc5a90, dbname=0xdc59b8 "postgres", username=<optimized out>) at postgres.c:4193
#10 0x00000000004807f6 in BackendRun (port=0xdc3420) at postmaster.c:4364
#11 BackendStartup (port=0xdc3420) at postmaster.c:4036
#12 ServerLoop () at postmaster.c:1707
#13 0x00000000006d9fd9 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0xd72d20) at postmaster.c:1380
#14 0x000000000048167f in main (argc=5, argv=0xd72d20) at main.c:228

This happens on a hot-standby system.

rjuju commented

Hello, and thanks for the report. I see that there's an issue in the shared memory size computation :(
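For context, the amount of shared memory the module asks for at load time is estimated roughly as in the sketch below (simplified; the struct fields and names are illustrative rather than the literal pg_stat_kcache source). If that estimate is too small, the hash table later grows into memory that belongs to other structures.

```c
#include "postgres.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"
#include "storage/spin.h"
#include "utils/hsearch.h"

/* Illustrative stand-ins for the real pg_stat_kcache types. */
typedef struct pgskHashKey
{
	Oid			userid;
	Oid			dbid;
	uint64		queryid;
} pgskHashKey;

typedef struct pgskEntry
{
	pgskHashKey key;
	double		utime;			/* the real code stores many more counters */
	double		stime;
	slock_t		mutex;
} pgskEntry;

typedef struct pgskSharedState
{
	LWLock	   *lock;
} pgskSharedState;

static int	pgsk_max;			/* max # of entries to track */

/*
 * Estimate the shared memory needed: the fixed state struct plus a hash
 * table sized for pgsk_max entries.  If the entry size (or entry count)
 * passed to hash_estimate_size() is smaller than what the hash table is
 * later created with, the table ends up stealing shared memory that
 * belongs to other structures once it grows.
 */
static Size
pgsk_memsize(void)
{
	Size		size;

	size = MAXALIGN(sizeof(pgskSharedState));
	size = add_size(size, hash_estimate_size(pgsk_max, sizeof(pgskEntry)));

	return size;
}
```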

tbe commented

Do you need any help, additional info, or configuration settings? The strange thing is that we only have an issue on this one cluster; all other clusters with the same versions are running stably.

rjuju commented

Commit 19e1d44 should fix the issue.

I tried to reproduce the problem locally, but I've been unable to do so so far. If you have a test environment where you already hit the issue, could you check whether it's now OK?

rjuju commented

The strange thing is that we only have an issue on this one cluster; all other clusters with the same versions are running stably.

This isn't especially surprising. You can probably only hit this issue if you have almost all possible connections active and if all other hash tables in shared memory have already expanded to their full size (by default the lock table and some other hash tables are only initialized to half of their total size, so you need to acquire enough locks at the same time to grow them).
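For illustration, this is roughly how such a shared hash table is created with an initial size below its maximum (a generic sketch, not code from pg_stat_kcache or the lock manager):

```c
#include "postgres.h"
#include "storage/shmem.h"
#include "utils/hsearch.h"

/*
 * A shared hash table created with init_size < max_size: only half of the
 * entries are allocated up front, and the rest is carved out of the spare
 * shared memory on demand as the table grows.  Once all such tables have
 * grown to their maximum, any module that under-estimated its own shared
 * memory request has nothing left to grow into, and get_hash_entry() ends
 * up failing or touching memory it doesn't own.
 */
static HTAB *
create_demo_shared_hash(long max_size)
{
	HASHCTL		info;

	memset(&info, 0, sizeof(info));
	info.keysize = sizeof(uint32);
	info.entrysize = sizeof(uint32) * 4;

	return ShmemInitHash("demo hash",	/* table name in shared memory */
						 max_size / 2,	/* init_size: half allocated up front */
						 max_size,		/* max_size: hard cap */
						 &info,
						 HASH_ELEM | HASH_BLOBS);
}
```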

tbe commented

The change does not solve the issue here.

In the meantime, we have discovered that the master also has an issue: it tries to allocate HUGE amounts of memory:

2020-07-15 11:43:24 CEST [17884]: [6-1] user=postgres,db=postgres_backup,xid=0,vxid=32/3,app=[unknown] STATEMENT:  SELECT "pg_stat_io_user_tables"."schemaname", "pg_stat_io_user_tables"."relname", "pg_stat_io_user_tables"."heap_blks_read
", "pg_stat_io_user_tables"."heap_blks_hit", "pg_stat_io_user_tables"."idx_blks_read", "pg_stat_io_user_tables"."idx_blks_hit", "pg_stat_io_user_tables"."toast_blks_read", "pg_stat_io_user_tables"."toast_blks_hit", "pg_stat_io_user_table
s"."tidx_blks_read", "pg_stat_io_user_tables"."tidx_blks_hit" FROM pg_statio_user_tables AS "pg_stat_io_user_tables" WHERE (schemaname NOT LIKE 'pg_temp_%')
2020-07-15 11:43:24 CEST [17879]: [3-1] user=postgres,db=usermgmt,xid=0,vxid=24/2,app=[unknown] ERROR:  invalid memory alloc request size 16913468333384398648

We cannot test the fix on this side at this time, as we would have to restart the master for that.

You can probably only hit this issue if you have almost all possible connections active and if all other hash tables in shared memory have already expanded to their full size (by default the lock table and some other hash tables are only initialized to half of their total size, so you need to acquire enough locks at the same time to grow them).

The segmentation fault happens on the standby, where nearly no connections are open, except for our Prometheus exporter and monitoring.

What we can see in the backtrace is that the nentries member of the hctl seems corrupt:

{freeList = {{mutex = 56 '8', nentries = 8524192900200068020, freeList = 0xb2bd2eb1a7292638}, {mutex = 56 '8', nentries = -5567242223811484104, freeList = 0x764c04479554f3b4}, {mutex = 56 '8', nentries = 8524192900200068020, 
      freeList = 0xb2bd2eb1a7292638}, {mutex = 56 '8', nentries = 8524192900200068020, freeList = 0xb2bd2eb1a7292638}, {mutex = 56 '8', nentries = -5567242223811484104, freeList = 0x764c04479554f3b4}, {mutex = 180 '\264', 
      nentries = 8524192900200068020, freeList = 0x764c04479554f3b4}, {mutex = 56 '8', nentries = 8524192900200068020, freeList = 0x764c04479554f3b4}, {mutex = 180 '\264', nentries = 8524192900200068020, freeList = 0x764c04479554f3b4}, 
    {mutex = 180 '\264', nentries = 8524192900200068020, freeList = 0x764c04479554f3b4}, {mutex = 180 '\264', nentries = 0, freeList = 0x0}, {mutex = 0 '\000', nentries = 0, freeList = 0x0} <repeats 22 times>}, dsize = 256, nsegs = 32, 
  max_bucket = 8191, high_mask = 16383, low_mask = 8191, keysize = 16, entrysize = 128, num_partitions = 0, ffactor = 1, max_dsize = 256, ssize = 256, sshift = 8, nelem_alloc = 56}
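For reference, the fields dumped above correspond roughly to this structure from src/backend/utils/hash/dynahash.c (PostgreSQL 11, abbreviated):

```c
/* Abbreviated from src/backend/utils/hash/dynahash.c (PostgreSQL 11) */
#define NUM_FREELISTS 32

typedef struct
{
	slock_t		mutex;			/* spinlock for this freelist */
	long		nentries;		/* number of entries in associated buckets */
	HASHELEMENT *freeList;		/* chain of free elements */
} FreeListData;

struct HASHHDR
{
	FreeListData freeList[NUM_FREELISTS];	/* per-partition freelists */

	/* These fields can change, but not in a partitioned table */
	long		dsize;			/* directory size */
	long		nsegs;			/* number of allocated segments (<= dsize) */
	uint32		max_bucket;		/* ID of maximum bucket in use */
	uint32		high_mask;		/* mask to modulo into entire table */
	uint32		low_mask;		/* mask to modulo into lower half of table */

	/* These fields are fixed at hashtable creation */
	Size		keysize;		/* hash key length in bytes */
	Size		entrysize;		/* total user element size in bytes */
	long		num_partitions; /* # partitions (must be power of 2), or 0 */
	long		ffactor;		/* target fill factor */
	long		max_dsize;		/* 'dsize' limit if directory is fixed size */
	long		ssize;			/* segment size --- must be power of 2 */
	int			sshift;			/* segment shift = log2(ssize) */
	int			nelem_alloc;	/* number of entries to allocate at once */
};
```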

tbe commented

We cannot test the fix on this side at this time, as we would have to restart the master for that.

We decided to give it a try and tested this as a hotfix on the master. The above memory allocation errors are gone; we will monitor it a little bit longer. So that fixed half of our issue.

We are analyzing this issue with the support of @anse1 on the affected system.

rjuju commented

The change does not solve the issue here.

Just to be sure, did you restart postgres after compiling the new binary?

In the meantime, we have discovered that the master also has an issue: it tries to allocate HUGE amounts of memory:

This looks like a corrupted field somewhere. Is the error reproducible, and if so, can you get the error message after first running SET log_error_verbosity = verbose;?

What we can see in the backtrace is that the nentries member of the hctl seems corrupt:

Indeed. The fact that the allocated shared memory size was previously too low could lead pg_stat_kcache to overwrite memory it doesn't own.

tbe commented

First:

We decided to give it a try and tested this as a hotfix on the master. The above memory allocation errors are gone; we will monitor it a little bit longer.

The error happens again. log_error_verbosity exposes just:

2020-07-15 14:39:06 CEST [11695]: [35-1] user=app_vault,db=vault,xid=754720649,vxid=10/302,app=[unknown] ERROR:  XX000: invalid memory alloc request size 12922605294342592712
2020-07-15 14:39:06 CEST [11695]: [36-1] user=app_vault,db=vault,xid=754720649,vxid=10/302,app=[unknown] LOCATION:  palloc, mcxt.c:934

Just to be sure, did you restart postgres after compiling the new binary?

Yes, we start the instance in the foreground for each test.

Indeed. The fact that the allocated shared memory size was previously too low could lead pg_stat_kcache to overwrite memory it doesn't own.

~~Yeah, but the other members of the struct are most likely valid.~~ EDIT: No, they are corrupt too; the outer struct is valid, but everything in hctl seems random.

Due to a hypervisor crash last week, it is possible that we have corruption in the catalog (I don't trust VMware ...). But the crashes and bad allocations are only triggered if pg_stat_kcache is loaded.

tbe commented

The memory allocation errors are most likely triggered here: https://github.com/powa-team/pg_stat_kcache/blob/master/pg_stat_kcache.c#L686

If the nentries field is corrupted, this will try to allocate an insane amount of memory.
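If that spot is the usual entry-eviction path (the same pattern as pg_stat_statements' entry_dealloc()), the allocation would look roughly like this (illustrative sketch, not the exact pg_stat_kcache code):

```c
#include "postgres.h"
#include "utils/hsearch.h"

/* Illustrative stand-ins for the real pg_stat_kcache types. */
typedef struct pgskHashKey
{
	Oid			userid;
	Oid			dbid;
	uint64		queryid;
} pgskHashKey;

typedef struct pgskEntry
{
	pgskHashKey key;
	int64		usage;			/* counters omitted */
} pgskEntry;

static HTAB *pgsk_hash;			/* shared hash table, set up elsewhere */

/*
 * Eviction sketch modeled on pg_stat_statements' entry_dealloc().  The
 * palloc() size comes straight from the nentries counter kept in shared
 * memory, so a corrupted counter becomes an "invalid memory alloc request
 * size ..." error like the ones in the log above.
 */
static void
pgsk_entry_dealloc(void)
{
	HASH_SEQ_STATUS hash_seq;
	pgskEntry **entries;
	pgskEntry  *entry;
	int			nvictims;
	int			i = 0;

	entries = palloc(hash_get_num_entries(pgsk_hash) * sizeof(pgskEntry *));

	hash_seq_init(&hash_seq, pgsk_hash);
	while ((entry = hash_seq_search(&hash_seq)) != NULL)
		entries[i++] = entry;

	/* the real code sorts by usage first; here we simply drop ~5% of entries */
	nvictims = Max(10, i / 20);
	nvictims = Min(nvictims, i);

	for (i = 0; i < nvictims; i++)
		hash_search(pgsk_hash, &entries[i]->key, HASH_REMOVE, NULL);

	pfree(entries);
}
```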

I have verified the pg_stat_kcache stat file to make sure there is no corruption there:

$ hexdump pg_stat/pg_stat_kcache.stat 
0000000 6e10 0d75 0000 0000                    
0000008

BTW:
The while loop in https://github.com/powa-team/pg_stat_kcache/blob/master/pg_stat_kcache.c#L630 compares pgsk_max (int) with the result of hash_get_num_entries (long).

rjuju commented

I just found another issue which would explain the symptoms you have, fixed with 9f36117. I went through the rest of the code and didn't spot any other issue.

tbe commented

I just found another issue which would explain the symptoms you have, fixed with 9f36117. I went through the rest of the code and didn't spot any other issue.

We'll give it a try.

rjuju commented

BTW:
The while loop in https://github.com/powa-team/pg_stat_kcache/blob/master/pg_stat_kcache.c#L630 compares pgsk_max (int) with the result of hash_get_num_entries (long).

Indeed, but this shouldn't be an issue, as pgsk_max can't exceed INT_MAX. FTR, this is using the same code as upstream pg_stat_statements:

static int	pgss_max;			/* max # statements to track */
[...]
static pgssEntry *
entry_alloc(pgssHashKey *key, Size query_offset, int query_len, int encoding,
			bool sticky)
{
	pgssEntry  *entry;
	bool		found;

	/* Make space if needed */
	while (hash_get_num_entries(pgss_hash) >= pgss_max)
		entry_dealloc();
[...]
tbe commented

I just found another issue which would explain the symptoms you have, fixed with 9f36117. I went through the rest of the code and didn't spot any other issue.

Looks good on the standby at the moment. We'll keep monitoring it a little bit longer, but as the original issue happened within a few seconds of startup, it looks very promising.

tbe commented

Still looks good. Thank you for the fast solution. Closing this issue for now; we will test on the master tomorrow and reopen if needed.

When will this be pushed to the master branch?

rjuju commented

This is already pushed to the master branch, so you can use the fix right now. I'll wait until tomorrow for @tbe to confirm that there's still no issue, and then push a new release.

rjuju commented

Closing this issue for now; we will test on the master tomorrow and reopen if needed.

Thanks a lot @tbe !

NikolayS commented

Also got it on CentOS 7 and Postgres 9.6 (it doesn't matter which minor version; saw it on 9.6.3, 9.6.12 and 9.6.18). Once the package was updated (to pg_stat_kcache96-2.1.2-1.rhel7.x86_64; the SQL in the databases remained on 2.1.1), we started to get a bunch of various errors:

  1. ERROR: invalid memory alloc request size 16032070410878933784 (various values here)
  2. sometimes PANIC: hash table "pg_stat_kcache hash" corrupted
  3. finally, not that often, but sometimes segfaults. So far I've noticed 3 slightly different cases, all failing in dynahash.c (unfortunately, I don't have full backtraces right now):
Cannot access memory at address 0x7f1a72ea1128
Cannot access memory at address 0x7f1a72ea1120
Failed to read a valid object file image from memory.
Core was generated by `postgres: ***(55554)           '.
Program terminated with signal 11, Segmentation fault.
#0  hash_search_with_hash_value (hashp=0xca3338, hashp@entry=<error reading variable: Cannot access memory at address 0x7ffd021ac418>,
    keyPtr=<error reading variable: Cannot access memory at address 0x7ffd021ac398>, keyPtr@entry=<error reading variable: Cannot access memory at address 0x7ffd021ac418>, hashvalue=313889387,
    hashvalue@entry=<error reading variable: Cannot access memory at address 0x7ffd021ac418>, action=<error reading variable: Cannot access memory at address 0x7ffd021ac3a0>,
    action@entry=<error reading variable: Cannot access memory at address 0x7ffd021ac418>, foundPtr=<error reading variable: Cannot access memory at address 0x7ffd021ac3a8>,
    foundPtr@entry=<error reading variable: Cannot access memory at address 0x7ffd021ac418>) at dynahash.c:948

and

#0  hash_search_with_hash_value (hashp=0xc70dc8, hashp@entry=<error reading variable: Cannot access memory at address 0x7ffd021ac318>,
    keyPtr=<error reading variable: Cannot access memory at address 0x7ffd021ac298>, keyPtr@entry=<error reading variable: Cannot access memory at address 0x7ffd021ac318>, hashvalue=4209549525,
    hashvalue@entry=<error reading variable: Cannot access memory at address 0x7ffd021ac318>, action=<error reading variable: Cannot access memory at address 0x7ffd021ac2a0>,
    action@entry=<error reading variable: Cannot access memory at address 0x7ffd021ac318>, foundPtr=<error reading variable: Cannot access memory at address 0x7ffd021ac2a8>,
    foundPtr@entry=<error reading variable: Cannot access memory at address 0x7ffd021ac318>) at dynahash.c:954
954	dynahash.c: No such file or directory.

and

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: ***(22976)'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000007d88bd in get_hash_entry (freelist_idx=0, hashp=0xca3848) at    
1283    dynahash.c: No such file or directory.

Interestingly, dropping the extension didn't help -- only complete removal (from shared_preload_libraries).

As already discussed with @rjuju on Twitter, I was also able to reliably reproduce the problem using the same approach as described in powa-team/pg_qualstats#20 (running pgbench with -c200).

@tbe I checked the open issues in the project this morning and didn't see any. This bug hurts a lot, and it would be good if the issue remained open for a while for better visibility. As it happened, it was in the "Open" state for only 4 hours before being closed, even though the bugfix has not yet been released and, as I understand it, is not yet fully confirmed.

@rjuju -- thank you for sending me the direct link to this. I'll try to find time for testing against the master version.

rjuju commented

@NikolayS agreed, I just reopened the issue until the release is done.

Interestingly, dropping the extension didn't help -- only complete removal (from shared_preload_libraries).

Yes, that's unfortunately intended. Creating/dropping the extension only handles the SQL wrappers used to access the shared memory content. Having it in shared_preload_libraries is what makes the rest of the code (the hooks and all the processing) run.
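Roughly, the module's _PG_init() looks like the sketch below (simplified, with the bodies stubbed out and some names illustrative): everything in it runs only when the library is loaded via shared_preload_libraries, and none of it is undone by DROP EXTENSION.

```c
#include "postgres.h"
#include "fmgr.h"
#include "miscadmin.h"
#include "executor/executor.h"
#include "storage/ipc.h"
#include "storage/lwlock.h"

PG_MODULE_MAGIC;

static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
static ExecutorEnd_hook_type prev_ExecutorEnd = NULL;

static void
pgsk_shmem_startup(void)
{
	if (prev_shmem_startup_hook)
		prev_shmem_startup_hook();
	/* the real code creates/attaches the shared hash table here */
}

static void
pgsk_ExecutorEnd(QueryDesc *queryDesc)
{
	/* the real code accumulates getrusage() counters here */
	if (prev_ExecutorEnd)
		prev_ExecutorEnd(queryDesc);
	else
		standard_ExecutorEnd(queryDesc);
}

static Size
pgsk_memsize(void)
{
	/* placeholder; the real estimate uses hash_estimate_size(), see earlier */
	return (Size) 8192;
}

/*
 * Simplified sketch of _PG_init().  This only runs when the library is in
 * shared_preload_libraries; CREATE/DROP EXTENSION merely adds or removes
 * the SQL-visible wrapper functions and does not touch the shared memory
 * or the installed hooks.
 */
void
_PG_init(void)
{
	if (!process_shared_preload_libraries_in_progress)
		return;

	/* reserve shared memory and a named LWLock tranche */
	RequestAddinShmemSpace(pgsk_memsize());
	RequestNamedLWLockTranche("pg_stat_kcache", 1);

	/* install hooks: this is what keeps the module active at run time */
	prev_shmem_startup_hook = shmem_startup_hook;
	shmem_startup_hook = pgsk_shmem_startup;
	prev_ExecutorEnd = ExecutorEnd_hook;
	ExecutorEnd_hook = pgsk_ExecutorEnd;
}
```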

tbe commented

@rjuju the master branch has now been running on the master for some time without any issue. Looks good.

rjuju commented

Thanks for confirming @tbe! I'll wait for confirmation from @NikolayS too and push a release!

anse1 commented
rjuju commented

Thanks a lot @anse1 !

NikolayS commented

@rjuju I've also checked 9f36117: no errors.

rjuju commented

Thanks a lot @NikolayS and everyone!

I just pushed v2.1.3!

@rjuju any info on the ETA of the 2.1.3 RPM?

rjuju commented

I think I already took care of everything needed on the repository side.

@devrimgunduz do you need anything else on the repo for the new RPM?

I am unable to work because of surgery. I will try pushing the RPMs on Friday.

MichaelDBA commented

I think I just got the same problem, using the latest pg_stat_kcache version, 2.1.2, with PG 12.3 on CentOS 7.

PG LOG error:

2020-07-22 18:00:45.713 UTC [23573] ERROR:  invalid memory alloc request size 4778771801457815560
2020-07-22 18:00:45.713 UTC [23582] ERROR:  invalid memory alloc request size 4778771801457815560
2020-07-22 18:01:02.340 UTC [11791] LOG:  server process (PID 23572) was terminated by signal 11: Segmentation fault
2020-07-22 18:01:02.340 UTC [11791] DETAIL:  Failed process was running: select count(transactio0_.trans_manager_id) as col_0_0_ from ...[edited, removed for security]
2020-07-22 18:01:02.340 UTC [11791] LOG:  terminating any other active server processes

dmesg -T --> [Wed Jul 22 15:18:39 2020] Pid 20578(postgres) over core_pipe_limit

I restarted the PG cluster while removing pg_stat_kcache from shared_preload_libraries and it has been running fine for almost an hour. Previously, it was crashing every 5 minutes.

I plan to re-add pg_stat_kcache to see if it causes the frequent crashes again. Will report back here today or early tomorrow.

I re-added the pg_stat_kcache extension and restarted the server. Within a minute we got the segmentation faults again. So it is most definitely the extension causing PG to crash.

At this point, I will wait until an official version comes out rather than use the recent patch that was applied. We want to be sure this problem won't reoccur in our production environment.

Btw, in our configuration we have tons of memory: shared_buffers was set to 128GB, with high values for the other memory parameters, in addition to increased worker counts (parallel workers in particular). There was a lot of app connectivity going on through PgBouncer, but very little load on the PG instance.

rjuju commented

Take care @devrimgunduz-edb !

rjuju commented

@MichaelDBA if you have the RPM v2.1.2, it's clear that the bug will reoccur. Waiting for the new version is recommended, but you could also downgrade to 2.1.1.

Pushed packages for RHEL 7.

Thanks a lot @devrimgunduz