slact/nchan

SIGSEGV in ngx_http_charset_recode_to_utf8, Possibly Caused by nchan_unsubscribe_request

I'm seeing a fairly reproducible failure that seems linked to use of the nchan_unsubscribe_request directive.

Here's my nginx.conf:

user www-data;
worker_processes auto;
pid /run/nginx.pid;
include /etc/nginx/modules-enabled/*.conf;
daemon off;
error_log /dev/stdout info;

worker_rlimit_core  500M;
working_directory /tmp;

events {
	worker_connections 768;
}
http {
	sendfile on;
	tcp_nopush on;
	tcp_nodelay on;
	keepalive_timeout 65;
	types_hash_max_size 2048;

	include /etc/nginx/mime.types;
	default_type application/octet-stream;

	#gzip on;
	#gzip_types application/json;

	upstream helpable {
		server localhost:8888;
	}
	server {
		listen 80 default_server;
		listen [::]:80 default_server;
		server_name _;
		location / {
			proxy_pass http://helpable;
		}
		location ~ /internal/subscribe/(.*)$ {
			internal;
			nchan_subscriber websocket;
			nchan_channel_id "$1";
			nchan_channel_id_split_delimiter ",";
			nchan_subscribe_request /internal/connected;
			nchan_unsubscribe_request /internal/disconnected;
		}
		location = /internal/connected {
			proxy_pass http://helpable/presence/connected;
			proxy_set_header X-Channel-Id $nchan_channel_id;
		}
		location = /internal/disconnected {
			proxy_pass http://helpable/presence/disconnected;
			proxy_ignore_client_abort on;
			proxy_set_header X-Channel-Id $nchan_channel_id;
		}
		location ~ /publish/(\w+)$ {
			allow 127.0.0.1;
			deny all;
			nchan_publisher http;
			nchan_channel_id "$1";
		}
	location /nchan_stub_status {
		nchan_stub_status;
	}
	}
}

As you can see, I have an internal route that lets a client subscribe to multiple channels over a single connection, and I've enabled presence detection using nchan_(un)subscribe_request.
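
To make the flow concrete: since that subscriber location is marked internal, clients reach it through an X-Accel-Redirect issued by my backend. The sketch below is illustrative only, not my real code (the public /subscribe path, the port, and the bare Node handler are placeholders):

import { createServer } from "node:http";

// Hypothetical hand-off: the app receives the WebSocket handshake on a public
// /subscribe/<channels> path and tells nginx to replay it against the internal
// nchan subscriber location via X-Accel-Redirect.
createServer((req, res) => {
  const match = req.url?.match(/^\/subscribe\/([\w,]+)$/);
  if (match) {
    // (a real handler would authenticate and validate the channel list here)
    res.writeHead(200, { "X-Accel-Redirect": `/internal/subscribe/${match[1]}` });
    res.end();
    return;
  }
  res.writeHead(404);
  res.end();
}).listen(8888);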

I have a simple test client that subscribes to three channels, and a server that repeatedly sends messages out over all three of those channels. But after only a few page reloads on the client, this happens:

api-api-1      | 2021/12/31 02:31:34 [notice] 29#29: signal 17 (SIGCHLD) received from 37
api-api-1      | 2021/12/31 02:31:34 [alert] 29#29: worker process 37 exited on signal 11 (core dumped)
api-api-1      | 2021/12/31 02:31:34 [notice] 29#29: start worker process 98
api-api-1      | 2021/12/31 02:31:34 [notice] 29#29: signal 29 (SIGIO) received

Note that this always seems to happen in between a disconnection and a reconnection. When the client first connects, everything works and I can see the messages arriving. But after a few reloads, no messages arrive at all, even though the client reports that it has a connection.

I'm using the latest stable version of nginx (1.20) and the latest version of nchan (1.2.15). I was previously on older versions of both and was seeing some strange behavior (certain messages arriving in multiple copies), so I upgraded. I don't know whether this crash affected those older versions as well; if it did, I didn't notice it.

On the client I'm using the npm reconnecting-websocket library, rather than the one provided by nchan. But that shouldn't cause a server-side crash.
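
For reference, the client side is roughly the sketch below (the channel names and the public /subscribe path are placeholders, not my actual application code):

import ReconnectingWebSocket from "reconnecting-websocket";

// Subscribe to three channels over one multiplexed connection; the backend
// redirects this public path to /internal/subscribe/<channels> (see above).
const channels = ["alpha", "beta", "gamma"];
const ws = new ReconnectingWebSocket(`ws://localhost/subscribe/${channels.join(",")}`);

ws.addEventListener("open", () => console.log("connected"));
ws.addEventListener("message", (ev) => console.log("message:", ev.data));
ws.addEventListener("close", () => console.log("closed, will reconnect"));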

I rebuilt nginx and nchan with debugging symbols and was able to load the core dump produced when the worker crashed. Here's the stack trace:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000aaaab8df9704 in ngx_http_charset_recode_to_utf8 (ctx=0xaaaac55c6380, buf=0xaaaac55e6478, pool=0xaaaac5644970) at src/http/modules/ngx_http_charset_filter_module.c:977
977             if (table[*src * NGX_UTF_LEN] == '\1') {
(gdb) bt
#0  0x0000aaaab8df9704 in ngx_http_charset_recode_to_utf8 (ctx=0xaaaac55c6380, buf=0xaaaac55e6478, pool=0xaaaac5644970) at src/http/modules/ngx_http_charset_filter_module.c:977
#1  ngx_http_charset_body_filter (r=0xaaaac56449c0, in=<optimized out>) at src/http/modules/ngx_http_charset_filter_module.c:584
#2  0x0000aaaab8dfae50 in ngx_http_sub_body_filter (r=0xaaaab8d80158 <ngx_palloc+24>, in=0xaaaac55e6430) at src/http/modules/ngx_http_sub_filter_module.c:299
#3  0x0000aaaab8dfbf14 in ngx_http_addition_body_filter (r=0xaaaac56449c0, in=0xaaaac55e6430) at src/http/modules/ngx_http_addition_filter_module.c:149
#4  0x0000aaaab8dfc348 in ngx_http_gunzip_body_filter (r=0xaaaac56449c0, in=0xaaaac55e6430) at src/http/modules/ngx_http_gunzip_filter_module.c:185
#5  0x0000aaaab8dfe420 in ngx_http_trailers_filter (r=0xaaaac56449c0, in=0xaaaac55e6430) at src/http/modules/ngx_http_headers_filter_module.c:264
#6  0x0000aaaab8d82e28 in ngx_output_chain (ctx=ctx@entry=0xaaaac55e63a0, in=in@entry=0xaaaac55e6468) at src/core/ngx_output_chain.c:234
#7  0x0000aaaab8dff094 in ngx_http_copy_filter (r=0xaaaac56449c0, in=0xaaaac55e6468) at src/http/ngx_http_copy_filter_module.c:152
#8  0x0000aaaab8df26f0 in ngx_http_range_body_filter (r=0xaaaac56449c0, in=0xaaaac55e6468) at src/http/modules/ngx_http_range_filter_module.c:635
#9  0x0000aaaab8dffb28 in ngx_http_slice_body_filter (r=0xaaaac56449c0, in=<optimized out>) at src/http/modules/ngx_http_slice_filter_module.c:228
#10 0x0000aaaab8dc5e6c in ngx_http_output_filter (r=r@entry=0xaaaac56449c0, in=in@entry=0xaaaac55e6468) at src/http/ngx_http_core_module.c:1863
#11 0x0000aaaab8e74290 in nchan_output_filter_generic (r=0xaaaac56449c0, msg=msg@entry=0x0, in=0xaaaac55e6468) at ../nchan-1.2.15//src/util/nchan_output.c:261
#12 0x0000aaaab8e74808 in nchan_output_filter (r=<optimized out>, in=<optimized out>) at ../nchan-1.2.15//src/util/nchan_output.c:300
#13 0x0000aaaab8e8a754 in ws_output_filter (fsub=fsub@entry=0xaaaac55dd7c0, chain=<optimized out>) at ../nchan-1.2.15//src/subscribers/websocket.c:425
#14 0x0000aaaab8e8bce8 in websocket_send_close_frame (fsub=fsub@entry=0xaaaac55dd7c0, code=code@entry=1000, err=err@entry=0xffffeddc1aa8)
    at ../nchan-1.2.15//src/subscribers/websocket.c:1865
#15 0x0000aaaab8e8bd7c in websocket_send_close_frame_cstr (fsub=fsub@entry=0xaaaac55dd7c0, code=code@entry=1000, err=err@entry=0xaaaab8ed07c0 "410 Gone")
    at ../nchan-1.2.15//src/subscribers/websocket.c:1856
#16 0x0000aaaab8e8deb8 in websocket_dequeue (self=0xaaaac55dd7c0) at ../nchan-1.2.15//src/subscribers/websocket.c:1242
#17 0x0000aaaab8e8aa18 in websocket_finalize_request (fsub=fsub@entry=0xaaaac55dd7c0) at ../nchan-1.2.15//src/subscribers/websocket.c:292
#18 0x0000aaaab8e8b564 in websocket_reading_finalize (r=r@entry=0xaaaac56449c0) at ../nchan-1.2.15//src/subscribers/websocket.c:1308
#19 0x0000aaaab8e8c8ac in websocket_reading (r=0xaaaac56449c0) at ../nchan-1.2.15//src/subscribers/websocket.c:1349
#20 0x0000aaaab8dca4cc in ngx_http_request_handler (ev=0xaaaac5620a00) at src/http/ngx_http_request.c:2400
#21 0x0000aaaab8daec30 in ngx_epoll_process_events (cycle=0xaaaac55e1af0, timer=<optimized out>, flags=1) at src/event/modules/ngx_epoll_module.c:901
#22 0x0000aaaab8da1e44 in ngx_process_events_and_timers (cycle=cycle@entry=0xaaaac55e1af0) at src/event/ngx_event.c:247
#23 0x0000aaaab8dac8d4 in ngx_worker_process_cycle (cycle=0xaaaac55e1af0, data=<optimized out>) at src/os/unix/ngx_process_cycle.c:719
#24 0x0000aaaab8daaa2c in ngx_spawn_process (cycle=cycle@entry=0xaaaac55e1af0, proc=proc@entry=0xaaaab8dac7b4 <ngx_worker_process_cycle>, data=data@entry=0x0,
    name=name@entry=0xaaaab8ec87e0 "worker process", respawn=respawn@entry=-3) at src/os/unix/ngx_process.c:199
#25 0x0000aaaab8dab98c in ngx_start_worker_processes (cycle=cycle@entry=0xaaaac55e1af0, n=4, type=type@entry=-3) at src/os/unix/ngx_process_cycle.c:344
#26 0x0000aaaab8dad2a0 in ngx_master_process_cycle (cycle=0xaaaac55e1af0) at src/os/unix/ngx_process_cycle.c:130
#27 0x0000aaaab8d7e4f8 in main (argc=0, argv=<optimized out>) at src/core/nginx.c:383

I'm happy to send along the core dump itself, or provide more information if needed.

A few things I've noticed that might be helpful:

  1. As stated in the title, this seems related to nchan_unsubscribe_request. Specifically, with that directive removed I haven't been able to reproduce the crash. (I can't be sure the bug is actually gone; it may just be much harder to trigger.) Unfortunately, I need that feature for my application.
  2. I suspect a race condition between publishing and disconnection: if I increase the rate at which the publisher generates messages, the error becomes easier to trigger, and if I reduce it, harder (a sketch of the publisher loop I'm using to vary the rate is below). But this is just a guess, and a somewhat surprising one, since both operations should be very fast.
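
For completeness, the publishing side of my reproduction looks roughly like the sketch below (channel names are placeholders, and it assumes a runtime with a global fetch, e.g. Node 18+):

// Publish a small JSON payload to each channel on a fixed interval.
// Runs on the same host as nginx, so it passes the allow 127.0.0.1 check
// on the /publish location. Lowering intervalMs makes the crash easier to hit.
const channels = ["alpha", "beta", "gamma"];
const intervalMs = 50;

setInterval(() => {
  for (const chan of channels) {
    fetch(`http://127.0.0.1/publish/${chan}`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ sent: Date.now() }),
    }).catch((err) => console.error("publish failed:", err));
  }
}, intervalMs);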

On a side note, I only started experimenting with nchan over the past few days. I was excited about integrating it into my application and was making good progress in that direction. But I'm worried now, since it seems that certain core features aren't stable, or at least not on certain versions of nginx. If I can crash it with just a few page reloads, it's clearly not going to hold up in production. I appreciate that this is complex stuff, that nginx is a quickly-moving target, and that a new release of nchan is planned; even so, I'd rather not have to move to another server platform.

All of which is to say: is there a particular combination of nginx and nchan versions that is known to be stable? What I'm building will be wrapped up in a container, so I'm very flexible about which version of nginx I use.

Thanks in advance, and for all of your hard work on this project.