sctplab/usrsctp

Deadlock: usrsctp_conninput and user_sctp_timer_iterate

eremeev opened this issue · 3 comments

My application appears to deadlock.
I am using commit 5f3540a.

SCTP runs over DTLS/UDP.
I used https://github.com/jitsi/jitsi-sctp/blob/master/jniwrapper/native/src/org_jitsi_modified_sctp4j_SctpJni.c as a template: the receive callback and the send threshold callback are passed to usrsctp_socket, and the send callback is passed to usrsctp_init.

It seems that user_sctp_timer_iterate deadlocks with usrsctp_conninput.
usrsctp_conninput and usrsctp_close run on the DTLS threads (IO threads).
Each SctpSocket is accessed from only one thread.

Please see the thread dumps:

Thread 1607 (Thread 0x7f4524e7c700 (LWP 60587)):
#0  0x00007f45c3e784ed in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f45c3e73dcb in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007f45c3e73c98 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f4527b24873 in sctp_invoke_recv_callback (inp=0x7f44c8ea4200, stcb=0x7f44c8ed4400, control=0x7f44df79f120, inp_read_lock_held=0) at netinet/sctputil.c:5328
#4  0x00007f4527b255fc in sctp_add_to_readq (inp=0x7f44c8ea4200, stcb=0x7f44c8ed4400, control=0x7f44df79f120, sb=0x7f44c8e5beb8, end=1, inp_read_lock_held=0, so_locked=0) at netinet/sctputil.c:5435
#5  0x00007f4527b1e5e9 in sctp_notify_send_failed (stcb=0x7f44c8ed4400, sent=1 '\001', error=0, chk=0x7f455f741680, so_locked=0) at netinet/sctputil.c:3627
#6  0x00007f4527b208bc in sctp_ulp_notify (notification=5, stcb=0x7f44c8ed4400, error=0, data=0x7f455f741680, so_locked=0) at netinet/sctputil.c:4318
#7  0x00007f4527b211c2 in sctp_report_all_outbound (stcb=0x7f44c8ed4400, error=0, so_locked=0) at netinet/sctputil.c:4474
#8  0x00007f4527b223b1 in sctp_abort_notification (stcb=0x7f44c8ed4400, from_peer=false, timeout=false, error=0, abort=0x0, so_locked=0) at netinet/sctputil.c:4565
#9  0x00007f4527b2274f in sctp_abort_an_association (inp=0x7f44c8ea4200, stcb=0x7f44c8ed4400, op_err=0x0, timedout=false, so_locked=0) at netinet/sctputil.c:4755
#10 0x00007f4527aa48c5 in sctp_chunk_retransmission (inp=0x7f44c8ea4200, stcb=0x7f44c8ed4400, asoc=0x7f44c8ed4458, cnt_out=0x7f4524e7a550, now=0x7f4524e7a590, now_filled=0x7f4524e7a558, fr_done=0x7f4524e7a55c, so_locked=0) at netinet/sctp_output.c:10184
#11 0x00007f4527aa5cbd in sctp_chunk_output (inp=0x7f44c8ea4200, stcb=0x7f44c8ed4400, from_where=1, so_locked=0) at netinet/sctp_output.c:10648
#12 0x00007f4527b174a3 in sctp_timeout_handler (t=0x7f44c8e5b030) at netinet/sctputil.c:1917
#13 0x00007f4527a74ddc in sctp_handle_tick (elapsed_ticks=10) at netinet/sctp_callout.c:172
#14 0x00007f4527a75028 in user_sctp_timer_iterate (arg=0x0) at netinet/sctp_callout.c:214
#15 0x00007f45c3e71dd5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f45c3996ead in clone () from /lib64/libc.so.6
Thread 811 (Thread 0x7f448b02c700 (LWP 61482)):
#0  0x00007f45c3e784ed in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f45c3e73dcb in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007f45c3e73c98 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f4527b222de in sctp_abort_notification (stcb=0x7f44c8ed4400, from_peer=false, timeout=false, error=0, abort=0x0, so_locked=0) at netinet/sctputil.c:4562
#4  0x00007f4527b2274f in sctp_abort_an_association (inp=0x7f44c8ea4200, stcb=0x7f44c8ed4400, op_err=0x0, timedout=false, so_locked=0) at netinet/sctputil.c:4755
#5  0x00007f4527aa48c5 in sctp_chunk_retransmission (inp=0x7f44c8ea4200, stcb=0x7f44c8ed4400, asoc=0x7f44c8ed4458, cnt_out=0x7f448b029be0, now=0x7f448b029c20, now_filled=0x7f448b029be8, fr_done=0x7f448b029bec, so_locked=0) at netinet/sctp_output.c:10184
#6  0x00007f4527aa5cbd in sctp_chunk_output (inp=0x7f44c8ea4200, stcb=0x7f44c8ed4400, from_where=3, so_locked=0) at netinet/sctp_output.c:10648
#7  0x00007f4527a8ada9 in sctp_common_input_processing (mm=0x7f448b029e80, iphlen=0, offset=132, length=132, src=0x7f448b029ea0, dst=0x7f448b029eb0, sh=0x7f44c622ac50, ch=0x7f44c622ac5c, compute_crc=1 '\001', ecn_bits=0 '\000', vrf_id=0, port=0) at netinet/sctp_input.c:6155
#8  0x00007f4527a723a4 in usrsctp_conninput (addr=0x2144, buffer=0x7f44c622b800, length=132, ecn_bits=0 '\000') at user_socket.c:3336

I have plenty of threads stuck with the following stack:

Thread 1133 (Thread 0x7f455827b700 (LWP 61126)):
#0  0x00007f45c3e752ce in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
#1  0x00007f4527ac4cd2 in sctp_inpcb_free (inp=0x7f44fb7c2600, immediate=1, from=1) at netinet/sctp_pcb.c:3907
#2  0x00007f4527adcf4f in sctp_close (so=0x7f44ef6cf380) at netinet/sctp_usrreq.c:855
#3  0x00007f4527a68cb8 in sofree (so=0x7f44ef6cf380) at user_socket.c:287
#4  0x00007f4527a6fd76 in usrsctp_close (so=0x7f44ef6cf380) at user_socket.c:2020

Which locks are the Thread 1607 and Thread 811 waiting for? Which thread owns the lock?

Sorry for the late answer.
I have looked through the code. According to the code:
Thread 1607: locks TCB -> locks INP_READ -> unlocks TCB -> unlocks INP_READ -> locks TCB (after that, nothing happens)
Thread 811: locks TCB in sctp_send_abort_tcb -> locks TCB_SEND (after that, nothing happens)
I cannot find the place in which Thread 1607 locks TCB_SEND, but it is highly likely that Thread 1607 holds TCB_SEND.

I'm working on removing the TCB_SEND lock (for reasons other than avoiding a deadlock). Let me finish this; it would then be great if you could test whether the problem persists.