TCP close fail (IDFGH-3691)
mcerveny opened this issue · 18 comments
Hello.
There is problem with TCP connection close.
Problem description:
LWIP as TCP server and "nc" as TCP client. Test code:
# while true; do nc 192.168.2.247 3333 < file_size_7661.jpg ; done
"nc" generates 1024 bytes TCP segments by default.
With default config:
#define CONFIG_LWIP_TCP_WND_DEFAULT 5744
#define CONFIG_LWIP_TCP_RECVMBOX_SIZE 6
So, there is probability that there are 6x data packet (5x1024 + 1x493) in RECVMBOX and there is no room for "CLOSE" message in RECVMBOX (CONFIG_LWIP_TCP_RECVMBOX_SIZE is 6). FIN packet is acknowledged by LWIP (so client side stuck in FIN_WAIT_2) but LWIP stays in "ESTABLISHED" forever (lost "CLOSE" message). LWIP is blocked and waiting for new data in recv() !
Added some simple debug patch:
diff --git a/src/api/api_msg.c b/src/api/api_msg.c
index 24f177b9..7b72be33 100644
--- a/src/api/api_msg.c
+++ b/src/api/api_msg.c
@@ -339,9 +339,10 @@ recv_tcp(void *arg, struct tcp_pcb *pcb, struct pbuf *p, err_t err)
msg = LWIP_CONST_CAST(void *, &netconn_closed);
len = 0;
}
-
+ printf("XXXXZ %p %d\n", msg, len);
if (sys_mbox_trypost(&conn->recvmbox, msg) != ERR_OK) {
/* don't deallocate p: it is presented to us later again from tcp_fasttmr! */
+ printf("XXXXZ fail\n");
return ERR_MEM;
} else {
#if LWIP_SO_RCVBUF
Example of OK connection:
Output from console (debug from simple tcp server and added patch):
I (16207) app: Socket listening
I (16207) app_tcp: Socket accepted ip address: 192.168.1.167
XXXXZ 0x3ffde0e4 1024
XXXXZ 0x3ffceaa0 1024
XXXXZ 0x3ffcead0 1024
XXXXZ 0x3ffce8e4 1024
XXXXZ 0x3ffce914 1024
I (16217) app_tcp: Received 1024 bytes
I (16227) app_tcp: Received 4096 bytes
XXXXZ 0x3ffde0e4 1024
XXXXZ 0x3ffceaa0 1024
XXXXZ 0x3ffcead0 493
XXXXZ 0x3f40beac 0
I (16237) app_tcp: Received 1024 bytes
I (16237) app_tcp: Received 1517 bytes
I (16247) app_tcp: Connection closed - received 7661
Output from wireshark:
1433 1595323809.269575 192.168.1.167 192.168.2.247 TCP 74 39459 → 3333 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=1668935 TSecr=0 WS=4
1434 1595323809.271464 192.168.2.247 192.168.1.167 TCP 60 3333 → 39459 [SYN, ACK] Seq=0 Ack=1 Win=5744 Len=0 MSS=1440
1435 1595323809.271475 192.168.1.167 192.168.2.247 TCP 54 39459 → 3333 [ACK] Seq=1 Ack=1 Win=64800 Len=0
1436 1595323809.271541 192.168.1.167 192.168.2.247 TCP 1078 39459 → 3333 [PSH, ACK] Seq=1 Ack=1 Win=64800 Len=1024
1437 1595323809.271562 192.168.1.167 192.168.2.247 TCP 1078 39459 → 3333 [ACK] Seq=1025 Ack=1 Win=64800 Len=1024
1438 1595323809.271566 192.168.1.167 192.168.2.247 TCP 1078 39459 → 3333 [PSH, ACK] Seq=2049 Ack=1 Win=64800 Len=1024
1439 1595323809.271587 192.168.1.167 192.168.2.247 TCP 1078 39459 → 3333 [ACK] Seq=3073 Ack=1 Win=64800 Len=1024
1440 1595323809.271592 192.168.1.167 192.168.2.247 TCP 1078 39459 → 3333 [PSH, ACK] Seq=4097 Ack=1 Win=64800 Len=1024
1441 1595323809.275786 192.168.2.247 192.168.1.167 TCP 60 3333 → 39459 [ACK] Seq=1 Ack=2049 Win=3696 Len=0
1442 1595323809.277516 192.168.2.247 192.168.1.167 TCP 60 3333 → 39459 [ACK] Seq=1 Ack=4097 Win=1648 Len=0
1443 1595323809.284511 192.168.2.247 192.168.1.167 TCP 60 3333 → 39459 [ACK] Seq=1 Ack=5121 Win=5744 Len=0
1444 1595323809.284520 192.168.1.167 192.168.2.247 TCP 1078 39459 → 3333 [ACK] Seq=5121 Ack=1 Win=64800 Len=1024
1445 1595323809.284523 192.168.1.167 192.168.2.247 TCP 1078 39459 → 3333 [ACK] Seq=6145 Ack=1 Win=64800 Len=1024
1446 1595323809.284527 192.168.1.167 192.168.2.247 TCP 547 39459 → 3333 [FIN, ACK] Seq=7169 Ack=1 Win=64800 Len=493
1447 1595323809.292769 192.168.2.247 192.168.1.167 TCP 60 3333 → 39459 [ACK] Seq=1 Ack=7169 Win=3696 Len=0
1448 1595323809.296151 192.168.2.247 192.168.1.167 TCP 60 3333 → 39459 [ACK] Seq=1 Ack=7663 Win=3202 Len=0
1449 1595323809.300686 192.168.2.247 192.168.1.167 TCP 60 [TCP Window Update] 3333 → 39459 [ACK] Seq=1 Ack=7663 Win=5744 Len=0
1450 1595323809.329641 192.168.2.247 192.168.1.167 TCP 60 3333 → 39459 [FIN, ACK] Seq=1 Ack=7663 Win=5744 Len=0
1451 1595323809.329650 192.168.1.167 192.168.2.247 TCP 54 39459 → 3333 [ACK] Seq=7663 Ack=2 Win=64800 Len=0
Example of FAILED connection:
Output from console:
I (16267) app: Socket listening
XXXXZ 0x3ffde0e4 1024
I (16277) app_tcp: Socket accepted ip address: 192.168.1.167
XXXXZ 0x3ffceaa0 1024
XXXXZ 0x3ffcead0 1024
XXXXZ 0x3ffce8e4 1024
XXXXZ 0x3ffce914 1024
I (16287) app_tcp: Received 2048 bytes
XXXXZ 0x3ffceaec 1024
XXXXZ 0x3ffceaa0 1024
XXXXZ 0x3ffde0e4 493
XXXXZ 0x3f40beac 0
XXXXZ fail
I (16297) app_tcp: Received 5613 bytes
Output from wireshark:
1452 1595323809.334582 192.168.1.167 192.168.2.247 TCP 74 34142 → 3333 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 SACK_PERM=1 TSval=1668942 TSecr=0 WS=4
1453 1595323809.336422 192.168.2.247 192.168.1.167 TCP 60 3333 → 34142 [SYN, ACK] Seq=0 Ack=1 Win=5744 Len=0 MSS=1440
1454 1595323809.336439 192.168.1.167 192.168.2.247 TCP 54 34142 → 3333 [ACK] Seq=1 Ack=1 Win=64800 Len=0
1455 1595323809.336485 192.168.1.167 192.168.2.247 TCP 1078 34142 → 3333 [PSH, ACK] Seq=1 Ack=1 Win=64800 Len=1024
1456 1595323809.336499 192.168.1.167 192.168.2.247 TCP 1078 34142 → 3333 [ACK] Seq=1025 Ack=1 Win=64800 Len=1024
1457 1595323809.336501 192.168.1.167 192.168.2.247 TCP 1078 34142 → 3333 [PSH, ACK] Seq=2049 Ack=1 Win=64800 Len=1024
1458 1595323809.336512 192.168.1.167 192.168.2.247 TCP 1078 34142 → 3333 [ACK] Seq=3073 Ack=1 Win=64800 Len=1024
1459 1595323809.336514 192.168.1.167 192.168.2.247 TCP 1078 34142 → 3333 [PSH, ACK] Seq=4097 Ack=1 Win=64800 Len=1024
1460 1595323809.340092 192.168.2.247 192.168.1.167 TCP 60 3333 → 34142 [ACK] Seq=1 Ack=2049 Win=3696 Len=0
1461 1595323809.343812 192.168.2.247 192.168.1.167 TCP 60 3333 → 34142 [ACK] Seq=1 Ack=4097 Win=1648 Len=0
1462 1595323809.345874 192.168.2.247 192.168.1.167 TCP 60 3333 → 34142 [ACK] Seq=1 Ack=5121 Win=2672 Len=0
1463 1595323809.345883 192.168.1.167 192.168.2.247 TCP 1078 34142 → 3333 [ACK] Seq=5121 Ack=1 Win=64800 Len=1024
1464 1595323809.345885 192.168.1.167 192.168.2.247 TCP 1078 34142 → 3333 [ACK] Seq=6145 Ack=1 Win=64800 Len=1024
1465 1595323809.345890 192.168.1.167 192.168.2.247 TCP 547 34142 → 3333 [FIN, ACK] Seq=7169 Ack=1 Win=64800 Len=493
1466 1595323809.354268 192.168.2.247 192.168.1.167 TCP 60 3333 → 34142 [ACK] Seq=1 Ack=7169 Win=624 Len=0
1467 1595323809.358857 192.168.2.247 192.168.1.167 TCP 60 3333 → 34142 [ACK] Seq=1 Ack=7663 Win=130 Len=0
1468 1595323809.360081 192.168.2.247 192.168.1.167 TCP 60 [TCP Window Update] 3333 → 34142 [ACK] Seq=1 Ack=7663 Win=5744 Len=0
Increase CONFIG_LWIP_TCP_RECVMBOX_SIZE cannot help generally because TCP segment can have any size.
How can I reproduce the problem??
I repeat:
LWIP as TCP server and "nc" as TCP client. Test code:
# while true; do nc 192.168.2.247 3333 < file_size_7661.jpg ; done
If you are not able to construct simple LWIP TCP server:
static void tcp_server_task(void *pvParameters)
{
char addr_str[128];
struct sockaddr_in dest_addr = {
.sin_addr.s_addr = htonl(INADDR_ANY),
.sin_family = AF_INET,
.sin_port = htons(TCP_PORT),
};
int listen_sock = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
if (listen_sock < 0) {
ESP_LOGE(TAG_TCP, "Unable to create socket: errno %d", errno);
vTaskDelete(NULL);
return;
}
ESP_LOGI(TAG, "Socket created");
int err = bind(listen_sock, (struct sockaddr *)&dest_addr, sizeof(dest_addr));
if (err != 0) {
ESP_LOGE(TAG_TCP, "Socket unable to bind: errno %d", errno);
goto CLEAN_UP;
}
ESP_LOGI(TAG_TCP, "Socket bound, port %d", TCP_PORT);
err = listen(listen_sock, 1);
if (err != 0) {
ESP_LOGE(TAG_TCP, "Error occurred during listen: errno %d", errno);
goto CLEAN_UP;
}
while (1) {
ESP_LOGI(TAG_TCP, "Socket listening");
struct sockaddr_in source_addr;
uint addr_len = sizeof(source_addr);
int sock = accept(listen_sock, (struct sockaddr *)&source_addr, &addr_len);
if (sock < 0) {
ESP_LOGE(TAG_TCP, "Unable to accept connection: errno %d", errno);
break;
}
inet_ntoa_r(((struct sockaddr_in *)&source_addr)->sin_addr.s_addr, addr_str, sizeof(addr_str) - 1);
ESP_LOGI(TAG_TCP, "Socket accepted ip address: %s", addr_str);
uint8_t *data = jpeg;
int len;
do {
len = recv(sock, data, sizeof(jpeg) - (data-jpeg), 0);
if (len < 0) {
ESP_LOGE(TAG_TCP, "Error occurred during receiving: errno %d", errno);
} else if (len == 0) {
ESP_LOGI(TAG_TCP, "Connection closed - received %d", (data-jpeg));
} else {
ESP_LOGI(TAG_TCP, "Received %d bytes", len);
data += len;
}
} while (len > 0);
shutdown(sock, 0);
close(sock);
}
CLEAN_UP:
close(listen_sock);
vTaskDelete(NULL);
}
It fails randomly (blocked in recv()) in few hundred repetitions (few seconds).
hi @mcerveny
The test details are shown in the figure above, and the specific test steps are as follows:
Environment
- IDF version : master
- Build System: |idf.py
- Operating System: Windows
- Power Supply: USB
reproduce code
-
jpeg = heap_caps_malloc(400 * 1024, MALLOC_CAP_SPIRAM | MALLOC_CAP_8BIT); Allocate 400k memory from PSRAM
-
I've made a few changes to your code.
do {
len = recv(sock, data, 2000, 0); // Fixed number of bytes at a time
if (len < 0) {
ESP_LOGE(TAG_TCP, "Error occurred during receiving: errno %d", errno);
} else if (len == 0) {
ESP_LOGI(TAG_TCP, "Connection closed - received %d", (data-jpeg));
} else {
ESP_LOGI(TAG_TCP, "Received %d bytes", len); //delete this log,Because this log affects the received data
data += len;
}
} while (len > 0); -
while true; do nc 192.168.1.102 3333 < 11.jpg ; done // send 372k jpg
result
I suspect that your jpeg cache is not large enough to cause this problem.
hi @mcerveny
Do you have any questions? If not, I will close this issue. If you have any problems, please open this question at any time. Thank you
I repeat, the problem is not in "user buffers" or "logging output!" but in LWIP packet handling and LWIP internal buffers (RECVMBOX + CONFIG_LWIP_TCP_WND_DEFAULT). The problem is activated with last six packets (CONFIG_LWIP_TCP_RECVMBOX_SIZE) and it is irrelevant how many packet were precessed before.
Did you successfully test your patches with small repeated tcp transfers (like 7661 bytes) ?
This is critical bug for me and I avoid to use LWIP and/or ESP32 in my projects.
hi@mcerveny
What version of IDF do you use?I didn't reproduce this problem on the master
Yes, the patch "solution.TXT" works for me. The patch recovered 100 from 11948 tested connections.
hi @mcerveny
I think this solution can solve your problem, right?
Do you have any problems in the test?
hi @mcerveny
Do you have any questions? If not, I will close this issue. If you have any problems, please reopen. Thank you
Issue should be closed AFTER final patch release, testing and integration (with pull request acceptance or pushed by espressif @david-cermak ), This patch is WIP and it should be probably reworked.
@mcerveny Yes, the above patch was just a quick fix. We have an internal MR under evaluation that should resolve the issue. Please keep this open, it will close automatically upon publishing the actual fix.
PS: Thanks for reporting this issue and sharing the steps to reproduce!
So there is a bug in lwIP and you have a fix for it. Not you as the vendor gaining something from shipping lwIP, but the user has reported this bug to us (before it has been fixed, no update with the fix has been sent to us). I would greatly appreciate it if both the bug and the fix would have made it back to use by vendors benefiting from our work. By not doing that, you drive more and more open source developers towards GPL-like licenses!
Best Regards,
Simon Goldschmidt (current maintainer of lwIP)
no update with the fix has been sent to us
It was linked ( https://savannah.nongnu.org/bugs/?58797 ) to here from beginning to see details about bug, test case and solution. As I wrote to LwIP bug report I was not able to test it with LwIP upstream. I am not responsible that LwIP community ignore this reported bug for more than 1 year (not assigned, no status set...).
Best Regards,
Martin Cerveny
@goldsimon This was also posted as patch, too, linking the bug in lwIP upstream as well as this issue. There was even some discussion between us back in March, so I reworked the patch once or twice...
https://savannah.nongnu.org/patch/?10013