New Simple Easily Reproducable Catastrophic TCP Stack Failure Results in Low Heap Crash shown with WiFiServer, Latency > 100ms, and Larger Send Bytes
CircuitSerialKiller opened this issue · 17 comments
Basic Infos
Hardware
Hardware: ESP-12E, WeMos D1 Mini
Core Version: Latest github version
Description
I have to use the newer core on github for the
MEMP_NUM_TCP_PCB_TIME_WAIT = 5
to prevent crashing with frequent HTTP connections. (Issue 2767).
After switching to the latest github version of the ESP8266 core from version 2.3.0, I found that when accessing my HTTP server from VPN, or via the outside work with higher latency connections, HTTP responses from the ESP8266 will not make it to the client, and the ESP8266 will eventually run out of free heap until it fails. I have found this occurs only when the response is beyond a certain number of bytes in length as well, though I haven't determined the exact number of bytes. > 2048 bytes definitely demonstrates this issue. I tested tested via telnet direct to port 80 as well, and issuing HTTP commands. This does not occur with the 2.3.0 release.
I've modified the WiFiWebServer.ino file to easily demonstrate this issue. Just setup a NAT to your ESP8266 IP address with the appropriate ports, and from a remote site, cell phone, etc, try accessing the following URL repeatedly while watching the serial console. You'll see the free heap steadily decrease and never recover!
http://a.b.c.d/gpio/1
Where a.b.c.d is your outside IP address.
Settings in IDE
Module: NodeMCU 0.9
Flash Size: 4MB
CPU Frequency: 80Mhz
Flash Mode: qio
Flash Frequency: 40Mhz
Upload Using: SERIAL
Reset Method: nodemcu
Sketch
/*
* This sketch demonstrates how to set up a simple HTTP-like server.
* The server will set a GPIO pin depending on the request
* http://server_ip/gpio/0 will set the GPIO2 low,
* http://server_ip/gpio/1 will set the GPIO2 high
* server_ip is the IP address of the ESP8266 module, will be
* printed to Serial when the module is connected.
*/
//WiFiWebServer.ino
#include <ESP8266WiFi.h>
const char* ssid = "GoogCorp";
const char* password = "345SpearSt";
//********************Added for Crash Test********************
unsigned long connectionTimeout = 4000;
extern "C" {
#include "user_interface.h"
//needed for free heap check
}
unsigned long LastHeapReport;
//************************************************************
// Create an instance of the server
// specify the port to listen on as an argument
WiFiServer server(80);
void setup() {
Serial.begin(115200);
delay(10);
// prepare GPIO2
pinMode(2, OUTPUT);
digitalWrite(2, 0);
// Connect to WiFi network
Serial.println();
Serial.println();
Serial.print("Connecting to ");
Serial.println(ssid);
WiFi.begin(ssid, password);
while (WiFi.status() != WL_CONNECTED) {
delay(500);
Serial.print(".");
}
Serial.println("");
Serial.println("WiFi connected");
// Start the server
server.begin();
Serial.println("Server started");
// Print the IP address
Serial.println(WiFi.localIP());
}
void loop() {
//********************Added for Crash Test********************
//Display free heap every ~5s
if ((unsigned long) (millis() - LastHeapReport) > 5000) {
Serial.print("Free heap:");
Serial.println(system_get_free_heap_size());
LastHeapReport = millis();
}
//************************************************************
// Check if a client has connected
WiFiClient client = server.available();
if (!client) {
return;
}
// Wait until the client sends some data
Serial.println("new client");
//********************Modified for Crash Test********************
//Added connection timeout
unsigned long connectionStartTime = millis();
//High latency connections just hang here if no timeout
while(!client.available()){
delay(1);
if ((unsigned long) (millis() - connectionStartTime) > connectionTimeout) {
Serial.println("Connection Timeout");
client.stop();
return;
}
}
//************************************************************
// Read the first line of the request
String req = client.readStringUntil('\r');
Serial.println(req);
client.flush();
// Match the request
int val;
if (req.indexOf("/gpio/0") != -1)
val = 0;
else if (req.indexOf("/gpio/1") != -1)
val = 1;
else {
Serial.println("invalid request");
client.stop();
return;
}
// Set GPIO2 according to the request
digitalWrite(2, val);
client.flush();
//********************Modified for Crash Test********************
//Extended the response out and additional 2048 bytes
// Prepare the response
String s = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<!DOCTYPE HTML>\r\n<html>\r\nGPIO is now ";
s += (val)?"high":"low";
s += "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA";
s += "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA";
s += "</html>\n";
//************************************************************
// Send the response to the client
client.print(s);
delay(1);
Serial.println("Client disonnected");
// The client will actually be disconnected
// when the function returns and 'client' object is detroyed
}
Debug Messages
.........
WiFi connected
Server started
IP Redacted
Free heap:43848
Free heap:43848
Free heap:43088
new client
GET /gpio/1 HTTP/1.1
Client disonnected
Free heap:39680
Free heap:42880
Free heap:42880
new client
GET /gpio/1 HTTP/1.1
Client disonnected
Free heap:42696
new client
GET /gpio/1 HTTP/1.1
Client disonnected
Free heap:42512
Free heap:42512
new client
GET / HTTP/1.1
invalid request
new client
GET / HTTP/1.1
invalid request
new client
GET / HTTP/1.1
invalid request
new client
GET / HTTP/1.1
invalid request
new client
GET / HTTP/1.1
invalid request
new client
GET / HTTP/1.1
invalid request
new client
GET / HTTP/1.1
invalid request
new client
GET / HTTP/1.1
invalid request
new client
GET / HTTP/1.1
invalid request
new client
GET / HTTP/1.1
invalid request
Free heap:42144
Free heap:42144
Free heap:42144
Free heap:42144
Free heap:42144
Free heap:42144
Free heap:42144
Free heap:42144
Free heap:42144
Free heap:42144
Free heap:42144
Free heap:42144
new client
GET /gpio/1 HTTP/1.1
Client disonnected
Free heap:42144
Free heap:42144
Free heap:42144
Free heap:42144
Free heap:42144
Free heap:42144
new client
Connection Timeout
new client
GET /gpio/1 HTTP/1.1
Client disonnected
Free heap:38696
Free heap:38696
Free heap:38696
Free heap:38696
Free heap:38696
Free heap:38696
Free heap:38880
Free heap:39248
Free heap:39248
new client
GET /gpio/1 HTTP/1.1
Client disonnected
new client
Connection Timeout
Free heap:35472
Free heap:35616
Free heap:35616
Free heap:35616
new client
GET /gpio/1 HTTP/1.1
Client disonnected
Free heap:32168
Free heap:32168
Free heap:32168
Free heap:32168
Free heap:32168
Free heap:32352
Free heap:32352
Free heap:32352
Free heap:32352
new client
GET /gpio/1 HTTP/1.1
Client disonnected
new client
Connection Timeout
Free heap:28904
Free heap:29088
Free heap:29088
Free heap:29088
Free heap:29088
Free heap:29088
Free heap:29088
Free heap:29088
new client
GET /gpio/1 HTTP/1.1
Client disonnected
Free heap:25640
Free heap:25640
Free heap:25640
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
new client
GET /gpio/1 HTTP/1.1
Client disonnected
Free heap:22376
Free heap:22376
Free heap:22376
Free heap:22376
Free heap:22376
Free heap:22376
Free heap:22376
Free heap:22376
Free heap:22376
Free heap:22376
Free heap:22376
Free heap:22376
Free heap:22376
Free heap:22376
Free heap:22376
Free heap:22376
Free heap:22376
Free heap:22376
Free heap:22376
Free heap:22376
Free heap:22376
Free heap:22376
Free heap:22376
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
Free heap:25824
{d
I see only this commit to have relevant changes. try rolling them back and see if that changes your results
I have tried reproducing this last night, but seemingly my latency was not big enough. @CircuitSerialKiller any chance you can help narrowing down on the offending change using git bisect
?
Yes I can. I'm also going to grab a Wireshark capture. I should have a few hours this weekend to dedicate to this.
Thanks @igrr
Hi @igrr and @joelucid
The commit that introduced this issue is:
joelucid committed with igrr ClientConnection uses too much heap when streaming files #2871 (#2874)
The commit before this,
ivankravets committed with igrr Update @PlatformIO links
does not have this issue.
Sorry for the long delay before I could dedicate a few hours on a weekend to commit to tracking this down.
Hello,
I did try this with lwip2: it works with no bug.
With lwip1.4 it does not. I suspect that the bissected commit makes things work better, so lwip1.4 has too much data to deal with and it just does not work with too much stuffed buffers. This is the reason why I wanted to update lwip in the first place.
Here is the log of the sketch above, with a display every seconds, and available heap size.
Watch time:
WiFi connected
Server started
Free heap:42456 was:42456 diff:0 clients=0 time=1
[...]
Free heap:42456 was:42456 diff:0 clients=0 time=7
new client
GET /gpio/3 HTTP/1.1
invalid request
Free heap:42264 was:42264 diff:0 clients=1 time=8
pm open,type:2 0
Free heap:41504 was:41504 diff:0 clients=1 time=9
Free heap:41504 was:41504 diff:0 clients=1 time=10
Free heap:41504 was:41504 diff:0 clients=1 time=11
Free heap:41504 was:41504 diff:0 clients=1 time=12
Free heap:41504 was:41504 diff:0 clients=1 time=13
new client
GET /gpio/1 HTTP/1.1
Client disonnected
Free heap:41336 was:41504 diff:-168 clients=2 time=14
[...]
Free heap:41336 was:41504 diff:-168 clients=2 time=24
new client
GET /gpio/1 HTTP/1.1
Client disonnected
[...]
lots of requests from now
Free heap:4376 was:41504 diff:-37128 clients=214 time=35
[...]
Client disonnected
Free heap:2864 was:41504 diff:-38640 clients=232 time=38
Free heap:2864 was:41504 diff:-38640 clients=232 time=39
Free heap:2864 was:41504 diff:-38640 clients=232 time=40
Free heap:2864 was:41504 diff:-38640 clients=232 time=41
[...]
lwip starts to timeout something
Free heap:2864 was:41504 diff:-38640 clients=232 time=121
Free heap:3032 was:41504 diff:-38472 clients=232 time=122
[...]
Free heap:3200 was:41504 diff:-38304 clients=232 time=128
[...]
Free heap:19328 was:41504 diff:-22176 clients=232 time=143
[...]
Free heap:41336 was:41504 diff:-168 clients=232 time=150
Free heap:41504 was:41504 diff:0 clients=232 time=151
so about ~2min after burst requests, memory is released back by lwip timers.
I am not pushing to lwip2 in v2.4.0, I still think this is too early.
But it is worth saying that this very bug will not be easily solved using lwip1.4.
Edit: in the meantime, same sketch with lwip1.4 did not release memory back after 2368seconds (~40mn).
I witnessed this same issue with the webserver, and was forced to move to the asyncwebserver. I wish I had seen this before, I would have tested lwip2 instead.
@d-a-v I'm going a bit crazy with the cleanup, but I think I saw a PR with lwip2. Or was it a branch? Anyways, is that up to date? I may just try testing it anyways.
@d-a-v thanks, I will get to this right after I finish my current pendings for this repo.
Hi, this problem wasn't introduced with an LWIP change. This problem still exists in the current Arduino Core with LWIP v2.1.0-14-g33f234f.
The problem was introduced in the January 31st 2017 commit mentioned above with the chunking changes to ClientContext.h. If I set _write_chunk_size to 2920, the problem is gone.
Without a deep dive into the ESP8266 core and following tcp_write, I can't tell you exactly why. However, I believe chunking should be left out of the picture by default at the moment. @joelucid in #2871 mentions throughput as a justification for this change being acceptable, but latency and waiting for acknowledgements has a major effect on throughput when sending small chunks. I suspect reducing TCP writes to 256 byte blocks are creating a problem elsewhere in the code with latency since cumulative acknowledgments to all those small blocks will take quite some time.
@CircuitSerialKiller Yes it still does not work as stated above. It says:
- the commit on january was efficient (it is not an lwip change, but a WiFi* buffer management improvement)
- it was too efficient for lwip1.4
- current implementation of lwip1.4 does not suffer beeing shaked too much
- updating to lwip2 seems to solve the problem
However once - let's be imaginative enough - lwip2 is merged, I guess the chunk size will be reconsidered because it will have no impact on lwip stability. So it could become configurable for a bandwidth purpose (and TCP_MSS too). There will be a tradeoff between bandwidth and free HEAP.
@CircuitSerialKiller @d-a-v how does release 2.4.0 work out? I would think no trouble with lwip MSS=536.
What is the status here?
I suspect reducing TCP writes to 256 byte blocks are creating a problem elsewhere in the code with latency since cumulative acknowledgments to all those small blocks will take quite some time.
A year after I can tell you that these 256B chunks are not impacting latency (or very lightly). tcp_write()
itself does not send packets but only bufferizes data so this is not impacting acks. Anyway those chunks are removed in current git head.
If you can confirm the OP issue is solved, please close.
Closing due to age and lack of response.