luvit/lit

coro-http.request hangs on closed connection

SinisterRectus opened this issue · 14 comments

On rare occasions, requests infinitely hang. It's somewhat difficult to debug the coro libs because a lot of functionality is hidden in closures, but I think I've narrowed down the issue. I'm documenting what I have here until I or someone else can discover more about this.

  • Receive connection: close instead of connection: keep-alive in an HTTP response.

In http-codec:

  • keepAlive becomes false
      elseif lowerKey == "connection" then
        head.keepAlive = lower(value) == "keep-alive"
  • mode becomes decodeRaw
    if head.keepAlive and (not (chunkedEncoding or (contentLength and contentLength > 0)))
       or (head.method == "GET" or head.method == "HEAD") then
      mode = decodeEmpty
    elseif chunkedEncoding then
      mode = decodeChunked
    elseif contentLength then
      bytesLeft = contentLength
      mode = decodeCounted
    elseif not head.keepAlive then
      mode = decodeRaw
    end
  function decodeRaw(chunk, index)
    if #chunk < index then return end
    return sub(chunk, index)
  end
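
To make the branch concrete, here is the failing response walked through that selection (a standalone sketch, not the codec itself; note that head.method is nil when decoding a response):

  local head = { code = 204, keepAlive = false }     -- from "HTTP/1.1 204 No Content" + "Connection: Close"
  local chunkedEncoding, contentLength = false, nil  -- no Transfer-Encoding, no Content-Length headers
  print(head.keepAlive and (not (chunkedEncoding or (contentLength and contentLength > 0)))
     or (head.method == "GET" or head.method == "HEAD"))  --> false, so not decodeEmpty
  print(chunkedEncoding)                                   --> false, so not decodeChunked
  print(contentLength)                                     --> nil,   so not decodeCounted
  print(not head.keepAlive)                                --> true,  so mode = decodeRaw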

In coro-http:

  • read returns nil for item
  body = {}
  if req.method == "HEAD" then
    connection.reset()
  else
    while true do
      local item = read()
      if not item then
        res.keepAlive = false
        break
      end
      if #item == 0 then
        break
      end
      body[#body + 1] = item
    end
  end
  • keepAlive is false, so the connection is closed by calling write with no arguments, but it never returns.
  if res.keepAlive then
    saveConnection(connection)
  else
    write()
  end
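
For context, here is a rough, simplified sketch (not the exact coro-http code) of how the tail of the request function hands results back, showing where the hang leaves the caller stuck:

  if res.keepAlive then
    saveConnection(connection)      -- keep the socket around for reuse
  else
    write()                         -- close the write side; this is the call that never returns
  end
  return res, table.concat(body)    -- never reached when write() hangs, so request() never returns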

The following reproduces the issue:

local weblit = require('weblit')

weblit.app.bind {host = '127.0.0.1', port = 1337}

.use(require('weblit-auto-headers'))

.route({
    method = "DELETE",
    path = "/foo",
}, function (req, res, go)
    if req.query and req.query.q == '1' then
        res.headers['Connection'] = 'Close'
    end
    res.code = 204
    res.body = nil
end)

.start()

local http = require('coro-http')

coroutine.wrap(function()
    p('req1', http.request("DELETE", "http://127.0.0.1:1337/foo"))
    p('req2', http.request("DELETE", "http://127.0.0.1:1337/foo?q=1"))
end)()

The same issue occurs; write never returns here:

  if res.keepAlive then
    saveConnection(connection)
  else
    write()
  end

The trace after calling write is:

coro-wrapper.lua:137
coro-wrapper.lua:138
coro-channel.lua:124
coro-channel.lua:130
coro-channel.lua:131
coro-channel.lua:132
coro-channel.lua:39
coro-channel.lua:40
coro-channel.lua:23
coro-channel.lua:24
coro-channel.lua:25
coro-channel.lua:26
coro-channel.lua:27
coro-channel.lua:74
coro-channel.lua:78
coro-channel.lua:27
coro-channel.lua:31
coro-channel.lua:32
coro-channel.lua:34
coro-channel.lua:133
coro-channel.lua:117
coro-channel.lua:120
coro-channel.lua:133
coro-channel.lua:134
coro-channel.lua:71
coro-channel.lua:89
coro-channel.lua:39
coro-channel.lua:40
coro-channel.lua:23
coro-channel.lua:81
coro-channel.lua:85
coro-channel.lua:86
coro-channel.lua:87
coro-channel.lua:88
coro-channel.lua:58
coro-channel.lua:64
coro-channel.lua:65
coro-channel.lua:66
coro-channel.lua:67
coro-channel.lua:68
coro-channel.lua:71
coro-channel.lua:89
coro-channel.lua:39
coro-channel.lua:40
coro-channel.lua:23
coro-channel.lua:24
coro-channel.lua:25
coro-channel.lua:26
coro-channel.lua:27
coro-channel.lua:74
coro-channel.lua:78
coro-channel.lua:27
coro-channel.lua:31
coro-channel.lua:32
coro-channel.lua:34

Looks like it hangs right after trying to close the socket.

A quick fix for this is to change

  function decodeRaw(chunk, index)
    if #chunk < index then return end
    return sub(chunk, index)
  end

to

  function decodeRaw(chunk, index)
    if #chunk < index then return "", index end
    return sub(chunk, index)
  end

I don't know if that's the correct fix, though.

Here's the full stack trace for the request.

http.txt

Additionally, I'm not sure why GET is included in (head.method == "GET" or head.method == "HEAD"). Only a HEAD response is guaranteed not to contain a body.

Any idea what situation causes the chunk to be smaller than the index? I'm sure I put that check in there because I ran into it somewhere and it was breaking stuff.

Returning an empty string signals the end of the body. We need to make sure we don't do this if there is more body coming. That's why I'd like to understand what causes this situation in the first place.

Good question. The index is the chunk length + 1. The coro-channel dispatcher doesn't show anything unusual.

'<-'    'HTTP/1.1 204 No Content\r\nServer: false\r\nDate: Thu, 14 Sep 2017 14:19:05 GMT\r\nConnection: Keep-Alive\r\n\r\n'
'<-'    nil
'<-'    'HTTP/1.1 204 No Content\r\nConnection: Close\r\nServer: false\r\nDate: Thu, 14 Sep 2017 14:19:05 GMT\r\n\r\n'
'<-'    nil

The coro-wrapper goes into the decoder with

'HTTP/1.1 204 No Content\r\nServer: false\r\nDate: Thu, 14 Sep 2017 14:26:04 GMT\r\nConnection: Keep-Alive\r\n\r\n'     1
'HTTP/1.1 204 No Content\r\nServer: false\r\nDate: Thu, 14 Sep 2017 14:26:04 GMT\r\nConnection: Keep-Alive\r\n\r\n'     104
'HTTP/1.1 204 No Content\r\nServer: false\r\nDate: Thu, 14 Sep 2017 14:26:04 GMT\r\nConnection: Keep-Alive\r\n\r\n'     104
'HTTP/1.1 204 No Content\r\nConnection: Close\r\nServer: false\r\nDate: Thu, 14 Sep 2017 14:26:04 GMT\r\n\r\n'  1
'HTTP/1.1 204 No Content\r\nConnection: Close\r\nServer: false\r\nDate: Thu, 14 Sep 2017 14:26:04 GMT\r\n\r\n'  99

and comes out with

{ { 'Server', 'false' }, { 'Date', 'Thu, 14 Sep 2017 14:27:10 GMT' }, { 'Connection', 'Keep-Alive' }, code = 204, reason = 'No Content', version = 1.1, keepAlive = true }      104
''      104
nil     nil
{ { 'Connection', 'Close' }, { 'Server', 'false' }, { 'Date', 'Thu, 14 Sep 2017 14:27:10 GMT' }, code = 204, reason = 'No Content', version = 1.1, keepAlive = false }      99
nil     nil

I'm not sure why there are 3 reads in the first request and 2 in the second.

It looks like the buffer is small enough to be decoded in one pass. What happens comes down to

    if head.keepAlive and (not (chunkedEncoding or (contentLength and contentLength > 0)))
       or (head.method == "GET" or head.method == "HEAD") then
      mode = decodeEmpty
    elseif chunkedEncoding then
      mode = decodeChunked
    elseif contentLength then
      bytesLeft = contentLength
      mode = decodeCounted
    elseif not head.keepAlive then
      mode = decodeRaw
    end

When keepAlive is true (and the other conditions are met), decodeEmpty returns an empty string. Otherwise, as in this case, decodeRaw returns nil because the end of the buffer has been reached; there is no body after the headers on the DELETE request.

Hmm, I guess it's length + 1 because the end index is inclusive and equals the length in Lua's 1-based indexing.
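
A quick Lua-level illustration of that (nothing codec-specific, just string.sub with 1-based inclusive indices):

local chunk = "abc"
print(chunk:sub(1, #chunk))    --> "abc"  (the end index equals the length)
print(chunk:sub(#chunk + 1))   --> ""     (index one past the end: nothing left to read)
print(#chunk < #chunk + 1)     --> true   (so decodeRaw's guard fires and it returns nil)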

It seems to me that the client is trying to read a body that does not exist. Interestingly, if I change res.body = nil to res.body = '' in the weblit server example, the client request succeeds. I suppose the empty string is made meaningful by the extra \r\n.

Should a special case be added so that decodeEmpty is called for a 204 response?
Should http-codec be handling contentLength == 0?

So the idea is that the decoder will always emit a table for the head, followed by zero or more non-empty body chunks, followed by exactly one empty string to signify the end of body. After that the cycle repeats so that a single TCP connection can be reused for multiple http requests.
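
A minimal sketch of a consumer that relies on that contract (the read function here is hypothetical, standing in for whatever yields the decoder's output one item at a time; this is not coro-http's actual internals):

local function readMessage(read)
  local head = read()                -- always a table describing the head
  local parts = {}
  while true do
    local item = read()
    if item == "" then break end     -- exactly one empty string marks the end of the body
    parts[#parts + 1] = item         -- zero or more non-empty chunks before that
  end
  return head, table.concat(parts)   -- the cycle can then repeat on the same connection
end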

Of course, if the connection gets upgraded (via websockets), then the http-codec gets disconnected entirely and its internal state is GCed.

Right... but that doesn't happen. Again, coro-http.request never returns.

Any thoughts on this? I cannot figure out how this is breaking. Debugging shows that the socket closes, but instead of control bubbling back up to coro-http, execution jumps right to the Luvit exit sequence.

...
coro-http.lua,158
coro-http.lua,159
coro-http.lua,160
coro-http.lua,169
coro-http.lua,172 -- last call in coro-http is `write()`
coro-wrapper.lua,137
coro-wrapper.lua,138
coro-channel.lua,124
coro-channel.lua,130
coro-channel.lua,131
coro-channel.lua,132
coro-channel.lua,39
coro-channel.lua,40
coro-channel.lua,23
coro-channel.lua,24
coro-channel.lua,25
coro-channel.lua,26
coro-channel.lua,27
coro-channel.lua,74
coro-channel.lua,78
coro-channel.lua,27
coro-channel.lua,31
coro-channel.lua,32
coro-channel.lua,34
coro-channel.lua,133
coro-channel.lua,117
coro-channel.lua,120
coro-channel.lua,133
coro-channel.lua,134
coro-channel.lua,71
coro-channel.lua,89
coro-channel.lua,39
coro-channel.lua,40
coro-channel.lua,23
bundle:init.lua,53
bundle:init.lua,55
bundle:init.lua,57
...

I suspect some coroutine issue, but I do not know enough about the expected behavior of the libraries to debug this.

Again, here is an example that reproduces the issue:

local weblit = require('weblit')

weblit.app.bind {host = '127.0.0.1', port = 1337}

.use(require('weblit-auto-headers'))

.route({
    method = "DELETE",
    path = "/foo",
}, function (req, res)
    if req.query and req.query.q == '1' then
        res.headers['Connection'] = 'Close'
    end
    res.code = 204
    res.body = nil
end)

.start()
local http = require('coro-http')

coroutine.wrap(function()

	debug.sethook(function(_, line)
		local source = debug.getinfo(2, 'S').source:match('([^/\\]+)$')
		io.write(source, ',', line, '\n') -- luvit print fails after a certain point
	end, 'l')

	-- p(http.request("DELETE", "http://127.0.0.1:1337/foo")) -- good request
	p(http.request("DELETE", "http://127.0.0.1:1337/foo?q=1")) -- bad request

end)()

Also interesting: if you try to print debug info with print, it fails after a certain point, while io.write continues to work.

Okay, I've just managed to isolate the issue to an extent.

In coro-channel:

local function wait()
  local thread = coroutine.running()
  return function (err)
    assert(coroutine.resume(thread, err))
  end
end

if chunk == nil then
  closer.written = true
  closer.check()
  socket:shutdown(wait())
  return coroutine.yield()
end

The coroutine seems to never be resumed on socket shutdown; the wait closure is never called.
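
To see what that means in practice, here is the same callback-to-coroutine bridge in isolation, with a uv timer standing in for the shutdown callback (illustrative only; sleep is a made-up helper, and the script assumes luvit is driving the uv event loop):

local uv = require('uv')

local function sleep(ms)
  local thread = coroutine.running()
  local timer = uv.new_timer()
  timer:start(ms, 0, function ()
    timer:close()
    -- mirrors the wait() closure: resume the parked thread from the callback
    assert(coroutine.resume(thread))
  end)
  -- parked here until the callback fires; a callback that never fires is
  -- exactly the hang seen in coro-http.request
  return coroutine.yield()
end

coroutine.wrap(function ()
  sleep(100)  -- returns only because the timer callback actually runs
end)()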

I made a PR to address the handling of the shutdown. In this case, the specific error is EPIPE: broken pipe.
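
For reference, one possible shape for that kind of handling, applied to the snippet above (a rough sketch only, not necessarily what the PR does; it assumes luv reports a synchronous shutdown failure through the return value, in which case the callback never fires):

if chunk == nil then
  closer.written = true
  closer.check()
  local req, err = socket:shutdown(wait())
  if not req then
    -- shutdown failed immediately (e.g. "EPIPE: broken pipe"); the wait()
    -- callback will never run, so don't park the coroutine waiting for it
    return nil, err
  end
  return coroutine.yield()
end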