riemann/riemann

websockets stream broken in 0.3.4

dch opened this issue · 1 comments

dch commented

despite there being no source changes between git diff 1.3.2..HEAD -- src/riemann/transport/websockets.clj this is clearly broken, as riemann no longer sends the 101 Switching Protocols reponse first, and instead it seems to be interleaved with the actual data coming first -- I can only assume this is a dependency change in projects.clj somewhere.

On 0.3.2 this is what websockets does correctly:

sudo ngrep -qid ztagim5o45dhe4c -W byline -S 100000 -O /tmp/ws port 5556
ztagim5o45dhe4c: no IPv4 address assigned: Can't assign requested address
interface: ztagim5o45dhe4c
filter: ( port 5556 ) and ((ip || ip6) || (vlan && (ip || ip6)))
output: /tmp/ws

T fca2:927d:4de2:8e50:6c98::1:44319 -> fca2:927d:4d50:715c:f0b4::7:5556 [AP] #4
GET /index?subscribe=true&query=%28tagged+%22notification%22%29+and+%28state+%3D+%22critical%22%29 HTTP/1.1.
Host: riemann.zt.example.org:5556.
Connection: Upgrade.
Upgrade: WebSocket.
Sec-WebSocket-Key: blah
Sec-WebSocket-Version: 13.
.


T fca2:927d:4d50:715c:f0b4::7:5556 -> fca2:927d:4de2:8e50:6c98::1:44319 [AP] #6
HTTP/1.1 101 Switching Protocols.
Upgrade: websocket.
Connection: Upgrade.
Sec-Websocket-Accept: blah
Content-Length: 0.
Server: http-kit.
Date: Sat, 23 Nov 2019 21:57:43 GMT.
.


T fca2:927d:4d50:715c:f0b4::7:5556 -> fca2:927d:4de2:8e50:6c98::1:44319 [AP] #8
.~..{"host":"f02.example.org","service":"uptime/uptime","state":"critical","description":"Host f02.example.org, plugin uptime type uptime: Data source \"value\" is currently 1498.000000. That is below the failure threshold of 86400.000000.","metric":1498.0,"tags":["notification","collectd","FreeBSD","koans"],"time":"2019-11-23T21:57:29.862Z","ttl":180,"plugin":"uptime","type":"uptime","DataSource":"value"}.~..{"host":"i09.example.org","service":"curl-frontend/response_code","state":"critical","description":"Host i09.example.org, plugin curl (instance frontend) type response_code: Data source \"value\" is currently 400.000000. That is above the failure threshold of 299.000000.","metric":400.0,"tags":["notification","collectd","FreeBSD","koans"],"time":"2019-11-23T21:57:40.423Z","ttl":180,"plugin":"curl","plugin_instance":"frontend","type":"response_code","DataSource":"value"}

T fca2:927d:4d50:715c:f0b4::7:5556 -> fca2:927d:4de2:8e50:6c98::1:44319 [AP] #10
.~..{"host":"f01.example.org","service":"uptime/uptime","state":"critical","description":"Host f01.example.org, plugin uptime type uptime: Data source \"value\" is currently 1379.000000. That is below the failure threshold of 86400.000000.","metric":1379.0,"tags":["notification","collectd","FreeBSD","koans"],"time":"2019-11-23T21:57:39.296Z","ttl":180,"plugin":"uptime","type":"uptime","DataSource":"value"}

T fca2:927d:4d50:715c:f0b4::7:5556 -> fca2:927d:4de2:8e50:6c98::1:44319 [AP] #12
.~..{"host":"f02.example.org","service":"uptime/uptime","state":"critical","description":"Host f02.example.org, plugin uptime type uptime: Data source \"value\" is currently 1508.000000. That is below the failure threshold of 86400.000000.","metric":1508.0,"tags":["notification","collectd","FreeBSD","koans"],"time":"2019-11-23T21:57:39.842Z","ttl":180,"plugin":"uptime","type":"uptime","DataSource":"value"}

but on 0.3.5 at a1bbfc3

T fca2:927d:4de2:8e50:6c98::1:44437 -> fca2:927d:4d50:715c:f0b4::7:5556 [AP] #4
GET /index?subscribe=true&query=%28tagged+%22notification%22%29+and+%28state+%3D+%22critical%22%29 HTTP/1.1.
Host: riemann.zt.example.org:5556.
Connection: Upgrade.
Upgrade: WebSocket.
Sec-WebSocket-Key: blah
Sec-WebSocket-Version: 13.
.


T fca2:927d:4d50:715c:f0b4::7:5556 -> fca2:927d:4de2:8e50:6c98::1:44437 [AP] #6
.~..{"host":"i09.example.org","service":"uptime/uptime","state":"critical","description":"Host i09.example.org, plugin uptime type uptime: Data source \"value\" is currently 1933.000000. That is below the failure threshold of 86400.000000.","metric":1933.0,"tags":["notification","collectd","FreeBSD","koans"],"time":"2019-11-23T22:02:50.316Z","ttl":180,"plugin":"uptime","type":"uptime","DataSource":"value"}

T fca2:927d:4d50:715c:f0b4::7:5556 -> fca2:927d:4de2:8e50:6c98::1:44437 [AP] #8
.~..{"host":"f01.example.org","service":"uptime/uptime","state":"critical","description":"Host f01.example.org, plugin uptime type uptime: Data source \"value\" is currently 1689.000000. That is below the failure threshold of 86400.000000.","metric":1689.0,"tags":["notification","collectd","FreeBSD","koans"],"time":"2019-11-23T22:02:49.411Z","ttl":180,"plugin":"uptime","type":"uptime","DataSource":"value"}
HTTP/1.1 101 Switching Protocols. *********************************************************
Upgrade: websocket.
Connection: Upgrade.
Sec-Websocket-Accept: blah
Content-Length: 0.
Server: http-kit.
Date: Sat, 23 Nov 2019 22:02:59 GMT.
.

Above is from raw network packet capture, using a custom racket websockets cat tool, I have seen the same with the excellent https://github.com/vi/websocat

dch commented

I will endeavour to see if upstream httpkit reports this or not, this looks like a concurrency / race condition.

I see http-kit/http-kit#305 & http-kit/http-kit#375 that might be related, hard to tell.

Indeed, reverting to http-kit 2.2.0 resolves this issue. Was there a reason to move to 2.4.0-a3? This was done in ccd7068 & 514087b.