influxdata/line-protocol

Panic in syntax error handling

Closed this issue · 8 comments

I ran into this panic repeatedly while piping a large amount of transient data through Telegraf using this PR: influxdata/telegraf#9685

There is probably some malformed data in the stream, but it's difficult to pinpoint given the sheer volume of transient data. At a minimum, some bounds checking and an error that better indicates where the syntax problem is would be super useful.

2021-09-29T16:33:38Z I! http: panic serving 7.150.32.146:21370: runtime error: slice bounds out of range [216:215]
goroutine 26751 [running]:
net/http.(*conn).serve.func1()
        /tmp/golang-1.17/src/net/http/server.go:1801 +0xb9
panic({0x46982a0, 0xc03f633218})
        /tmp/golang-1.17/src/runtime/panic.go:1047 +0x266
github.com/influxdata/line-protocol/v2/lineprotocol.(*Decoder).syntaxErrorf(0xc024d39fee, 0xe, {0x4a63860, 0x3}, {0xc009601538, 0xc024d87101, 0xc009601558})
        /tmp/build/go/pkg/mod/github.com/influxdata/line-protocol/v2@v2.0.1-0.20210901123813-a12b539bfc76/lineprotocol/decoder.go:827 +0x1c8
github.com/influxdata/line-protocol/v2/lineprotocol.(*Decoder).NextField(0xc009601850)
        /tmp/build/go/pkg/mod/github.com/influxdata/line-protocol/v2@v2.0.1-0.20210901123813-a12b539bfc76/lineprotocol/decoder.go:412 +0x1d1
github.com/influxdata/telegraf/plugins/parsers/influx.nextMetric(0xc009601850, 0x65, 0x4ca87a0, 0x0)
        /tmp/build/git/telegraf-internal/plugins/parsers/influx/parser.go:278 +0x228
github.com/influxdata/telegraf/plugins/parsers/influx.(*StreamParser).Next(0xc009601828)
        /tmp/build/git/telegraf-internal/plugins/parsers/influx/parser.go:246 +0xe5
github.com/influxdata/telegraf/plugins/inputs/influxdb_listener.(*InfluxDBListener).handleWrite.func1({0x53b1f80, 0xc041a6fb20}, 0xc021971000)
        /tmp/build/git/telegraf-internal/plugins/inputs/influxdb_listener/influxdb_listener.go:316 +0x87a
net/http.HandlerFunc.ServeHTTP(0xc02192a9c5, {0x53b1f80, 0xc041a6fb20}, 0xc04d433db4984053)
        /tmp/golang-1.17/src/net/http/server.go:2046 +0x2f
github.com/influxdata/telegraf/internal.(*basicAuthHandler).ServeHTTP(0xc000b7c0a0, {0x53b1f80, 0xc041a6fb20}, 0x1)
        /tmp/build/git/telegraf-internal/internal/http.go:47 +0xf7
net/http.(*ServeMux).ServeHTTP(0xc041a6fb20, {0x53b1f80, 0xc041a6fb20}, 0xc021971000)
        /tmp/golang-1.17/src/net/http/server.go:2424 +0x149
github.com/influxdata/telegraf/plugins/inputs/influxdb_listener.(*InfluxDBListener).ServeHTTP(0xc000348300, {0x53b1f80, 0xc041a6fb20}, 0x6c3e80)
        /tmp/build/git/telegraf-internal/plugins/inputs/influxdb_listener/influxdb_listener.go:214 +0x68
net/http.serverHandler.ServeHTTP({0x5399d00}, {0x53b1f80, 0xc041a6fb20}, 0xc021971000)
        /tmp/golang-1.17/src/net/http/server.go:2878 +0x43b
net/http.(*conn).serve(0xc0347c0aa0, {0x53f2830, 0xc000c8e1b0})
        /tmp/golang-1.17/src/net/http/server.go:1929 +0xb08
created by net/http.(*Server).Serve
        /tmp/golang-1.17/src/net/http/server.go:3033 +0x4e8

@rogpeppe can you take a look at this? Thanks.
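
For context on what the message means: in Go, a slice expression whose lower bound is greater than its upper bound panics at runtime, and [216:215] indicates exactly that. A minimal sketch of the same failure mode (illustrative only, not the decoder code):

```go
package main

func main() {
	buf := make([]byte, 300)
	lo, hi := 216, 215 // lower bound greater than upper bound
	_ = buf[lo:hi]     // panics with: runtime error: slice bounds out of range [216:215]
}
```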

@oplehto I haven't found anything that stands out so far when inspecting the code. I've also been fuzzing it for the last hour or so (~120 million execs) and haven't found any crashes, so it seems like you've found a nice edge case!
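
The exact fuzzing setup isn't shown in this thread, but for anyone wanting to try it locally, a harness roughly like the following (using Go's native fuzzing from Go 1.18+; the test name and seed input are just placeholders) walks the decoder over arbitrary input and relies on the fuzzer to flag any panic:

```go
package lineprotocol_test

import (
	"testing"

	"github.com/influxdata/line-protocol/v2/lineprotocol"
)

// FuzzDecode walks every measurement, tag, and field of arbitrary input.
// Decode errors are expected on malformed data; only a panic counts as a failure.
func FuzzDecode(f *testing.F) {
	f.Add([]byte("m,host=a value=1 1600000000000000000\n"))
	f.Fuzz(func(t *testing.T, data []byte) {
		dec := lineprotocol.NewDecoderWithBytes(data)
		for dec.Next() {
			if _, err := dec.Measurement(); err != nil {
				continue
			}
			for {
				key, _, err := dec.NextTag()
				if err != nil || key == nil {
					break
				}
			}
			for {
				key, _, err := dec.NextField()
				if err != nil || key == nil {
					break
				}
			}
		}
	})
}
```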

Could you try to reproduce the issue again, but first update to the branch below? I've added a bit of instrumentation to it so the panic gives more context about where things go wrong.

go get github.com/influxdata/line-protocol/v2@v2.2.1-0.20211001085601-62fcfd697adb

Thanks!

Update: please use this version instead:

go get github.com/influxdata/line-protocol/v2@v2.2.1-0.20211001090429-a14672f62d41
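
If it's easier to pin this in go.mod rather than run go get, the equivalent require entry (same pseudo-version as above) is:

require github.com/influxdata/line-protocol/v2 v2.2.1-0.20211001090429-a14672f62d41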

@oplehto Just to confirm: you tested this with @rogpeppe's branch above and are still seeing the panic?

FWIW I've been fuzzing the code continuously on a 16-core machine for over 180 hours now and I haven't found a crash yet.
So a reproducer from @oplehto would be super useful!

@rogpeppe I agree that it would be nice, but the incoming data stream is huge. I suspect the problem originates from occasionally truncated data sent by embedded systems, and upgrading those embedded systems is not completely trivial.

I was using #9871, which has the correct parser version. I'll try to modify the code a bit to capture a larger chunk of data.
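
Purely as an illustration of one way to capture that data (not what was actually done here): a small helper that tees the HTTP request body to a file, so the exact bytes hitting the parser can be replayed later. The package, function name, and file path are made up.

```go
package capture

import (
	"io"
	"os"
)

// TeeBody wraps an HTTP request body so that every byte read by the parser is
// also appended to path, giving a replayable copy of the raw input.
// Call the returned close function once parsing is finished.
func TeeBody(body io.Reader, path string) (io.Reader, func() error, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o600)
	if err != nil {
		return nil, nil, err
	}
	return io.TeeReader(body, f), f.Close, nil
}
```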

Hi @oplehto - did you manage to make any progress getting a larger chunk of data?

Are you able to share any raw data with us that would help us to narrow this down?

I haven't seen this panic again in our live data stream since we moved all publishers from HTTP to UDP (I'm unable to share the raw pcap data due to its proprietary nature). Even without being able to reproduce it, it looks very likely that #51 resolves the issue, so I'm fine with closing this one.