fnproject/fn

racy test

rdallman opened this issue · 12 comments

=== RUN   TestBasicConcurrentDetachedExecution
fatal error: concurrent map read and map write

goroutine 1182 [running]:
runtime.throw(0x1006393, 0x21)
	/usr/local/go/src/runtime/panic.go:617 +0x72 fp=0xc00379f8b0 sp=0xc00379f880 pc=0x439e12
runtime.mapaccess1_faststr(0xef6880, 0xc001b3ea80, 0xff1dd1, 0xe, 0x0)
	/usr/local/go/src/runtime/map_faststr.go:21 +0x469 fp=0xc00379f920 sp=0xc00379f8b0 pc=0x41e9f9
net/textproto.MIMEHeader.Get(0xc001b3ea80, 0xff1dd1, 0xe, 0x0, 0x30)
	/usr/local/go/src/net/textproto/header.go:35 +0x5d fp=0xc00379f958 sp=0xc00379f920 pc=0x6a8add
net/http.Header.Get(...)
	/usr/local/go/src/net/http/header.go:47
github.com/fnproject/fn/api/agent.logCallFinish(0x118be20, 0xc003790780, 0xc0036a4aa8, 0xc001b3ea80, 0xc0000000ca)
	/home/circleci/go/src/github.com/fnproject/fn/api/agent/runner_client.go:446 +0xa8 fp=0xc00379f9e8 sp=0xc00379f958 pc=0xbcb9e8
github.com/fnproject/fn/api/agent.receiveFromRunner(0x1174280, 0xc00372be30, 0x1183ba0, 0xc003701fb0, 0xff1c81, 0xe, 0x1183ae0, 0xc003770120, 0xc003790300)
	/home/circleci/go/src/github.com/fnproject/fn/api/agent/runner_client.go:403 +0x64e fp=0xc00379ff98 sp=0xc00379f9e8 pc=0xbcafde
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc00379ffa0 sp=0xc00379ff98 pc=0x4695c1
created by github.com/fnproject/fn/api/agent.(*gRPCRunner).TryExec
	/home/circleci/go/src/github.com/fnproject/fn/api/agent/runner_client.go:217 +0xbbd

https://circleci.com/gh/fnproject/fn/5549 while it lasts

I can't figure out whether it's a race in the tests or a race in the code. It looks awfully suspicious since it's detached and we write the headers early in that case, but the race shows up in the LB, which seems funky.

oh, this looks legit. While the LB is finishing reading from the runner (the goroutine above), the runner_invoke.go code might already be writing headers. I noticed that most of these headers (syncResponseWriter or DetachedResponseWriter) are shallow copies of the original gin response header map.

thanks for the details. yeah, I buy that. I'm having trouble reproducing it locally. The fix doesn't seem straightforward: for detached, we probably shouldn't be handing over the same response writer that we're using to write the response back to the client, since we basically want to discard the response? For the sync writer, submit should return before it's used.

i've got a possible fix, or am working on one at least. I need to reread the detached code path to be sure.

I don't think the issue is specific to detached mode. In both modes, a shallow copy of the header map is being made, and while the headers are being set, the logger in the stack trace above may be reading them. The logger merely needs the status code from the runner; there's no reason for it to concurrently read the entire response header map.

thanks, I follow and agree about the logger. From what I can tell, this means that TryExec itself needs to be handed a response writer that is safe for writing the body and headers to? (At least, the issue does not seem to be within TryExec or receiveFromRunner themselves; they appear safe on their own, but not when TryExec is called multiple times with the same RunnerCall.)

edit: having trouble with words. At least, I'm finding it hard to see how to do this in receiveFromRunner alone, since that routine itself reads or writes the headers in various spots, the logger included. In any case, the issue seems deeper than the logger: there are other spots where the header map could be read or written concurrently, it just happened to crop up in the logger.

A resp writer that is safe is probably a good option. Otherwise, we'd need to:

  1. make sure that in receiveFromRunner, header.Set() (in case *pb.RunnerMsg_ResultStart:) is never executed after TryExec() returns
  2. have the logger read the fn-http-status from a local variable instead of accessing the shared header map concurrently

We can perhaps think of other options. IMHO, a concurrency-safe resp writer is itself a red flag. runner_client could instead create the headers/buffer locally and then move them, relinquishing ownership when done. We could avoid an unnecessary IO copy this way too.

agree about not doing a concurrency-safe respwriter, yeah. That option sounds best to me; I was trying to figure out where to do this. TryExec seems just fine. PlaceCall kind of stinks because every placer would have to do it; TryRunner is another possibility. TryExec is implemented by every runner, whereas every placer calls TryRunner.

not sure I follow about moving without copying, given the way the RunnerCall is done at the moment. If you want to take it, or explain, go ahead; at least it sounds nicer than copying if we can figure out a way to avoid it.

Moving without a copy is going to require interface changes in a bunch of places, since this is abstracted behind the resp writer interface. :-( For the short term, I think we can surgically fix up receiveFromRunner and TryExec().

no worries. The copy isn't too bad; I guess the sad part is that we're copying in two places, but we can probably change that, too.

Unfortunately we had to revert the change meant to fix this issue, as the proposed solution broke detached mode.