Race condition - styx server occasionally responds to messages with a later message's tag number
marzhall opened this issue · 2 comments
Hi!
In using styx to work on a message queue, I've come across a situation in which Tag numbers are used erroneously that seems to happen when a number of different processes are polling on a file from one session.
The error
Using the logging, the error looks like this:
→ 001 Tread fid=3 offset=58 count=8110
→ 002 Tstat fid=4
← 002 Rstat type=0 dev=0 qid="type=0 ver=0 path=3" mode=755 atime=1573318800 mtime=1573318800 length=0 name=".kyuu_ctl" uid="
← 002 Rread count=0
→ 002 Tread fid=1 offset=0 count=8168
→ 002 Tclunk fid=1
Note the Tread
with tag 001 is being responded to with a Rread
with tag 002 - which seems to happen after a Tstat comes in with Tag 002 and runs between the Tread
and the Rread
.
The above offset results in an error in one of two ways:
- When using 9pfuse, for a time the offsets will continue - e.g., 9pfuse sends a message with tag 002, and styx returns with 003 - until Styx finally gets fed up and throws an error stating "no one waiting on tag ".
- When using
mount -t 9p
, styx throws an error "Client has reused a tag.:
That said, as it's a timing issue, I think it may be possible for it to show both behaviours on either 9p back-end.
Example
The above scenario was generated by running this repo, which uses the styx library, and then having two bash processes constantly cat
ing a file exposed by the service:
$: kyuuD -D -v &
$: 9pfuse -D 127.0.0.1:5640 ~/kyuus
$: while true; do cat ~/kyuus/.kyuu_ctl; done &
$: while true; do cat ~/kyuus/.kyuu_ctl; done &
Almost instantly, you should get an error "9pfuse: no one waiting on tag ."
Notes:
I did some digging, and created a branch with something that makes things far more stable, which I show in the Pull request here: #19
My hunch is that this is related to the fact that a goroutine is spun off to respond with an Rread
, and that goroutine isn't getting done in time before session.Next
is called. I think this may be a problem because of your statement here:
// Next waits for the next Request for a 9P session. The next request for
// the session can be accessed via the Request method if and only if Next
// returns true. Any previous messages retrieved for the session should not
// be modified or responded to after Next is called; if they have not been
// answered, the styx package will send default responses for them. The
// default response for a message can be found in the comments for each
// message type. Next returns false if the session has ended or there was
// an error receiving the next Request.
If there's some funky memory-management going on as part of .Next, it may explain why - when a message is handled in between a Tread and an Rread - the Rread's tag number gets mangled. But, I haven't dug that deeply into that part of the code. Last, #16 may be related, as it seems to be a race relating to reads, if my skimming was correct.
Thanks!