droyo/styx

Race condition - styx server occasionally responds to messages with a later message's tag number

marzhall opened this issue · 2 comments

Hi!

In using styx to work on a message queue, I've come across a situation in which Tag numbers are used erroneously that seems to happen when a number of different processes are polling on a file from one session.

The error

Using the logging, the error looks like this:

→ 001 Tread fid=3 offset=58 count=8110
→ 002 Tstat fid=4
← 002 Rstat type=0 dev=0 qid="type=0 ver=0 path=3" mode=755 atime=1573318800 mtime=1573318800 length=0 name=".kyuu_ctl" uid="
← 002 Rread count=0
→ 002 Tread fid=1 offset=0 count=8168
→ 002 Tclunk fid=1

Note the Tread with tag 001 is being responded to with a Rread with tag 002 - which seems to happen after a Tstat comes in with Tag 002 and runs between the Tread and the Rread.

The above offset results in an error in one of two ways:

  • When using 9pfuse, for a time the offsets will continue - e.g., 9pfuse sends a message with tag 002, and styx returns with 003 - until Styx finally gets fed up and throws an error stating "no one waiting on tag ".
  • When using mount -t 9p, styx throws an error "Client has reused a tag.:

That said, as it's a timing issue, I think it may be possible for it to show both behaviours on either 9p back-end.

Example

The above scenario was generated by running this repo, which uses the styx library, and then having two bash processes constantly cating a file exposed by the service:

$: kyuuD -D -v &
$: 9pfuse -D 127.0.0.1:5640 ~/kyuus
$: while true; do cat ~/kyuus/.kyuu_ctl; done &
$: while true; do cat ~/kyuus/.kyuu_ctl; done &

Almost instantly, you should get an error "9pfuse: no one waiting on tag ."

Notes:

I did some digging, and created a branch with something that makes things far more stable, which I show in the Pull request here: #19

My hunch is that this is related to the fact that a goroutine is spun off to respond with an Rread, and that goroutine isn't getting done in time before session.Next is called. I think this may be a problem because of your statement here:

// Next waits for the next Request for a 9P session. The next request for
// the session can be accessed via the Request method if and only if Next
// returns true. Any previous messages retrieved for the session should not
// be modified or responded to after Next is called; if they have not been
// answered, the styx package will send default responses for them. The
// default response for a message can be found in the comments for each
// message type. Next returns false if the session has ended or there was
// an error receiving the next Request.

If there's some funky memory-management going on as part of .Next, it may explain why - when a message is handled in between a Tread and an Rread - the Rread's tag number gets mangled. But, I haven't dug that deeply into that part of the code. Last, #16 may be related, as it seems to be a race relating to reads, if my skimming was correct.

Thanks!

droyo commented

Thank you for the detailed report, and sorry about taking so long to look at this. After reading this, the code, and #16, this is almost certainly the same issue as #16 .

droyo commented

I'm going to assume #22 resolves this issue. Please reopen if not.