skvadrik/re2c

Guidance on how to use re2c for full-duplex command & response protocol

Closed this issue · 9 comments

I'm attempting to utilize re2c in the context of a full-duplex communication protocol. A client is sending commands to the server, which I'd like to lex with re2c and then parse. The client may send many commands before waiting for responses or it might send commands one at a time and wait for a response between each one.

The issue I'm having is the one demonstrated in a number of the examples: YYMAXFILL is large enough relative to the lexemes that at the time YYFILL(n) is invoked, there is often an entire lexeme's worth of unprocessed input in the buffer. In most of the examples, this is dealt with in one of two ways. Either an additional YYMAXFILL characters of data is added to the end of the input data or YYFILL(n) is disabled and YYPEEK is redefined to provide a sentinel character for bytes after the end of input.

However, both of those strategies rely on being able to either add or fake garbage data after the end of the input. In the case of this protocol, the end of a command rarely corresponds with the end of input. Once the client sends a fully-formed command, it may choose to block indefinitely on a response from the server. For the server to form a response it must parse the request. To parse the request the lexer must emit the final lexeme in the command. But the lexer decided it needed more input, despite having the final lexeme in its input buffer. So the whole system deadlocks.

I can understand the desire to increase efficiency by decreasing the number of bounds checks, but in this case it's prohibiting correct behavior. Based on what I've read of the documentation, I should switch to the "generic API" so that I can disable YYFILL and redefine YYPEEK to do unbuffered character reads from the input. I was just wondering if there was a way to apply a heuristic and achieve a compromise. Ideally, if there is enough input available, the bounds check could occur only every YYMAXFILL characters. However when there's not enough input data immediately available then the lexer would switch to character-by-character bounds checking with the premise being that after a few characters a lexeme will be complete, the server will send a response, and then the client will send a new request (or hang up) and there will be more data available to lex.

Further question. When combining --storable-state with --input custom and re2c:yyfill:enable = 0; to make a push-lexer that does unbuffered reads, it looks like calls to YYSETSTATE(n) are only generated in places where YYFILL(n) would have been made had it not been disabled. Since YYSKIP() is being used to do unbuffered reads and returns if data isn't immediately available (the outer program will select until data is ready), a call to YYSETSTATE(n) is needed or else the lexer will return to the wrong state. Is there an additional setting to generate more YYSETSTATE(n) calls, or should I not define YYSKIP() in this way when generating a push-lexer?

Hmm, it's an interesting use case, somewhat unusual.

Indeed, --storable-state is broken with any input model except the default one with YYFILL() and YYMAXFILL. The generic API (--input custom) was added much later, and you are the first to try using it with --storable-state.

Can you post the protocol grammar, or even the whole lexer here? In some cases the problem turns out to be a bit easier on a closer look.

There is a new simplified way to check for the end of buffer that has been added recently and is in an experimental state (it hasn't been released yet). In this new model checks are generated in each state, but they depend on the input character, so performance is not as bad as checking on each YYPEEK() or YYSKIP(). In return the programmer is relieved of all end-of-input headaches. It's not difficult to get this working with --storable-state (although the initial switch will have more branches than in the YYFILL()/YYMAXFILL case). I've briefly tested it just now, and I'll try to push the necessary fix tomorrow (and send you an example of usage).

The grammar is pretty uninteresting as I haven't really even gotten around to converting it from flex. I'm still at the phase of finding out if re2c can do what I need it to or if I need a different tool. The protocol is ESMTP. Commands are (mostly) CRLF delimited text. As I mentioned, one or more commands can arrive before the client blocks on a response. Sometimes commands will get broken up over TCP packets and the server's built on non-blocking IO, hence the desire for a push-model lexer. Although most clients are other computers, the occasional human will connect over telnet and they're much slower to provide data. The hardest command to support might be BDAT nn CRLF, where the nn indicates a fixed-size block of binary data which I'll handle out of band of the lexer (another good reason to have a push-model lexer, preferably one with zero lookahead).

Below is a proof-of-concept modified from the examples. It currently doesn't work.

static enum status_t lex(lexer_t * lexer, uint8_t * words) {
  #define YYGETSTATE()  lexer->state
  #define YYSETSTATE(s) lexer->state = s
  #define YYPEEK() (lexer->c)
  #define YYSKIP() do { if (0 == fread(&lexer->c, 1, 1, stdin)) { return NEED_MORE_INPUT; } } while (0)
  #define YYBACKUP() do {} while (0)
  #define YYBACKUPCTX() do {} while (0)
  #define YYRESTORE() assert(false)
  #define YYRESTORECTX() assert(false)
  #define YYRESTORETAG(t) assert(false)
  #define YYSTAGP(t) assert(false)
  #define YYSTAGN(t) assert(false)
  YYSKIP();

  /*!re2c
      re2c:define:YYCTYPE  = char;
      re2c:variable:yych   = lexer->c;
      re2c:yyfill:enable = 0;

      *          { fprintf(stderr, "> unexpected character >%.1s<\n", &lexer->c); return FAIL; }
      [\x00]     { fprintf(stderr, "> Found EOF\n"); return OK; }
      "\r"?"\n"     { return C_CRLF; }
      "HELO"     { return C_HELO; }
      "EHLO"     { return C_EHLO; } 
      "QUIT"     { return C_QUIT; }
      "RSET"     { return C_RSET; }
      "DATA"     { return C_DATA; }
      "TURN"     { return C_TURN; }
      "HELP"     { return C_HELP; }
      "NOOP"     { return C_NOOP; }
      "STARTTLS" { return C_STARTTLS; }
      "EXAMPLE\nEXAMPLE" { return C_EXAMPLE; }
  */
}

If YYSKIP yields after the \n in EXAMPLE\n, it will end up back in state yy0 on the next invocation and will never finish matching C_EXAMPLE. (This is not a real use case, I'm just taking advantage of line buffering in the terminal to simulate TCP fragmentation).

Remember to throw the following early on in your main so that fread doesn't block.

fcntl(fileno(stdin), F_SETFL, fcntl(fileno(stdin), F_GETFL) | O_NONBLOCK);

And add the following in the handling of NEED_MORE_INPUT or all your fans will kick on :)

if (feof(stdin)) { return 0; }
fd_set fdset;
do {
  FD_ZERO(&fdset);
  FD_SET(fileno(stdin), &fdset);
  struct timespec timeout = (struct timespec) {.tv_sec = 5};
  pselect(fileno(stdin) + 1, &fdset, NULL, NULL, &timeout, NULL);
} while (!FD_ISSET(fileno(stdin), &fdset));

Focusing on correctness instead of speed, here's my hacky plan: Manipulate output of re2c in the following ways

  1. After each label, call YYSETSTATE(n) with that state (also patch dispatch table w/ new states)
  2. Before each in.yych = *++in.cur call, add a if (1 == (in.lim - in.cur)) YYFILL(1); check
  3. Before returning from the lex function, set the state to zero.

The rationale is to:

  1. Ensure the state of the lexer is recorded before potentially calling YYFILL(n).
  2. Ensure that the end-of-input limit is always checked prior to dereferencing the next character, invoke YYFILL(1) if needed.
  3. Since returning and re-invoking lex results in a slightly different code path than goto loop; (the label after /*!getstate:re2c*/), ensure that the state after returning a token is 0.

After some testing there still seem to be some edge cases I'm missing. I think my rationale for number 3 is slightly off, but it's late enough in this timezone that I'm going to wait till another day to rethink it.

Prepare for incoming hacky shell commands:

# Record lexer state, add boundary condition check
$ sed -i -re 's/^yy([0-9]+):/yy\1: YYSETSTATE(\1); if (1 == (in.lim - in.cur)) YYFILL(1);/' /tmp/nonblocking_push.cc

# Fix up dispatch table to include additional states
$ sed -i -e 's/switch (YYGETSTATE()) {/switch (YYGETSTATE()) {\n'"$(
  comm -23 \
    <(egrep -o '^yy[0-9]+' /tmp/nonblocking_push.cc 
      | tr -d 'y' 
      | sort) \
    <(grep 'goto yyFillLabel' /tmp/nonblocking_push.cc 
      | sed -re 's/^case ([0-9]+).*$/\1/' 
      | sort) \
  | while read state;
    do
      printf "case %s: goto yy%s;%s" "$state" "$state" '\n'
    done)"'/' /tmp/nonblocking_push.cc

# Set state to 0 before returning a lexeme
$ sed -i -re 's/return (OK|FAIL|WHITESPACE|WORD|THING);/YYSETSTATE(0); return \1;/' /tmp/nonblocking_push.cc

The lexer definition file is below. It includes a short main function that simulates TCP packet fragmentation in a way that tries to break the lexer's assumptions. As I mentioned above, it currently succeeds in breaking the lexer. However, if you replace all the return (WHITESPACE|WORD|THING) lines with goto loop and don't run the hacky shell fixups, it mostly works (without of course returning the lexemes).

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/select.h>
#include <assert.h>

/*!max:re2c*/
static const size_t SIZE = 4096;

struct input_t {
  char buf[SIZE + YYMAXFILL];
  char *lim;
  char *cur;
  char *tok;
  char *mark;
  int state;
  unsigned need;
  unsigned yyaccept;
  char yych;
  FILE *file;

  input_t(FILE * file)
          : buf()
  , lim(buf + SIZE)
  , cur(lim)
  , tok(lim)
  , mark(lim)
  , state(-1)
  , need(0)
  , yyaccept(0)
  , yych(0)
  , file(file)
  {}

  bool fill()
  {
    const size_t free = (tok - buf) + SIZE - (lim - buf);
    if (free < need) return false;
    const size_t prefix = (tok - buf);

    memmove(buf, tok, buf - tok + SIZE);
    lim -= prefix;
    cur -= prefix;
    tok -= prefix;
    mark -= prefix;
    size_t to_read = SIZE - (lim - buf);
    printf("> Reading input, can take up to %lu bytes, need %u bytes to satisfy lexer\n", to_read, need);
    size_t bytes_read = fread(lim, 1, to_read, file);
    lim += bytes_read;

    // quickly make a copy of the buffer with newlines replaced with '_'
    char b[40] = {0}; // zero-init so the loop below never reads uninitialized bytes
    snprintf(b, 40, "%s", buf);
    for(int i = 0; i < 40; i++) {
      if ('\n' == b[i]) { b[i] = '_'; }
    }
    printf("> Read %lu bytes from input, current buffer: >%.40s<\n", bytes_read, b);

    if (lim < buf + SIZE) {
      memset(lim, 0, YYMAXFILL);
    }
    if (feof(file)) {
      lim += YYMAXFILL;
    }
    return true;
  }
};

enum status_t { OK, FAIL, NEED_MORE_INPUT, WHITESPACE, WORD, THING };
const char * STATUSES[] = { "OK", "FAIL", "NEED_MORE_INPUT", "WHITESPACE", "WORD", "THING" };

static status_t lex(input_t &in)
{
#   define YYGETSTATE()  in.state
#   define YYSETSTATE(s) in.state = s
#   define YYFILL(n)     do { in.need = n; printf("< Returning for more input, want %lu bytes\n", n); return NEED_MORE_INPUT; } while (0)
/*!getstate:re2c*/
loop:
in.tok = in.cur;
/*!re2c
    re2c:define:YYCTYPE  = char;
    re2c:define:YYCURSOR = in.cur;
    re2c:define:YYLIMIT  = in.lim;
    re2c:define:YYMARKER = in.mark;
    re2c:variable:yych   = in.yych;

    *                       { printf("< Unexpected character >%.1s<\n", &in.yych); return FAIL; }
    [\x00]                  { printf("< EOF\n");                                  return OK; }
    [\n ]+                  { printf("< whitespace\n");                           return WHITESPACE; }
    [a-zA-Z]+               { printf("< word\n");                                 return WORD; }
    "THING\nWITH\nNEWLINES" { printf("< Thing w/ newlines\n");                    return THING; }
*/
}

int main()
{

  int fds[2];
  pipe(fds);
  fcntl(fds[0], F_SETFL, fcntl(fds[0], F_GETFL) | O_NONBLOCK);
  FILE * write = fdopen(fds[1], "w");
  FILE * read = fdopen(fds[0], "r");
  input_t in(read);

  const char * packets[] = {
          "THING\n",
          "WITH\n",
          "NEWLINES\n",
          "H", "E", "L", "O", "\n",
          "HELO\n",
          "THING\nWITH\n",
          "NEWLINES"
  };
  size_t num_packets = sizeof(packets) / sizeof(char *);
  enum status_t result;

  size_t current_packet = 0;
  do {
    result = lex(in);
    printf("Received response from lexer: %s\n", STATUSES[result]);
    switch (result) {
      case NEED_MORE_INPUT:
        printf("Packet %lu written\n", current_packet);
        fwrite(packets[current_packet], strlen(packets[current_packet]), 1, write);
        current_packet++;
        fflush(write);
        in.fill();
        break;
      case FAIL:
        return 1;
    }
  } while (OK != result && current_packet < num_packets);

  result = lex(in);
  printf("Received response from lexer: %s\n", STATUSES[result]);
  printf("%lu / %lu packets sent, Closing down communication channel\n", current_packet, num_packets);

  if (current_packet == num_packets) {
    fclose(write);
    in.fill();
    result = lex(in);
    printf("Received response from lexer: %s\n", STATUSES[result]);
  }

  return 0;
}
//$ re2c -i --storable-state -o nonblocking_push.cc nonblocking_push.re
//$ # Run all the hacky shell commands here
//$ g++ -g nonblocking_push.cc -o nonblocking_push

I have patched both re2c (9c85f86) and your example: nonblocking_push.re.txt and generated nonblocking_push.cc.txt, and I now have the desired behaviour. No padding is needed: only the terminating EOF character appended at the end of input. The lexer keeps asking for more input until it finds the EOF lexeme or fails on ill-formed input. Tail of the log:

$ ./nonblocking_push | tail -n 7
11 / 11 packets sent, Closing down communication channel
> Reading input, can take up to 4055 bytes
> Read 8 bytes from input, current buffer: >THING_WITH_NEWLINES_HELO_HELO_THING_WIT<
< Thing w/ newlines
Received response from lexer: THING
< EOF
Received response from lexer: OK

I used the new re2c feature, the re2c:eof = 0; configuration. This basically tells re2c that whenever the lexer sees the EOF character, it must check whether it is a genuine character or a need-more-input situation. In the latter case the lexer must call YYFILL, and if that succeeds, re-read the last character and resume lexing. EOF is set to 0 in our case, but it can be anything as long as it matches the appended terminating character (lim[0] = 0; in the fill definition).

The re2c:eof feature is new and there might be bugs/oversights in the implementation: for example, you'll get multiple complaints about unused labels from the C++ compiler. If you are exploring other tools, I suggest looking at Quex (I don't know if it supports push model though).

Focusing on correctness instead of speed, here's my hacky plan [...]

This is a good plan, quite re2c-way (but hacky indeed). But I think it wouldn't work because the final program model used by re2c is not exactly a DFA, it's a so-called tunnel automaton. This is one of the optimizations used to reduce the size of the code: in the tunnel automaton not all states are consuming, some states just dispatch on characters and forward to other states, without advancing the input cursor (hence the name "tunnel"). States end up being split into a consuming head and a dispatching body (though not all of them).

Or perhaps your plan would work (it would just save a lot of unnecessary things), but there is another thing that needs to be done: when a lexeme is successfully lexed (pardon the tautology), the lexer must reset its state to the initial state. I fixed this in the above code.

I also deduplicated repeated blocks in main.

Your expediency is laudable. Where do I send donations?

Eh, but we only accept donations in the form of pull-requests. :D
(And you already sent one, so consider it sorted out.)

Well thanks a million anyway. When I have the full lexer ready, I'll send you a new test case for your suite.

@mwberry
As of commit 40951cd, there are some changes in how to use the EOF rule with the -f (--storable-state) option. I've been revisiting various possible usage scenarios before releasing re2c-1.2, and I found and fixed some inconsistencies. The following changes are needed:

  • add EOF rule $ instead of [\x00] rule (EOF symbol is the one specified with re2c:eof configuration)
  • don't increment limit after reading the last byte (what you commented as "simulate the END packet (can as well send a normal packet)")
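Under these rules, the relevant part of the spec might look like the following fragment (illustrative only, not the attached file):

```c
/*!re2c
    re2c:define:YYCTYPE = char;
    re2c:eof = 0;          // must match the sentinel byte written at lim[0]

    $          { return OK; }      // EOF rule, replacing the [\x00] rule
    "\r"? "\n" { return C_CRLF; }
    "QUIT"     { return C_QUIT; }
*/
```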

I'm attaching the fixed file: nonblocking_push.fi.re.txt, diff it against the one attached in the previous comment. Re2c is normally stable after release --- you've just happened to be one of the alpha users.