mattbucknall/subreg

Adding support for range matches such as [0-9a-z] (bug fix solution)

Closed this issue · 9 comments

Hi,

I'm a PhD student who is trying to leverage your subreg library to build a regex app in SGX. I think your library is great because it supports capture groups. However, I realized your library doesn't allow the regular expression syntax using square brackets, such as [0-9a-z]. I wonder if the range matches like this is innately not supported by your code.... If so, should we implement it by ourselves?

At the end of the the following code snippet from sugreg.c's parse_literal(state_t* state) function:

  else if ( rc == '\\' )
   {
      rc = state->regex[0];
      if ( is_end(rc) ) return SUBREG_RESULT_INVALID_METACHARACTER;

      switch (rc)
      {
      case 'D':   result = invert_match(match_digit(c));  if (!c) result = 0;    break;
      case 'H':   result = invert_match(match_hexadecimal(c)); if (!c) result = 0;  break;
      case 'S':   result = invert_match(match_whitespace(c)); if (!c) result = 0;    break;
      case 'W':   result = invert_match(match_word(c));  if (!c) result = 0;     break;
      case 'd':   result = match_digit(c);                  break;
      case 'h':   result = match_hexadecimal(c);              break;
      case 's':   result = match_whitespace(c);             break;
      case 'w':   result = match_word(c);                 break;

      default:
         result = decode_non_class_metacharacter(state, &rc);
         if ( is_bad_result(result) ) return result;

         result = match_char(c, rc);
         if ( is_match_result(result) ) state->input++;

         return result;
      }

      state->regex++;
   }

Add the following code:

   else if (rc == '[')
   {
      char* char_match_list = NULL;
      int char_match_num = 0;
      while (1)
      {  rc = (state->regex++)[0];
         if (!rc)
         {  if (char_match_list)
               free(char_match_list);
            return SUBREG_RESULT_MISSING_BRACKET;
         }
         else if (rc == '\\')
         {  rc = (state->regex++)[0];
            if (!rc)
            {  if (char_match_list)
                  free(char_match_list);
               return SUBREG_RESULT_MISSING_BRACKET;
            }
            char_match_list = realloc(char_match_list, sizeof(char) * ++char_match_num);
            char_match_list[char_match_num - 1] = rc;
         }
         else if (rc == '-')
         {  rc = (state->regex++)[0];
            if (!rc)
            {  if (char_match_list)
                  free(char_match_list);
               return SUBREG_RESULT_MISSING_BRACKET;
            }
            int batch_num = rc - char_match_list[char_match_num - 1];
            if (batch_num < 0)
            {  printf("Error: start-char is bigger than the end-char\n");
               if (char_match_list)
                  free(char_match_list);
               return SUBREG_RESULT_MISSING_BRACKET;
            }
            char start_char = char_match_list[char_match_num - 1];
            char_match_list = realloc(char_match_list, sizeof(char) * (char_match_num + batch_num));
            for (int i = 0; i < batch_num; i++)
            char_match_list[char_match_num + i] = start_char + 1 + i;
            char_match_num += batch_num;
            if (batch_num < 0)
            {  printf("Error: start-char is bigger than the end-char\n");
               if (char_match_list)
                  free(char_match_list);
               return SUBREG_RESULT_MISSING_BRACKET;
            }
            char start_char = char_match_list[char_match_num - 1];
            char_match_list = realloc(char_match_list, sizeof(char) * (char_match_num + batch_num));
            for (int i = 0; i < batch_num; i++)
            char_match_list[char_match_num + i] = start_char + 1 + i;
            char_match_num += batch_num;
         }
         else if (rc == ']')
         {  break;
         }
         else
         {  char_match_list = realloc(char_match_list, sizeof(char) * ++char_match_num);
            char_match_list[char_match_num - 1] = rc;
         }
      }
      //printf("Total char num: %d\n", char_match_num);
      result = 0;
      for (int i = 0; i < char_match_num; i++)
      {  if (char_match_list[i] == c)
         {  result = 1;
            break;
         }
         //printf("%c ", char_match_list[i]);
      }
      //printf("\n");
      if (char_match_list)
         free(char_match_list);
   }

Hi,

I'm glad you're finding the library useful. Indeed, Subreg does not currently handle sets or ranges, although it would be a nice feature to have.

Thank-you for the code snippet above - Unfortunately, I cannot merge it as-is because it makes use of dynamic memory allocation. One of Subreg's features is that it does not perform any dynamic memory allocation and I don't want that to change - I originally wrote subreg for use in embedded systems where malloc etc. is undesirable. printf isn't allowed either, although I appreciate you're only using it here for illustrative purposes.

The solution is to directly interpret each set/range on-the-fly as it is encountered in the regular expression. I'll have a think about how best to do that.

Matt.

Oh I see, sorry for not being aware of that aspect... We can avoid using malloc() by using a static char array of 256 buckets because the maximum number of possible values in a square bracket expression [] is 256. Here is the version of code that doesn't use malloc().

   else if (rc == '[')
   {
      char char_match_list[256];
      int char_match_num = 0;
      char last_char;
      while (1)
      {  if (char_match_num > 256)
         {  //printf("Error: char_match_num > 256: %d\n", char_match_num);
            return SUBREG_RESULT_MISSING_BRACKET;
         }
         last_char = rc;
         rc = (state->regex++)[0];
         if (!rc)
            return SUBREG_RESULT_MISSING_BRACKET;
         else if (rc == '\\')
         {  rc = (state->regex++)[0];
            if (!rc)
               return SUBREG_RESULT_MISSING_BRACKET;
            char is_redundant = 0;
            for (int i = 0; i < char_match_num; i++)
            {  if (char_match_list[i] == rc)
               {  is_redundant = 1;
                  break;
               }
            }
            if (is_redundant)
               continue;
            char_match_list[char_match_num++] = rc;
         }
         else if (rc == '-')
         {  rc = (state->regex++)[0];
            if (!rc)
               return SUBREG_RESULT_MISSING_BRACKET;
            int batch_num = rc - last_char;
            if (batch_num < 0)
            {  //printf("Error: start-char is bigger than the end-char\n");
               return SUBREG_RESULT_MISSING_BRACKET;
            }
            for (int i = 0; i < batch_num; i++)
            {
               char is_redundant = 0;
               for (int j = 0; j < char_match_num; j++)
               {  if (char_match_list[j] == last_char + 1 + i)
                  {  is_redundant = 1;
                     break;
                  }
               }
               if (is_redundant)
                  continue;
               char_match_list[char_match_num++] = last_char + 1 + i;
            }
         }
         else if (rc == ']')
         {  break;
         }
         else
         {  char is_redundant = 0;
            for (int i = 0; i < char_match_num; i++)
            {  if (char_match_list[i] == rc)
               {  is_redundant = 1;
                  break;
               }
            }
            if (is_redundant)
               continue;
            char_match_list[char_match_num++] = rc;
         }
      }
      result = 0;
      for (int i = 0; i < char_match_num; i++)
      {  if (char_match_list[i] == c)
         {  result = 1;
            break;
         }
      }
   }

Hi - Again, because of SubReg's origins (designed for use in embedded systems with limited resources), I'm trying to keep memory usage light. Allocating 256 bytes on the stack (possibly recursively) is too excessive. Of course, if that works for your application, great!... The beauty of open-source is that you can tailor the code to your needs.

When I have some spare time, I'll look at adding ranges and sets in a way that does not require an LUT.

Sorry again.. You can allocate "char char_match_list[256];" as a single global array instead from the above code, in which case you will statically allocate only extra 256 bytes when you run the program. I believe we can do this because there should be no recursiveness within a square bracket.

I'm reopening this to remind myself to implement support at some point.

I'm reopening this to remind myself to implement support at some point.

Hi, what's the status of that?

I just found this library, and besides admiring its minimalism, I immediately noticed the lack of sets and ranges. I have always used them, even if when they could be replaced with \d, \D, \w, \W, etc – ranges are just more readable.

Let me know if you still remember this. In case you no longer plan to do this, I could try implementing this, but no promises, obviously.

Turned out to be a quick fix, here's the merge request!

Thanks for this - Haven't had a chance to review it yet. I'll get to it at the weekend.