mattn/mruby-onig-regexp

Trouble splitting string with Regexp

Closed this issue · 50 comments

rjst commented

Hi,

I'm having some issues splitting strings with Regexps in mruby-onig-regexp

The following:

"<%= 1 + 1 %>".split(/(<%=)|(%>)/)

doesn't split and returns ["<%= 1 + 1 %>"]. The same expression splits fine in MRI. I checked that mruby-onig-regexp substitutes String#split with a Regexp-aware implementation, so either I'm doing something seriously wrong or maybe there's a bug in mruby-onig-regexp

thanks,
Ricardo

rjst commented

Also, this returns nil, which is a bit strange:

/(%)/.match("<%= 1 + 1 %>")
rjst commented

Another thing that may be relevant, I'm getting a segfault with the following code (in mirb) adapted from the docs:

Regexp.compile('abc').match('abcdef')

I get following

Are you using mruby-require? Then do require 'mruby-onig-regexp'. If not, you will get same result as mine.

rjst commented

Hi,

Thanks for your answer.

Requiring mruby-onig-regexp does improve things, but not completely. For example, this no longer returns nil and matches as supposed:

Regexp.compile('abc').match('abcdef')

but

"<%= 1 + 1 %>".split(/(<%=)|(%>)/)

still doesn't split properly (returns an array with a single item instead of 4 items). The same expression works well in MRI.

thanks,
Ricardo

Ah, I understand your said just now. It seems broken. I'll fix in later.
Try to use mruby-pcre-regexp.

rjst commented

I tried that, but mruby-pcre-regexp isn't compiling at this time (against mruby/master). I get the following error:

build/mrbgems/mruby-pcre-regexp//src/mruby_pcre_regexp.c:168: error: ‘PCRE_CASELESS’ undeclared (first use in this function)

PCRE_CASELESS is defined as 0x00000001 in my pcre.h

Below is a part of version information.

#define PCRE_MAJOR          8
#define PCRE_MINOR          32
#define PCRE_PRERELEASE     
#define PCRE_DATE           2012-11-30
rjst commented

didn't have pcre properly installed, it's working now. it doesn't really give the same result as MRI, but does split the string (returns ["", " 1 + 1 ", ""]). I'll recheck later tonight and get back to you.

rjst commented

Hi,

PCRE does work better, but since the regular expressions I'm using were written for oniguruma, it doesn't split exactly how I need it to. since I'm porting existing software, it wouldn't be ideal to change the regexps.

the ideal for me would be to use oniguruma, if the bug above is fixed, otherwise I'll change the regexps to use PCRE (or I'll try fixing the oniguruma bug).

thanks,
Ricardo

Could you please write test case that you expected?

rjst commented

ok, I'll do that. I tried running mrbtest, but it segfaults. is there any way to fix it, or should I install something like https://github.com/iij/mruby-mtest ?

thanks,
Ricardo

Ah, I don't mind if you write simple script that reproduce it.

rjst commented

Hi,

This very basic script demonstrates the problem. If you run it in MRI it will work fine, in mruby with onig-regexp it will fail.

def test_split
  raise 'Split by regexp returned wrong results' unless '<%= 1 + 1 %>'.split(/(<%=)|(%>)/) == ['', '<%=', ' 1 + 1 ', '%>']
  'Split by regexp correct'
end

test_split

After I get the test suite running I'll add it as a proper test.

thanks,
Ricardo

Currently, you have two cases issue, isn't it? :)

rjst commented

Sorry mattn, couldn't understand your question :(

Sorry, I was thinking you has more problems, so I wonder it's enough
cases to check.

On 4/4/13, Ricardo Trindade notifications@github.com wrote:

Sorry mattn, couldn't understand your question :(


Reply to this email directly or view it on GitHub:
#4 (comment)

  • Yasuhiro Matsumoto
rjst commented

For now this is the only problem I'm having. If I find other problems with onig-regexp, I'll create test scripts for them.

thanks,
Ricardo

I fixed it. Please try in latest.

rjst commented

It does work, but only if mruby-require is being used. If mruby-require is not being included in the build, it still fails as before. This is not a problem for me as I can include mruby-require, but it doesn't appear to be 100% ok yet.

thanks,
Ricardo

Did you do require ?

mruby define String#split even if Regexp is disabled.

rjst commented

When not using mruby-require, I didn't require since require isn't defined. But in that case Regexp == OnigRegexp, so I'm assuming that onig-regexp gets included, no ?

  • include mruby-onig-regexp', not includemruby-require`
    should work correctly.
  • include mruby-onig-regexp', includemruby-require you need dorequire 'mruby-onig-regexp'`
rjst commented

This case doesn't work correctly for me:

  • include mruby-onig-regexp', not includemruby-require`
    should work correctly (in this case the array returned from splitting the string above has just one item, instead of 4)

In this case, mruby defines default String#split that Regexp disabled. It works but it treat pattern as string. So return non-splited strings.

rjst commented

So mruby-onig-regexp doesn't fully work unless it's used with mruby-require ?

It should work. :-(

rjst commented

I get the same results as in your screenshot when I include mruby-require in the build. That case works well for me. My problem is when I don't include mruby-require in the build.

You mean follow?

MRuby::Build.new do |conf|
  # load specific toolchain settings
  toolchain :gcc

  conf.gem "#{root}/mrbgems/mruby-sprintf"
  conf.gem "#{root}/mrbgems/mruby-print"
  conf.gem "#{root}/mrbgems/mruby-math"
  conf.gem "#{root}/mrbgems/mruby-time"
  conf.gem "#{root}/mrbgems/mruby-struct"
  conf.gem "#{root}/mrbgems/mruby-enum-ext"
  conf.gem "#{root}/mrbgems/mruby-string-ext"
  conf.gem "#{root}/mrbgems/mruby-numeric-ext"
  conf.gem "#{root}/mrbgems/mruby-array-ext"
  conf.gem "#{root}/mrbgems/mruby-hash-ext"
  conf.gem "#{root}/mrbgems/mruby-random"
  conf.gem "#{root}/mrbgems/mruby-eval"
  conf.gem :git => 'git@github.com:mattn/mruby-onig-regexp.git'
  #conf.gem :git => 'git@mattn/mruby-require.git'
end

I built mruby with build_config.rb above. And working good for me.

rjst commented

This is what happens to me - http://cl.ly/image/3g3C130e2g3B

Probably there is a difference in the mrbgems we're including, I'll compare both lists

Did you update mruby-onig-regexp?

rjst commented

I'm cleaning mruby/build between tests, so I'm getting a new checkout of mruby-onig-regexp each time. I'm also using latest mruby (just updated now).

I'm still getting the same result - http://cl.ly/image/1k14130s2U0y

What toolchain do you use? mingw? msys? visual studio?

Could you please compile & run follow?

#include <stdio.h>

int
main(int argc, char* argv[]) {
  puts(onig_version());
  return 0;
}

I get 5.9.3

rjst commented

I'm using gcc (on macOS X).

I'm also with oniguruma 5.9.3

Try following

onig.c

#include <stdio.h>
#include <memory.h>
#include <onigposix.h>

int
main(int argc, char* argv[]) {
  size_t nmatch = 999;
  regmatch_t match[nmatch];
  int regerr;
  int i;
  char err[256];
  char *ptr = "<%= 1 + 1 %>";
  regex_t reg;

  regerr = regcomp(&reg, "(<%=)|(%>)", REG_EXTENDED | REG_NEWLINE);
  if (regerr) {
    regerror(regerr, &reg, err, sizeof(err));
    regfree(&reg);
    puts(err);
    exit(1);
  }
  for (i = 0; i < nmatch; i++)
    match[i].rm_so = -1;
  regerr = regexec(&reg, "<%= 1 + 1 %>", nmatch, match, 0);
  if (regerr == REG_NOMATCH) {
    puts("no match");
    exit(1);
  }
  if (regerr) {
    regerror(regerr, &reg, err, sizeof(err));
    puts(err);
    exit(1);
  }

  for (i = 0; i < nmatch; i++) {
    if (match[i].rm_so != -1) {
      fwrite(ptr + match[i].rm_so, match[i].rm_eo - match[i].rm_so, 1, stdout);
      putchar('\n');
    }
  }
  regfree(&reg);
  return 0;
}
$ gcc -o onig onig.c -lonig
$ ./onig

I got

<%=
<%=

Hmm, it seems working for me. I can't understand.

> "a1b".split /\d/
 => ["a", "b"]

Is this working?

rjst commented

No, also doesn't work - http://cl.ly/image/3c3t2e0W2C3Z

At this point the only thing I can think of is the platform difference - mac vs windows. Do you have anyone near you with a Mac that can test ? I do have windows machines, but they don't have cygwin etc.. so it wouldn't be so easy (I can do it on the weekend if no other option exists)

But you get same result in onig.c, so I don't think platform differences.

rjst commented

Yes, you are right. I'll try again later tonight with a fresh checkout of mruby, to rule out any other issue.

Thanks for your work

On 4/5/13, Ricardo Trindade notifications@github.com wrote:

Yes, you are right. I'll try again later tonight with a fresh checkout of
mruby, to rule out any other issue.


Reply to this email directly or view it on GitHub:
#4 (comment)

  • Yasuhiro Matsumoto
rjst commented

Just tried again with a new git checkout, and getting the same behaviour.

  • With mruby-require, splitting works fine
  • When not using mruby-require, an array with a single element is returned, instead of the expected result

Could you please try to add printf("something\n"); code to make sure
passing C's functions?

On 4/12/13, Ricardo Trindade notifications@github.com wrote:

Just tried again with a new git checkout, and getting the same behaviour.

  • With mruby-require, splitting works fine
  • When not using mruby-require, an array with a single element is returned,
    instead of the expected result

Reply to this email directly or view it on GitHub:
#4 (comment)

  • Yasuhiro Matsumoto
rjst commented

I did that and found a difference. With mruby-require, the onig_regexp_match C function is called twice. Without mruby-require, it's only called once.

http://cl.ly/image/0o3k3X3k3S3D - The difference between both mirb sessions show shows this (there was a build in between)

Do you get follow?

> /\d/.match('a3d')[0]
 => "3"

Cases that match method return nil,

Invalid argument for match (ex match("a3d", -1))
https://github.com/mattn/mruby-onig-regexp/blob/master/src/mruby_onig_regexp.c#L93

Or no matches
https://github.com/mattn/mruby-onig-regexp/blob/master/src/mruby_onig_regexp.c#L106

First cast should be checked with

> /\d/.match("a3d", -1)
 => nil
> /\d/.match("a3d", 0)
 => #<OnigMatchData:0xe0e6f0>

And second case should be checked with

> //
 => #<OnigRegexp:0xe0ea98 @source="", @regexp=#<Object:0xe0ea80>>

@regexp shouldn't be nil

Probably, this issue was fixed in cc6b0b3