Regexp#MatchData breaks when Japanese and English are mixed

Question

ongaeshi opened this issue 8 years ago · 0 comments

OS: iOS (https://github.com/ongaeshi/ios-ruby-embedded/blob/master/Rakefile#L78)
App: RubyPico (https://github.com/ongaeshi/RubyPico)

I want to get the title of a web page, so I wrote the following script.

def title(url)
  txt = Browser.get url
  txt =~ /<title>(.*?)<\/title>/m
  $1
end

It works on an English page, but on a page where Japanese and English are mixed the captured title runs past the closing tag.

p title("https://github.com/iij/mruby-regexp-pcre")
# OK
#=> "GitHub - iij/mruby-regexp-pcre: Regexp for mruby (pcre version)"

p title("https://ja.wikipedia.org/wiki/%E3%83%81%E3%83%AB%E3%83%80")
# Failed
#=> "チルダ - Wikipedia</titl"

I created a static test to reproduce it.

# Head of https://ja.wikipedia.org/wiki/%E3%83%81%E3%83%AB%E3%83%80
str = <<EOS
<!DOCTYPE html>
<html class="client-nojs" lang="ja" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>チルダ - Wikipedia</title>
<script>
EOS

Result (is the match data broken?):

#<MatchData "<title>チルダ - Wikipedia</title>
<scri" 1:"チルダ - Wikipedia</titl">
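One observation about the result above: both the whole match and capture 1 are exactly 6 characters too long, and "チルダ" is 3 characters but 9 bytes in UTF-8, which is also a difference of 6. That would be consistent with a byte length being used as a character count somewhere, though this is only a guess and has not been checked against the mruby-regexp-pcre source. The sketch below runs on CRuby (not RubyPico) and only reproduces the arithmetic; the variable names are made up for illustration.

```ruby
# Illustration only: runs on CRuby and does not use mruby-regexp-pcre.
str = "<title>チルダ - Wikipedia</title>\n<script>\n"

# What the capture should be.
expected = str[/<title>(.*?)<\/title>/m, 1]

char_start = str.index("<title>") + "<title>".length  # character offset of the capture
byte_len   = expected.bytesize                        # length in UTF-8 bytes
char_len   = expected.length                          # length in characters

# Assumed bug: the byte length is applied as a character count.
broken = str[char_start, byte_len]

p expected             #=> "チルダ - Wikipedia"
p broken               #=> "チルダ - Wikipedia</titl"
p byte_len - char_len  #=> 6
```

The `broken` string matches capture 1 in the reported MatchData, which is why a byte/character mix-up looks like a plausible place to start investigating.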