Regexp#MatchData breaks when Japanese and English are mixed
ongaeshi opened this issue · 0 comments
OS: iOS (https://github.com/ongaeshi/ios-ruby-embedded/blob/master/Rakefile#L78)
App: RubyPico (https://github.com/ongaeshi/RubyPico)
I want to get the title of a web page, so I wrote the following script.
def title(url)
  txt = Browser.get url
  txt =~ /<title>(.*?)<\/title>/m
  $1
end
It works on an English-only page, but on a page where Japanese and English are mixed the captured string is wrong: it runs past the end of the title.
p title("https://github.com/iij/mruby-regexp-pcre")
# OK
#=> "GitHub - iij/mruby-regexp-pcre: Regexp for mruby (pcre version)"
p title("https://ja.wikipedia.org/wiki/%E3%83%81%E3%83%AB%E3%83%80")
# Failed
#=> "チルダ - Wikipedia</titl"
I created a static test to reproduce it.
# Head of https://ja.wikipedia.org/wiki/%E3%83%81%E3%83%AB%E3%83%80
str = <<EOS
<!DOCTYPE html>
<html class="client-nojs" lang="ja" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>チルダ - Wikipedia</title>
<script>
EOS
Result (is the match data broken? Both the whole match and group 1 run past their real end):
#<MatchData "<title>チルダ - Wikipedia</title>
<scri" 1:"チルダ - Wikipedia</titl">
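One possible reading of this result, which is only an assumption and not confirmed anywhere in this report: the end of the match is measured in UTF-8 bytes but then applied as a character count. The sketch below runs the same pattern on CRuby as a baseline and compares the overshoot in the reported capture with the difference between the byte length and the character length of the correct capture. The variable names (`expected`, `reported`, `extra_chars`, `extra_bytes`) are illustrative only.

```ruby
# Baseline check, intended for CRuby, using the fixture from the static test.
str = <<EOS
<!DOCTYPE html>
<html class="client-nojs" lang="ja" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>チルダ - Wikipedia</title>
<script>
EOS

m = /<title>(.*?)<\/title>/m.match(str)

expected = m[1]                        # capture as CRuby reports it
reported = "チルダ - Wikipedia</titl"   # capture copied from the result above

# Characters by which the reported capture runs past the correct one.
extra_chars = reported.length - expected.length

# Extra UTF-8 bytes in the correct capture compared with its character count.
# (Assumption: this is the quantity that leaks into the match length.)
extra_bytes = expected.bytesize - expected.length

puts "extra_chars=#{extra_chars} extra_bytes=#{extra_bytes}"
```

Both numbers come out as 6: the three katakana characters take 9 bytes in UTF-8, and the capture overshoots by exactly those 6 surplus units ("</titl"). The whole match overshoots by 6 characters as well (a newline plus "<scri"), which is consistent with the same explanation but does not prove it.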