Chezure
Chez Scheme bindings for Rust's regular expression API.
Documentation is still under construction, and the APIs may change.
Installation
You can either download the pre-compiled binaries from the release page or build the library yourself by running `build.ss`. Remember to tell Chez Scheme where the `chezure` and `chez-finalize` libraries are, for example:
> scheme --libdirs '$PROJECT:$PROJECT/chez-finalize' --script build.ss
`$PROJECT` is the path to the source code.
Usage
Every regular expression must first be compiled with `chezure-compile`:
(import (chezure)) ;; again, don't forget to set up the library path for Chez Scheme
(define re (chezure-compile "[0-9]+")) ;; => a compiled regular expression object
Now you can use the compiled pattern to search for matches:
(define matches (chezure-find re "abc123def456")) ;; a list of chezure-match objects
(chezure-match->alist (car matches))
;; => ((start . 3) (end . 6) (str . "123"))
(chezure-match->alist (cadr matches))
;; => ((start . 9) (end . 12) (str . "456"))
A `chezure-match` object records the span information of the matched substring. Chezure also supports Unicode strings:
(define re (chezure-compile "**"))
(define matches (chezure-find re "**是亚洲国家,盐城是**的一个城市")) ;; a list of chezure-match objects
(chezure-match->alist (car matches))
;; => ((start . 0) (end . 2) (str . "**"))
(chezure-match->alist (cadr matches))
;; => ((start . 11) (end . 13) (str . "**"))
Chezure also implements capturing groups, but matching with them is slower than matching with ordinary patterns. As Rust's regex documentation states:
Computing the capture groups of a match can carry a significant performance penalty, so their use in the API is optional.
(define re (chezure-compile "(?P<year>\\d{4})-(?P<month>\\d{2})-(?P<day>\\d{2})"))
(captures-names re) ;; => ("year" "month" "day")
(define all-captures (chezure-find-captures re "2019-08-17, 1884-10-01")) ;; => a list of captures
(define caps (car all-captures)) ;; select the first captures object
(captures-ref caps 0) ;; => #<chezure-match start=0, end=10, str="2019-08-17">
(captures-ref caps 1) ;; => #<chezure-match start=0, end=4, str="2019">
(captures-ref caps "year") ;; => #<chezure-match start=0, end=4, str="2019">
;; reference by either a list or vector of names or indices
(captures-ref caps '(0 1 "month" "day"))
;; => (#<chezure-match start=0, end=10, str="2019-08-17">
;; #<chezure-match start=0, end=4, str="2019">
;; #<chezure-match start=5, end=7, str="08">
;; #<chezure-match start=8, end=10, str="17">)
(captures-ref caps (vector "year" "month" "day"))
;; => #(#<chezure-match start=0, end=4, str="2019">
;; #<chezure-match start=5, end=7, str="08">
;; #<chezure-match start=8, end=10, str="17">)
If you just want to access the captured string, use `captures-string-ref`:
(captures-string-ref caps 0) ;; => "2019-08-17"
(captures-string-ref caps "year") ;; => "2019"
(captures-string-ref caps '("year" "month" "day")) ;; => ("2019" "08" "17")
Finally, `split` and `replace` are implemented:
(define re (chezure-compile "[0-9]+"))
;;; split
(chezure-split re "abc123def") ;; => ("abc" "def")
(chezure-split re "abc123def456" 1) ;; split only once (maximum), 0 means no limit
;; => ("abc" "def456")
(chezure-split re "abc123def456" 0 #t) ;; preserve the matched substrings
;; => ("abc" "123" "def" "456" "")
(chezure-split re "abc123def456" 0 #t #t) ;; remove empty strings
;; => ("abc" "123" "def" "456")
;;; replace
(chezure-replace re "abc123def" "<NUMBER>") ;; => "abc<NUMBER>def"
(chezure-replace re "abc123def456" "<NUMBER>" 1) ;; replace only once (maximum), 0 means no limit
;; => "abc<NUMBER>def456"
;; the third argument (replacement) can also be a procedure that takes exactly one argument
;; when it is a procedure, it is applied to the current captures object and must return a string, which becomes the actual replacement
;; since it needs to manipulate captures, `chezure-find-captures` is used internally, which incurs a performance cost
(define re (chezure-compile "(?P<year>\\d{4})-(?P<month>\\d{2})-(?P<day>\\d{2})"))
(define (repl caps)
  (if (string=? (captures-string-ref caps "year") "2019")
      "<NOW>"
      "<PAST>"))
(chezure-replace re "2019-10-03, 1900-10-20" repl) ;; => "<NOW>, <PAST>"
APIs
chezure-compile
procedure: (chezure-compile pattern)
procedure: (chezure-compile pattern flags)
procedure: (chezure-compile pattern flags options)
returns: a `chezure` object holding the compiled regular expression
library: (chezure)
`pattern` must be a string. `flags` must be a list of valid flags:
- ignorecase, as the case insensitive (i) flag.
- multiline, as the multi-line matching (m) flag. (^ and $ match new line boundaries.)
- dotnl, as the any character (s) flag. (. matches new line.)
- swap-greed, as the greedy swap (U) flag. (e.g., + is ungreedy and +? is greedy.)
- space, as the ignore whitespace (x) flag.
- unicode, as the Unicode (u) flag.
`options` must be a list of arguments (`size-limit` and `dfa-size-limit`, in that order) that configure the regular expression compiler. See Rust's documentation for `size-limit` and `dfa-size-limit`.
If there was a problem compiling the pattern, an error with debug information will be raised.
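A minimal sketch of compiling with a flag and of handling a compilation error. It assumes that flags are passed as quoted symbols, which the list above suggests but does not state outright; `compiles?` is a hypothetical helper introduced only for this illustration.

```scheme
(import (chezure))

;; ASSUMPTION: flags are given as quoted symbols, e.g. '(ignorecase).
(define re-ci (chezure-compile "hello" '(ignorecase)))
(chezure-has-match? re-ci "Say HELLO") ;; expected to be #t if the flag is applied

;; Hypothetical helper: #t if the pattern compiles, #f if compiling raises.
(define (compiles? pattern)
  (guard (e [#t #f])
    (chezure-compile pattern)
    #t))

(compiles? "[0-9]+")    ;; => #t
(compiles? "(unclosed") ;; expected to be #f, since the group is never closed
```

Wrapping the call in `guard` is just one way to recover from the raised error; letting it propagate is equally valid when a bad pattern is a programming mistake.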
chezure?
The predicate for the `chezure` type.
chezure-match
procedure: (chezure-match? x)
procedure: (chezure-match-name x)
procedure: (chezure-match-start x)
procedure: (chezure-match-end x)
procedure: (chezure-match-str x)
procedure: (chezure-match->alist x)
library: (chezure)
A `chezure-match` object records information about the matched substring. `chezure` exports procedures to access its fields.
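As a brief illustration of the accessors, the sketch below reuses the pattern and input from the Usage section. The values in the comments are taken from the `chezure-match->alist` output shown there; it assumes the library path is set up so that `(import (chezure))` succeeds.

```scheme
(import (chezure))

(define re (chezure-compile "[0-9]+"))
(define m (car (chezure-find re "abc123def456"))) ; the first match

(chezure-match? m)      ;; => #t
(chezure-match-start m) ;; => 3
(chezure-match-end m)   ;; => 6
(chezure-match-str m)   ;; => "123"
```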
chezure-has-match?
procedure: (chezure-has-match? chezure str)
procedure: (chezure-has-match? chezure str start)
returns: a boolean indicating if there exists a match
library: (chezure)
`chezure-has-match?` returns `#t` if and only if `chezure` matches anywhere in the given string `str`. `start` is the position at which to start searching, so it must be a non-negative fixnum. If `start` is not given, it defaults to `0`.
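A short sketch of the two call forms, assuming the library path is set up. The last call is left without a definite result because it depends on how `start` is interpreted; for ASCII input, character and byte positions coincide.

```scheme
(import (chezure))

(define re (chezure-compile "[0-9]+"))

(chezure-has-match? re "abc123") ;; => #t
(chezure-has-match? re "abcdef") ;; => #f

;; Start searching at position 3. The only digits are before that
;; position, so this is expected to be #f.
(chezure-has-match? re "123abc" 3)
```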
chezure-shortest-match
procedure: (chezure-shortest-match chezure str)
procedure: (chezure-shortest-match chezure str start)
returns: a non-negative fixnum or a boolean
library: (chezure)
`chezure-shortest-match` returns `#f` if and only if `chezure` matches nowhere in the given string `str`. Otherwise, it returns the end location of the match in `str`. The end location is the place at which the regex engine determined that a match exists, which may be before the end of the proper leftmost-first match.
`start` is the position at which to start searching, so it must be a non-negative fixnum. If `start` is not given, it defaults to `0`.
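The difference from an ordinary search can be sketched as follows. The exact end position is deliberately not asserted: the engine may stop as soon as it knows a match exists (Rust's documentation reports 1 for the equivalent call on this input), so the sketch only checks that the result lies within the full match.

```scheme
(import (chezure))

(define re (chezure-compile "a+"))

;; No "a" anywhere in the input, so no match is possible.
(chezure-shortest-match re "bbb") ;; => #f

;; A match exists; the reported end lies somewhere in [1, 5].
(define end (chezure-shortest-match re "aaaaa"))
(and (fixnum? end) (<= 1 end 5))
```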
chezure-find
procedure: (chezure-find chezure str)
procedure: (chezure-find chezure str limit)
returns: a list of `chezure-match`, if any
library: (chezure)
`chezure-find` returns a list of `chezure-match` objects if `chezure` matches anywhere in the given string `str`. `limit` sets the maximum number of `chezure-match` objects collected; `0` means no limit at all.
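The effect of `limit` can be sketched with the input from the Usage section, where the pattern matches twice. This assumes the library path is set up.

```scheme
(import (chezure))

(define re (chezure-compile "[0-9]+"))

(length (chezure-find re "abc123def456"))   ;; => 2
(length (chezure-find re "abc123def456" 1)) ;; => 1, stops after the first match
(length (chezure-find re "abc123def456" 0)) ;; => 2, 0 means no limit
```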
captures
procedure: (captures? x)
library: (chezure)
A `captures` object represents capturing groups in `chezure`. Internally, it records all group names, captured matches, and how to access those matches by either index or name.
captures-names
procedure: (captures-names x)
returns: a list of group names
library: (chezure)
`x` must be either a `chezure` or a `captures` object. `captures-names` returns a list of group names:
(define re (chezure-compile "(?P<first_name>\\w+) (?P<last_name>\\w+)"))
(captures-names re) ;; => ("first_name" "last_name")
chezure-find-captures
procedure: (chezure-find-captures chezure str)
procedure: (chezure-find-captures chezure str limit)
returns: a list of found `captures`
library: (chezure)
`chezure-find-captures` returns a list of `captures` objects if `chezure` finds capturing groups anywhere in the given string `str`. `limit` sets the maximum number of `captures` objects collected; `0` means no limit at all.
captures-ref
procedure: (captures-ref caps indices)
returns: the referenced `chezure-match` object(s)
library: (chezure)
`caps` must be a `captures` object. `indices` must be one of these types:
- a non-negative fixnum
- a string
- a list of non-negative fixnums or strings
- a vector of non-negative fixnums or strings
When a non-negative fixnum or a string is given, `captures-ref` returns the corresponding `chezure-match` object. If a list or a vector is given, all contained indices will be mapped (by either `map` or `vector-map`) to the corresponding `chezure-match` objects.
If any index is invalid, an error will be raised.
captures-string-ref
procedure: (captures-string-ref caps indices)
returns: the `str` field(s) of the corresponding `chezure-match` object(s)
library: (chezure)
Like `captures-ref`, but returns the `str` field(s) of the corresponding `chezure-match` object(s).
chezure-compile-set
procedure: (chezure-compile-set patterns)
procedure: (chezure-compile-set patterns flags)
procedure: (chezure-compile-set patterns flags options)
returns: a `chezure-set` object
library: (chezure)
Rust's regex provides an API to compile a set of patterns. `patterns` must be a list of strings. `flags` and `options` are handled the same way as in `chezure-compile`. It returns a `chezure-set` object.
chezure-set?
The predicate for the `chezure-set` type.
chezure-set-has-match?
procedure: (chezure-set-has-match? chezure-set str)
procedure: (chezure-set-has-match? chezure-set str start)
returns: a boolean indicating if there exists a match
library: (chezure)
`chezure-set-has-match?` returns `#t` if and only if any pattern in `chezure-set` matches anywhere in the given string `str`. `start` is the position at which to start searching, so it must be a non-negative fixnum. If `start` is not given, it defaults to `0`.
chezure-set-matches
procedure: (chezure-set-matches chezure-set str)
procedure: (chezure-set-matches chezure-set str start)
returns: a list of booleans
library: (chezure)
`chezure-set-matches` compares each regex in the pattern set against the given string `str` and returns a list of booleans indicating the match result of each pattern.
Booleans are ordered in the same way as the patterns from which the `chezure-set` was compiled. For example, index 0 of the result corresponds to the first pattern passed to `chezure-compile-set`.
`start` is the position at which to start searching, so it must be a non-negative fixnum. If `start` is not given, it defaults to `0`.
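A small sketch of how the result lines up with the patterns, assuming the library path is set up. The three patterns are chosen for illustration only: the first two occur in the input and the third, anchored at the start, does not.

```scheme
(import (chezure))

(define set (chezure-compile-set '("[0-9]+" "[a-z]+" "^xyz")))

(chezure-set? set)                    ;; => #t
(chezure-set-has-match? set "abc123") ;; => #t, at least one pattern matches
(chezure-set-matches set "abc123")    ;; => (#t #t #f), one boolean per pattern
```

`chezure-set-matches` reports which patterns matched but not where; to obtain spans, compile the pattern of interest with `chezure-compile` and use `chezure-find`.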
chezure-split
procedure: (chezure-split chezure str)
procedure: (chezure-split chezure str limit)
procedure: (chezure-split chezure str limit preserve?)
procedure: (chezure-split chezure str limit preserve? remove-empty?)
returns: a list of substrings
library: (chezure)
`chezure-split` splits the string `str` using `chezure`. `limit` sets the maximum number of splits; `0` means no limit at all. If `preserve?` is `#t`, the matched substrings are kept in the result. If `remove-empty?` is `#t`, all empty strings (including strings that contain only whitespace characters) are filtered out.
`limit`, `preserve?` and `remove-empty?` default to `0`, `#f` and `#f`.
chezure-replace
procedure: (chezure-replace chezure str repl)
procedure: (chezure-replace chezure str repl limit)
returns: the replaced string
library: (chezure)
`chezure-replace` replaces the matches of `chezure` in the given string `str` with `repl`, which is either a string or a procedure. When `repl` is a procedure, it is applied to the current `captures` object and must return a string, which becomes the actual replacement. `limit` sets the maximum number of replaced occurrences; `0` means no limit at all.