Regexp SAR - a Python module providing a multi-match, event-handling regular expression engine
SAR is a new way of handling regular expressions that lets you run many regular expressions at once (the only limit being the available memory). Each regexp is added together with a callback, and the callbacks are called on each match, in the order in which the matches appear in the text.
Before installation, make sure you have the latest version of pip:
pip install --upgrade pip
Install regexp-sar:
pip install regexp-sar
from regexp_sar import RegexpSar
'''
This example finds and prints the second match of each regexp,
while also showing which regexp matched
'''
from regexp_sar import RegexpSar
sar = RegexpSar()
# string to be matched against
match_str = "hello world 123 abc 456 789"
# list of regexps, first item in pair is the regexp,
# second item in the pair is a unique word for that regexp
regexps = [
    [r'\w+', 'word'],
    [r'\d+', 'number'],
]
# add all regexps in a loop
for cur_regexp in regexps:
    def find_second_match(description):
        match_count = 0
        match_val = None
        # define inner method, to use with closure
        def callback(from_pos, to_pos):
            nonlocal match_count, match_val
            match_count += 1
            if match_count == 2:
                print("Match: " + str(description) + ": " + match_str[from_pos:to_pos])
            # move past this match before looking for the next one
            sar.continue_from(to_pos)
        return callback
    # add regexp with a callback
    sar.add_regexp(cur_regexp[0], find_second_match(cur_regexp[1]))
# run match
sar.match(match_str)
'''
Output:
Match: word: world
Match: number: 456
'''
RegexpSar() creates a new SAR instance with its own regexps and callbacks. Many instances can exist at the same time.
add_regexp adds a regexp to the SAR instance. It receives 2 parameters:
- regexp - the regexp to add
- callback - the callback that will be called on each match. The callback receives 2 parameters:
  - from_pos - the start position of the match in the matched string
  - to_pos - the end position of the match in the matched string (exclusive)
sar = RegexpSar()
sar.add_regexp('abc', lambda from_pos, to_pos: print("Match: " + str(from_pos) + "->" + str(to_pos)))
sar.match("hello abc world") # Match: 6->9
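Because to_pos is exclusive, the two positions map directly onto a Python slice. The sketch below shows one way a callback might collect matched substrings instead of printing them. The helper name make_collector is hypothetical and not part of regexp-sar, and the callback is invoked by hand here with the positions from the example above so that the snippet runs without the library.

```python
def make_collector(text, sink):
    """Build a callback that stores each matched substring in `sink`.

    Hypothetical helper for illustration; not part of regexp-sar.
    """
    def callback(from_pos, to_pos):
        # to_pos is exclusive, so it can be used directly as a slice end
        sink.append(text[from_pos:to_pos])
    return callback


text = "hello abc world"
found = []
on_match = make_collector(text, found)

# With the library, this would be registered as: sar.add_regexp('abc', on_match)
# Here the callback is called by hand with the positions reported above (6->9).
on_match(6, 9)
print(found)  # ['abc']
```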
match begins a match of the received string against the previously added regexps. It receives 1 parameter:
- the string to be matched

NOTE: this is syntactic sugar for match_from(match_str, 0)
match_from acts like match but starts the search from a custom position. It receives 2 parameters:
- string to be matched with
- start position of the match
Looks for a match starting at one specific character only, and does not go on to search for matches at the following characters
continue_from can be called only during a match/match_from. It makes matching continue from the given character index. It receives 1 parameter:
- the position from which to look for the next match
Can be called only during a match/match_from. Stops the match once matching at the current character has finished
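The control flow described above can be pictured with a toy model. The sketch below is NOT regexp-sar and makes no claim about how the engine is implemented: it is a stand-in built on Python's standard re module, with a hypothetical class name ToyScanner and a hypothetical method name stop, written only to illustrate how a callback might steer a scan that tries every start position in turn. Note that re syntax also differs from SAR syntax.

```python
import re


class ToyScanner:
    """Toy stand-in built on the standard `re` module; not regexp-sar."""

    def __init__(self):
        self._rules = []        # (compiled pattern, callback) pairs
        self._next_pos = None   # set by continue_from during a scan
        self._stopped = False   # set by stop during a scan

    def add_regexp(self, pattern, callback):
        self._rules.append((re.compile(pattern), callback))

    def continue_from(self, pos):
        self._next_pos = pos

    def stop(self):  # hypothetical name, chosen for this sketch
        self._stopped = True

    def match_from(self, text, start):
        pos = start
        self._stopped = False
        while pos < len(text) and not self._stopped:
            self._next_pos = None
            for pattern, callback in self._rules:
                found = pattern.match(text, pos)  # anchored at pos
                if found and found.end() > pos:
                    callback(found.start(), found.end())
            # Jump where the callback asked, else try the next character
            pos = self._next_pos if self._next_pos is not None else pos + 1

    def match(self, text):
        self.match_from(text, 0)


def collect_numbers(text, start):
    scanner = ToyScanner()
    seen = []

    def on_number(from_pos, to_pos):
        seen.append(text[from_pos:to_pos])
        scanner.continue_from(to_pos)  # skip past the digits just reported

    scanner.add_regexp(r"\d+", on_number)
    scanner.match_from(text, start)
    return seen


text = "hello world 123 abc 456 789"
print(collect_numbers(text, 0))   # ['123', '456', '789']
print(collect_numbers(text, 16))  # ['456', '789']

# Stopping after the first match
first = []
scanner = ToyScanner()


def on_first(from_pos, to_pos):
    first.append(text[from_pos:to_pos])
    scanner.stop()


scanner.add_regexp(r"\d+", on_first)
scanner.match(text)
print(first)  # ['123']
```

In this toy model, leaving out the continue_from call would make the scan retry from the very next character, so overlapping tails of the same number would be reported as well. Whether the real engine behaves the same way is not stated in this document.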
- . - matches any character
- \d - matches a digit character (checked by isdigit method)
- \w - matches an alphanumeric character (checked by isalnum method)
- \a - matches an alpha character (checked by isalpha method)
- \s - matches a space character (checked by isspace method)
- ^ - matches a character that does NOT match the element following it (e.g. \^\d+ will match all non-digit strings)
- '?' - matches 1 or 0 times
- '*' - matches 0 or more times
- '+' - matches 1 or more times
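To get a feel for what the character classes accept, the sketch below maps each class to the Python str method of the same name. This is an assumption made for illustration: the engine may rely on the C library functions with those names instead, which can behave differently for non-ASCII input, so only ASCII characters are used here. The names CLASS_CHECKS and classes_for are hypothetical.

```python
# Assumption: each class accepts the characters for which the Python str
# method of the same name returns True (ASCII input only in this sketch).
CLASS_CHECKS = {
    r"\d": str.isdigit,
    r"\w": str.isalnum,
    r"\a": str.isalpha,
    r"\s": str.isspace,
}


def classes_for(ch):
    """Return the class names whose check accepts the single character `ch`."""
    return [name for name, check in CLASS_CHECKS.items() if check(ch)]


for ch in ["7", "x", "_", " ", "\t"]:
    print(repr(ch), classes_for(ch))
```

Under this assumption a digit falls into both \d and \w, a letter into both \w and \a, and the underscore into none of the classes. That last point differs from Python's own re module, where \w also matches the underscore.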
To match the '\' character, the regexp must be written with 4 backslashes in total, since a Python string literal needs 2 backslashes to represent one
sar = RegexpSar()
sar.add_regexp('\\\\', lambda from_pos, to_pos: print("Match: " + str(from_pos) + "->" + str(to_pos)))
sar.match('a\\b') # Match: 1->2
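The escaping can be checked with plain Python, without the library. The sketch below only inspects the string literals used above. Reading the two remaining backslashes as "escape character plus literal backslash" inside the engine is an inference from the example, not something this document states outright.

```python
pattern = '\\\\'   # four backslashes in the source code
subject = 'a\\b'   # two backslashes in the source code

print(len(pattern))         # 2: Python turns each pair into one backslash
print(len(subject))         # 3: 'a', one backslash, 'b'
print(subject.index('\\'))  # 1, consistent with the reported match 1->2

# A raw string spells the same two-character pattern with less escaping
print(pattern == r'\\')     # True
```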
Examples may be found in the test_oousage.py file, and in the examples directory
- For more information, visit my introduction post
- For a practical example, visit Practical example where SAR comes to play.
- For my article on incorporating SAR into DLP (Data Leak Prevention), visit DLP (Data Loss Prevention) in SAR
Currently not supported. May be added in future update
Noam Nisanov - noam.nisanov@gmail.com