Email address filtering needs optimization or relaxed mode
Closed · 1 comment
phileas-benchmark results show that email address detection is more CPU intensive (and requires more memory & stack space) than other regex-based filters.
Performance of single identifiers with 4k values:
mask_credit_cards - 35k calls/sec
mask_bitcoin_addresses - 31k calls/sec
mask_iban_codes - 26k calls/sec
mask_bank_routing_numbers - 27k calls/sec
mask_ssns - 16k calls/sec
mask_phone_numbers - 14k calls/sec
mask_email_addresses - 5k calls/sec 🔥
The current regex is known to be pretty expensive, so it might make sense to offer a "relaxed" version that performs better without giving up too much accuracy?
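To make the accuracy side of that trade-off concrete, here is a minimal sketch in Python. Both patterns are hypothetical stand-ins written for this example, not the regex the email filter actually uses, and `compare` is just an illustrative helper. It shows how a looser pattern can pick up matches that a stricter one would trim; whether it is actually faster would still need to be measured.

```python
import re

# Hypothetical patterns for illustration only; neither is the project's regex.
STRICT = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
RELAXED = re.compile(r"[^\s@]+@[^\s@]+\.[^\s@]+")


def compare(text):
    """Return which matches the two patterns agree and disagree on."""
    strict = set(STRICT.findall(text))
    relaxed = set(RELAXED.findall(text))
    return {
        "both": strict & relaxed,
        "strict_only": strict - relaxed,
        "relaxed_only": relaxed - strict,
    }


sample = (
    "Contact alice@example.com or bob.smith@mail.example.org today. "
    "Cc carol@example.com, please."
)
result = compare(sample)

if __name__ == "__main__":
    for key, values in result.items():
        print(key, sorted(values))
```

On this sample the relaxed pattern keeps the trailing comma after the third address, which is the kind of difference an accuracy check on real data would need to count.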
@jzonthemtn I'm looking at a few regex variations that show better performance, but I need to do some more testing to see how accuracy is affected in the data I have available.
One interesting bit though -- the email address filter currently does not use the \b...\b fencing that many of the regex-based filters use. Wrapping the current email address regex in \b...\b roughly doubles performance on its own. I think that makes sense since the boundaries limit where a match can start and end, which cuts down on how much work the greedy parts of the pattern do.
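A rough way to try the fencing idea outside the benchmark is sketched below in Python. The `EMAIL` pattern is a hypothetical stand-in (not the filter's real regex), the input text is synthetic, and Python's `re` engine may behave differently from the one the filters run on, so the timings are only indicative.

```python
import re
import timeit

# Hypothetical email pattern for illustration; not the project's regex.
EMAIL = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"

unfenced = re.compile(EMAIL)
# Same pattern wrapped in word boundaries, i.e. the \b...\b fencing.
fenced = re.compile(r"\b(?:" + EMAIL + r")\b")

# Long runs of word characters with no "@" are the costly case: without a
# leading \b, the engine retries the pattern from every character in a run.
text = ("x" * 200 + " ") * 20 + "reach me at alice@example.com "


def time_pattern(pattern, repeat=50):
    """Total seconds for `repeat` full scans of the synthetic text."""
    return timeit.timeit(lambda: pattern.findall(text), number=repeat)


matches_unfenced = unfenced.findall(text)
matches_fenced = fenced.findall(text)

if __name__ == "__main__":
    print("unfenced:", time_pattern(unfenced))
    print("fenced:  ", time_pattern(fenced))
```

Both variants find the same address in this text, but fencing can change results on other inputs (for example an address that starts or ends with a non-word character), so accuracy needs to be re-checked alongside speed.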
👆 Since we're also discussing use of \b from a confidence standpoint (in #120), I thought it was kinda neat to see how much the \b...\b fencing plays into performance too.