svent/sift

bad performance with regular expression

Closed · 1 comment

# head -1 mylog.txt
Mar 21 13:15:55 c.xxx.com [21/Mar/2017: 13:15:54 +0800] 200 12.24.19.109 0.049 178 c.xxx.com POST /projects/myapi HTTP/1.0 http://c.xxx.com/projects/api?projectcode=l22294x6xwt "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
# wc -l mylog.txt
100000 mylog.txt
# time grep -c -E '(GET|POST).*' mylog.txt
81680

real    0m0.085s
user    0m0.075s
sys     0m0.009s
# time perl -ne 'print if $_ =~ /(GET|POST).*/;' mylog.txt |wc -l
81680

real    0m0.261s
user    0m0.275s
sys     0m0.109s
# time sift -c '(GET|POST).*' mylog.txt
81680

real    0m6.319s
user    0m6.303s
sys     0m0.034s
# time sift -c -e '(GET|POST).*' mylog.txt
81680

real    0m6.300s
user    0m6.274s
sys     0m0.048s
# sift -V
sift 0.9.0 (linux/amd64)
Copyright (C) 2014-2016 Sven Taute

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, version 3 of the License.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
# uname -a #CentOS release 6.5
Linux srv-c-web1 2.6.32-431.23.3.el6.x86_64 #1 SMP Thu Jul 31 17:20:51 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

svent commented

Thanks for your report. In this specific case, sift is slow because of the regular expression engine it is built on.
The engine does not optimize the search here because the pattern has no single fixed string; if you search for just GET.* instead, the search is much faster.
Unfortunately, this cannot be fixed in sift itself and will not improve until Go's regex engine receives more optimizations for cases like this.
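
To illustrate what "no fixed string" means here, the following is a minimal sketch using Go's standard regexp package. It assumes that the literal prefix reported by Regexp.LiteralPrefix is a reasonable proxy for the kind of fixed-string optimization described above; this is not taken from sift's source, and the helper name literalPrefix is made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// literalPrefix is a hypothetical helper for this example. It returns the
// fixed string that every match of pattern must begin with, as reported by
// Go's standard regexp package. An empty result means the engine found no
// literal prefix it could search for directly.
func literalPrefix(pattern string) string {
	re := regexp.MustCompile(pattern)
	prefix, _ := re.LiteralPrefix()
	return prefix
}

func main() {
	// The two patterns discussed in this issue.
	for _, p := range []string{`GET.*`, `(GET|POST).*`} {
		fmt.Printf("%-14s literal prefix: %q\n", p, literalPrefix(p))
	}
}
```

With the standard library, GET.* reports the prefix "GET", while (GET|POST).* reports an empty prefix, which is consistent with the alternation being the slow case.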