/bmatch

Faster search for fixed patterns with Go

Primary LanguageGo


bmatch.go


Build Status

bmatch is faster fixed pattern search for Go.

Go provides different search mechanisms to find the indices of a fixed string or byte pattern, that are backed by different search algorithms:

  • strings.Index switches by the pattern length
    • 1: generic assembler function strings•IndexByte(..) (i.e. for OSX/darwin see /usr/local/go/src/runtime/asm_amd64.s)
    • 1-31: generic assembler function strings•indexShortString(..) (i.e. for OSX/darwin see /usr/local/go/src/runtime/asm_amd64.s)
    • 31 calling a Rabin-Karp search algorithm using a rolling addition hash and use string comparison to prove the result

  • strings.Replace for single string replacing invokes a Boyer-Moore search over a string in /usr/local/go/src/strings/search.go implementing a stringFinder type
  • bytes.Index using generic assembler code (bytes•IndexByte(..)) to find the index of the first element of the pattern (i.e. for OSX/darwin see /usr/local/go/src/runtime/asm_amd64.s) and then compares the following sequence using another assembler function (Equal(..)).

The before mentioned assembler routines compare each byte of the haystack one by one. See unsafeMEMCHR.go for a faster approach. All but the Boyer-Moore search are relatively slow - even in comparison to python str.find function (implemented in C). bmatch's underlying algorithms outperform all of Go's search functions.


If you installed bmatch before 2016-03-20 it is recommended that you reinstall bmatch as there was a bugfix for "needles at haystack start"


Usage

Install bmatch by the usual

go get github.com/AndreasBriese/bmatch

In your go code use import "github.com/AndreasBriese/bmatch" and apply it on []byte types of the haystack (byte sequence to search in) and needle (pattern to search for).

index, err := bmatch.Index(&haystack, &needle) gives the first (left) index or -1 if not present,

indices, err := bmatch.FindAll(&haystack, &needle) to get an []int array with indices of (overlapping!) occurrences, or

count, err := bmatch.Count(&haystack, &needle) to get the number of (overlapping!) occurences of needle in haystack.

Benchmarks (go test -bench . cpu=1)

 ###############
 bmatch.go
 Haystack: ./theciaworldfactb00571.zip loaded (3013205 bytes)
 Alphabet size: 93
 
 PASS
 BenchmarkM_Bmatch_10_C          	       1	1002289595 ns/op 
 BenchmarkM_BoyerMoore_10_C       	       1	2683892806 ns/op
 BenchmarkM_BytesIndex_10_C       	       1	2630128009 ns/op
 BenchmarkM_Bmatch_30_C           	       2	 528011588 ns/op
 BenchmarkM_BoyerMoore_30_C       	       1	1402345246 ns/op
 BenchmarkM_BytesIndex_30_C       	       1	2654263863 ns/op
 BenchmarkM_Bmatch_1024_C         	      20	 105107250 ns/op
 BenchmarkM_BoyerMoore_1024_C     	       5	 214846485 ns/op
 BenchmarkM_BytesIndex_1024_C     	       1	2592599288 ns/op
 BenchmarkM_Bmatch_30_FI        	      10	 104080809 ns/op
 BenchmarkM_BoyerMoore_30_FI    	       5	 245957698 ns/op
 BenchmarkM_StringsIndex_30_FI  	       3	 392142229 ns/op
 BenchmarkM_BytesIndex_30_FI    	       2	 776670641 ns/op
 BenchmarkM_Bmatch_1024_FI      	      30	  43820654 ns/op
 BenchmarkM_BoyerMoore_1024_FI  	      10	 103821315 ns/op
 BenchmarkM_StringsIndex_1024_FI	       1	1306085958 ns/op
 BenchmarkM_BytesIndex_1024_FI  	       1	1276277951 ns/op

on a MacBookPro 2013 with i7 and 8GB Ram searching for 500 random patterns plus 20 patterns that are not present in the "1995 CIA World Factbook" (~3MB english natural text). Benchmark naming: .._searchFunction_patternMaximumLength_C=count|FI=first left Index

go test will try to download the test corpus from http://archive.org if it is not present in the folder.

License
bmatch.go (C)opyright 2016 Andreas Briese, eduToolbox@Bri-C GmbH, Sarstedt, with MIT license - see the headers in the code in the subfolders of the various search algorithms for details and reference to their predecessors (C-code mostly taken from the SMART tool http://www.dmi.unict.it/~faro/smart/ v.13.02).