Table Of Contents
This is an implementation of the Boyer-Moore substring search algorithm in pure python.
It is a shameless copy-paste of the python reference code provided here, with modifications to support the following additional features:
- Searching in files without reading the whole file into memory, allowing handling of large files
- Full unicode support
See the API documentation for more details.
Install from pip
.
pip install boyermoore
>>> from boyermoore import search_file >>> >>> offsets = search_file("pattern!", "file.txt") # Find all occurrences of "pattern!" in file "file.txt" >>> offsets # Display found occurrences [12, 456, 10422] # Pattern occurs at byte offsets 12, 456, and 104222
>>> from boyermoore import search_file >>> >>> offsets = search_file("pattern!", "file.txt", greedy=False) # Find the first occurrence of "pattern!" in file "file.txt" >>> offsets # Display found occurrences [12] # First occurrence of pattern is at byte offset 12
The following section illustrates the average speed of the boyermoore.search_file
function when searching for a unicode string in files of sizes ranging from 32MB to 4GB.
The test was executed using Python 3.7.6 on a Windows 10 system with an Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz and 32 GB of RAM.
The test searches for all occurrences of a fixed unicode string in a series of test files. The unicode string is:
Hello नमस्ते Привет こんにちは
("Hello" in English, followed by the Hindi translation, followed by the Russian translation, followed by the Japanese translation)
Each test file has 2 occurrences of the unicode string, one at the very beginning (byte offset of 0) and one at the very end (byte offset of [file_length - pattern_length]).
The following table shows the times taken to search for all occurences of the unicode string "Hello नमस्ते Привет こんにちは" inside test files of various sizes.
File size | Time (seconds) |
---|---|
32 MB | 0.31 |
64 MB | 0.55 |
128 MB | 1.07 |
256 MB | 1.96 |
512 MB | 3.87 |
1 GB | 7.56 |
4 GB | 31.01 |
Contributions are welcome, please open a pull request at https://github.com/eriknyquist/boyermoore and ensure that:
- All existing unit tests pass (run tests via
python setup.py test
) - New unit tests are added to cover any modified/new functionality (run
python code_coverage.py
to ensure that coverage is above 98%)
You will need to install packages required for development, these are listed in dev_requirements.txt
:
pip install -r dev_requirements.txt
If you have any questions about / need help with contributions or unit tests, please contact Erik at eknyquist@gmail.com.