py-aho-corasick
- Free software: MIT license
- The prototype is inspired by and borrowed from Carolyn Shen
- Pure Python implementation
- Python2 && Python3 support
- Unicode && UTF-8 encoding support
- Pickle-able serialization
I re-invented this algorithm due to the need in my work. I have checked other Aho-Corasick implementations for Python, but none of them meet my requirements.
- pyahocorasick by Wojciech Mula
- Can't use unicode in Python 2.x.
- noaho by Jeff Donner
- No support for the pickle protocol.
- acora by Stefan Behnel
- No support for a dict-like protocol to associate a value to a string key.
- ahocorasick by Danny Yoo
- I failed to build it as an egg distribution.
If you have the same issues like me, welcome abroad!
Install:
pip install py_aho_corasick
Usage:
from py_aho_corasick import py_aho_corasick # keywords only A = py_aho_corasick.Automaton(['cash', 'shew', 'ew']) text = "cashew" for idx,k,v in A.get_keywords_found(text): assert text[idx:idx+len(k)] == k # keywords and values kv = [('cash',1), ('shew',2), ('ew',3)] A = py_aho_corasick.Automaton(kv) text = "cashew" for idx,k,v in A.get_keywords_found(text): assert text[idx:idx+len(k)] == k assert v == dict(kv)[k]
Compared with pyahocorasick (C extention)
You can run the testing script to get this:
# Requirements: # pip install pyahocorasick python cmp.py
- pyahocorasick: text of 1000000 length, 50000 keywords, building time 0.20450282096862793 and searching time cost 0.18479514122009277
- py_aho_corasick: text of 1000000 length, 50000 keywords, building time 3.971259117126465 and searching time cost 10.219590902328491
- brute force: text of 1000000 length, 50000 keywords, building time 0 and searching time cost 35.875592947006226
Sorry about the poor performance :-(. Most of the time is consumed during the characters matching in Trie.
Run tests:
# testing against py2 and py3 tox
- Performance optimization