/py-aho-corasick

A Pure Python implementation of Aho-Corasick algorithm.

Primary LanguagePythonMIT LicenseMIT

py-aho-corasick

Documentation Status Updates

py-aho-corasick

  • Free software: MIT license
  • The prototype is inspired by and borrowed from Carolyn Shen

Features

  • Pure Python implementation
  • Python2 && Python3 support
  • Unicode && UTF-8 encoding support
  • Pickle-able serialization

Background

I re-invented this algorithm due to the need in my work. I have checked other Aho-Corasick implementations for Python, but none of them meet my requirements.

  • Can't use unicode in Python 2.x.
  • No support for the pickle protocol.
  • No support for a dict-like protocol to associate a value to a string key.
  • I failed to build it as an egg distribution.

If you have the same issues like me, welcome abroad!

Usage

Install:

pip install py_aho_corasick

Usage:

from py_aho_corasick import py_aho_corasick

# keywords only
A = py_aho_corasick.Automaton(['cash', 'shew', 'ew'])
text = "cashew"
for idx,k,v in A.get_keywords_found(text):
    assert text[idx:idx+len(k)] == k

# keywords and values
kv = [('cash',1), ('shew',2), ('ew',3)]
A = py_aho_corasick.Automaton(kv)
text = "cashew"
for idx,k,v in A.get_keywords_found(text):
    assert text[idx:idx+len(k)] == k
    assert v == dict(kv)[k]

Performance

Compared with pyahocorasick (C extention)

You can run the testing script to get this:

# Requirements:
# pip install pyahocorasick
python cmp.py
  • pyahocorasick: text of 1000000 length, 50000 keywords, building time 0.20450282096862793 and searching time cost 0.18479514122009277
  • py_aho_corasick: text of 1000000 length, 50000 keywords, building time 3.971259117126465 and searching time cost 10.219590902328491
  • brute force: text of 1000000 length, 50000 keywords, building time 0 and searching time cost 35.875592947006226

Sorry about the poor performance :-(. Most of the time is consumed during the characters matching in Trie.

Development

Run tests:

# testing against py2 and py3
tox

TODO

  • Performance optimization