Python wrapper of the Enhanced Suffix Array (ESA).
This is a version adapted for Python, originating from the C++ implementation found here.
pip install esaxx-py
from esa import esaxx
def print_snippet(T, beg, length):
for i in range(length):
c = T[beg + i]
print("_" if c.isspace() else c, end="")
T = 'abracadabra'
SA = [0]*len(T)
L = [0]*len(T)
R = [0]*len(T)
D = [0]*len(T)
k = 0x100
node_num = 0
node_num = esaxx(T, SA, L, R, D, len(T), k, node_num)
if node_num == -1:
exit()
print(f"node:{node_num}")
for i in range(node_num):
print(f"{i}\t{R[i] - L[i]}\t{D[i]}\t", end="")
print_snippet(T, SA[L[i]], D[i])
print()
node:5
0 2 4 abra
1 5 1 a
2 2 3 bra
3 2 2 ra
4 11 0
Alternatively, you can use enumSubstring.py:
ehco abracadabra | python enumSubstring.py
For the original implementation:
g++ enumSubstring.cpp
echo abracadabra | ./a.out
In the original implementation, the return value of esaxx was an error code, not node_num. However, due to the constraints of Python and the difficulty in passing by reference, I've chosen to return node_num.
To obtain Maximal Substrings:
from esa import esaxx
def print_snippet(T, beg, length):
for i in range(length):
c = T[beg + i]
print("_" if c.isspace() else c, end="")
T = 'abracadabra'
SA = [0]*len(T)
L = [0]*len(T)
R = [0]*len(T)
D = [0]*len(T)
k = 0x100
node_num = 0
node_num = esaxx(T, SA, L, R, D, len(T), k, node_num)
if node_num == -1:
exit()
size = len(T)
# Record changes in BWT
rank = [0] * size
r = 0
for i in range(size):
if i == 0 or T[(SA[i] + size - 1) % size] != T[(SA[i - 1] + size - 1) % size]:
r += 1
rank[i] = r
print("count\tlength\tstring")
# Enumerate maximal partial strings
for i in range(node_num):
if D[i] == 0 or (rank[R[i] - 1] - rank[L[i]] == 0):
continue
print(f"{R[i] - L[i]}\t{D[i]}\t", end="")
print_snippet(T, SA[L[i]], D[i])
print()
The first column represents the frequency of occurrence, and the second column represents the length of the string.
count length string
2 4 abra
5 1 a
Here, even strings that appear more than once are listed, even if they are just one character. If you want to skip those, you can use if len < 2: continue
.
enumMaxSubstring.py:
ehco abracadabra | python enumMaxSubstring.py
C++ (enumMaxSubstring.cpp):
g++ enumMaxSubstring.cpp
echo abracadabra | ./a.out
Introduce a new function: get_maximal_substrings(str)
.
This function allows for easier extraction of maximal substrings from a given string.
Usage Example:
from esa import get_maximal_substrings
T = 'abracadabra'
substrings = get_maximal_substrings(T)
print("count\tlength\tstring")
for substring in substrings:
print(f'{substring.count}\t{substring.length}\t{substring.string})
count length string
2 4 abra
5 1 a
Available unicode character
Usage Example:
from esa import get_maximal_substrings_unicode
T = '松島やああ松島や松島や'
substrings = get_maximal_substrings_unicode(T)
print(T)
print("count\tlength\tstring")
for substring in substrings:
print(f'{substring.count}\t{substring.length}\t{substring.string})
松島やああ松島や松島や
count length string
3 3 松島や
3 2 島や
3 1 や
2 1 あ
C++ Implementation:
- https://github.com/hillbig/esaxx
- https://code.google.com/archive/p/esaxx/
- https://github.com/TNishimoto/esaxx
Rust Version:
Software using esaxx:
- https://github.com/huggingface/tokenizers
- https://github.com/google/sentencepiece
- https://github.com/shuyo/ldig
- http://phontron.com/pialign/
List of papers using esaxx:
- https://ipsj.ixsq.nii.ac.jp/ej/index.php?active_action=repository_view_main_item_detail&page_id=13&block_id=8&item_id=47681&item_no=1
- https://www.anlp.jp/proceedings/annual_meeting/2012/pdf_dir/A3-1.pdf
- https://www.anlp.jp/proceedings/annual_meeting/2012/pdf_dir/D5-2.pdf
Articles about esaxx
: