Inconsistent results in heuristic_split
cs-wangchong opened this issue · 1 comments
Fix a bug
The bug is that Ronin may split the same identifier differently from run to run, depending on the order of the terms in the set common_terms_with_numbers.
Reproduction
I added md5sum to the set common_terms_with_numbers and then ran ronin.split("md5sum") several times.
The splitting results were sometimes ["md5sum"] and sometimes ["md5", "sum"].
Reason & Solution
I checked the code and found that the heuristic_split function in simple_splitters.py relies on the regular expression _exceptions_re.
_exceptions_re is generated from common_terms_with_numbers without considering the order of the terms in the set.
This means that if "md5" comes before "md5sum" in _exceptions_re, the split result is ["md5", "sum"]; if "md5sum" comes before "md5", the split result is ["md5sum"].
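The order dependence can be reproduced outside Ronin with a minimal sketch. Python's re module tries alternatives left to right and stops at the first one that matches, rather than picking the longest, and the iteration order of a set of strings can differ between interpreter runs because of hash randomization. The helper build_alternation below is hypothetical and only mimics how the pattern is joined; the re.escape call is an addition of this sketch, not part of Ronin's code.

```python
import re

def build_alternation(terms):
    """Hypothetical helper: join terms into one case-insensitive
    alternation, keeping the order in which they are given."""
    return re.compile('(' + '|'.join(re.escape(t) for t in terms) + ')', re.I)

# Lists are used instead of a set so that each order is forced explicitly.
short_first = build_alternation(['md5', 'md5sum'])
long_first = build_alternation(['md5sum', 'md5'])

# The first alternative that matches wins, even if a later one is longer.
first_match_short = short_first.match('md5sum').group(0)
first_match_long = long_first.match('md5sum').group(0)

print(first_match_short)  # md5
print(first_match_long)   # md5sum
```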
Solution: Sort the terms by length, longest first, when generating _exceptions_re.
_exceptions_re = re.compile(r'(' + '|'.join(sorted(common_terms_with_numbers, key=lambda term: len(term), reverse=True)) + ')', re.I)
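As a rough check of this idea, the sketch below builds the pattern from every possible input order and confirms that the match on "md5sum" no longer changes. The function build_sorted_alternation is a hypothetical stand-in for the line above, "sha1" is an illustrative extra term that is not taken from Ronin's term set, and re.escape is again an addition of this sketch.

```python
import re
from itertools import permutations

def build_sorted_alternation(terms):
    """Hypothetical helper: sort terms longest first, then join them
    into one case-insensitive alternation."""
    ordered = sorted(terms, key=len, reverse=True)
    return re.compile('(' + '|'.join(re.escape(t) for t in ordered) + ')', re.I)

# 'sha1' is only here to give the permutations more than two orders.
terms = ['md5', 'md5sum', 'sha1']

# Collect the match for every possible input order of the terms.
results = {
    build_sorted_alternation(order).match('md5sum').group(0)
    for order in permutations(terms)
}

print(results)  # {'md5sum'}
```

Terms of equal length keep their incoming order after the sort, but two different terms of the same length cannot both match at the same position with different text, so the remaining order variation should not affect the match.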
Thank you for this, and my apologies for taking so long to reply. I think your solution (sorting by length) sounds like a good idea. I want to run some tests first but it does sound like this will be an improvement.