Inconsistent results in heuristic_split

Question

cs-wangchong opened this issue 3 years ago · 1 comments

Fix a bug

The bug is that Ronin may split the same identifier differently from one run to the next, depending on the order in which terms are read from the set common_terms_with_numbers.

Reproduction

I added md5sum to the set common_terms_with_numbers and then ran ronin.split("md5sum") several times.
The result was sometimes ["md5sum"] and sometimes ["md5", "sum"].

Reason & Solution

I checked the code and found that the heuristic_split function in simple_splitters.py relies on the regular expression _exceptions_re.
_exceptions_re is generated by joining the terms in common_terms_with_numbers into an alternation, without considering the order in which the set yields them.
Because alternatives are tried left to right and the first one that matches wins, the result depends on that order: if "md5" comes before "md5sum" in _exceptions_re, the split result is ["md5", "sum"]; if "md5sum" comes before "md5", the split result is ["md5sum"].

Solution: Sort the terms by term length when generating _exceptions_re.

_exceptions_re = re.compile(r'(' + '|'.join(sorted(common_terms_with_numbers, key=len, reverse=True)) + ')', re.I)
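
The ordering effect can be reproduced with the standard re module alone, without Ronin. The sketch below is only an illustration: build_exceptions_re is a hypothetical helper written for this example, not a function in simple_splitters.py, and the two-term list stands in for the real common_terms_with_numbers set. A list is used for the unsorted cases so the order is fixed. With a real set of strings, the iteration order can change between interpreter runs because Python randomizes string hashes per process by default, which would explain why the result changed from run to run.

```python
import re


def build_exceptions_re(terms, sort_by_length=False):
    """Hypothetical stand-in for how _exceptions_re is built (not Ronin's code)."""
    terms = list(terms)
    if sort_by_length:
        # Longest terms first, so a term is tried before any of its prefixes.
        terms = sorted(terms, key=len, reverse=True)
    return re.compile('(' + '|'.join(terms) + ')', re.I)


# Alternatives are tried left to right; the first one that matches wins.
short_first = build_exceptions_re(['md5', 'md5sum'])
long_first = build_exceptions_re(['md5sum', 'md5'])

# With length sorting, the input order no longer matters.
sorted_re = build_exceptions_re({'md5', 'md5sum'}, sort_by_length=True)

print(short_first.match('md5sum').group(0))  # md5
print(long_first.match('md5sum').group(0))   # md5sum
print(sorted_re.match('md5sum').group(0))    # md5sum
```

Sorting by length alone leaves the relative order of equal-length terms unspecified, but two distinct terms of the same length cannot be prefixes of one another, so that should not bring back the prefix problem described above.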

Answer 1 · 2022-07-17T03:59:52.000Z

Thank you for this, and my apologies for taking so long to reply. I think your solution (sorting by length) sounds like a good idea. I want to run some tests first but it does sound like this will be an improvement.