
Regex not working as expected: last 7 characters are not matching

VityaSchel opened this issue · 2 comments

Hello. First of all, I would like to thank you for your work and for adding regex support. However, it seems I can't match the last 7 characters of an onion address (before .onion).

This example works:

> ./mkp224o -d ../test "...............................................ffff"
set workdir: ../test/
in total, 1 filter
using 4 threads

But when you add characters after that and try to match the last 4 chars, it never finds a match:

> ./mkp224o -d ../test "......................................................yd"
set workdir: ../test/
in total, 1 filter
using 4 threads
^Cwaiting for threads to finish... done.

I noticed that all addresses have "d" at the end, so I tried matching the character before it:

> ./mkp224o -d ../test "......................................................y"
set workdir: ../test/
in total, 1 filter
using 4 threads
^Cwaiting for threads to finish... done.

Not working :(

Also, why is there no "$" to match the end of the string?

And why does it sometimes replace one char in the matching string?

Most of this has been answered in #5:

Matching the end of an onion address is not supported because of a performance optimization.
The checksum is stored at the end of the address, and it's SHA3, so computing it before checking would slow down cases where checking the end of the address isn't needed.
The regex filter you mentioned doesn't end with $, so only the smallest portion (the first 2 characters) is matched.
They all end with d because of the constant version byte, which is included in all v3 addresses (see the [ONIONADDRESS] section of the specification).

The last 3 bytes, which cover the last 5 base32 characters, are not calculated until filtering is complete, so you cannot filter on them.
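To make the layout concrete, here is a minimal Python sketch of how a v3 address is assembled according to the [ONIONADDRESS] section of the specification: 32 bytes of public key, then a 2-byte SHA3-256 checksum, then the version byte. This is not mkp224o's code. The function names and the placeholder key bytes are made up for illustration.

```python
import base64
import hashlib

VERSION = b"\x03"  # constant version byte for v3 onion addresses


def onion_address(pubkey: bytes) -> str:
    """Build the 56-character v3 address (without ".onion") from a 32-byte public key."""
    if len(pubkey) != 32:
        raise ValueError("expected a 32-byte public key")
    # The checksum is the first 2 bytes of SHA3-256 over a fixed label, the key, and the version.
    checksum = hashlib.sha3_256(b".onion checksum" + pubkey + VERSION).digest()[:2]
    # 35 bytes = 280 bits = exactly 56 base32 characters, so there is no padding.
    return base64.b32encode(pubkey + checksum + VERSION).decode("ascii").lower()


def pubkey_only_prefix(pubkey: bytes) -> str:
    """First 51 characters (255 bits), which depend only on the public key."""
    return base64.b32encode(pubkey).decode("ascii").lower()[:51]


if __name__ == "__main__":
    placeholder_key = bytes(range(32))  # hypothetical bytes, not a real ed25519 key
    address = onion_address(placeholder_key)
    print(address[:51], "<- determined by the public key alone")
    print(address[51:], "<- depends on the checksum and version byte")
```

The final character is always "d" because the low 5 bits of the version byte 0x03 encode to the base32 value 3. The 52nd character mixes the last bit of the key with 4 checksum bits, which is why the last 5 characters are unavailable before the checksum has been computed.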

The regex $ character does work. There is an implicit ^ at the start of every regex query, so if you want to use $ as well, the pattern needs to be a full match. For example: .*nyan.{2}$.
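The anchoring behaviour can be illustrated with a small Python sketch. It uses Python's `re` module rather than the regex engine mkp224o is built with, so treat it only as a model of the implicit ^ semantics. The helper name and the short candidate strings are made up, and which part of the address mkp224o actually tests against is not shown here.

```python
import re


def filter_matches(pattern: str, candidate: str) -> bool:
    """Return True if pattern matches starting at the beginning of candidate.

    re.match only matches at position 0, which mimics an implicit leading ^.
    """
    return re.match(pattern, candidate) is not None


if __name__ == "__main__":
    # Without $, only a prefix has to match.
    print(filter_matches("ab", "abcdef"))
    # A pattern that occurs later in the string does not match without a leading .*
    print(filter_matches("cd", "abcdef"))
    # With $, the pattern must account for the whole string.
    print(filter_matches(".*nyan.{2}$", "abcnyanxy"))
    print(filter_matches(".*nyan.{2}$", "abcnyanxyz"))
```

With an implicit ^, a pattern such as `.*nyan.{2}$` has to consume everything before "nyan" with `.*` and leave exactly two characters after it.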

As for your question about one char being replaced in the matching string: if you're referring to any of the last 5 chars, that's expected for the reasons already mentioned. The 50th character also sometimes seems not to match the filter. I do not have an explanation for that, and it probably deserves its own GitHub issue.