ThioJoe/YT-Spammer-Purge

performance bottleneck in line 975 of operations.py

virophagesp opened this issue · 6 comments

elif spamListCombinedRegex.search(combinedStringNormalized.lower()):

I am using auto-smart mode, and I am unsure whether this is machine dependent or specific to the video I scanned. I benchmarked two copies of the function: the first unaltered, the second without this elif statement. On average, reply scanning was 30% faster without it.
I then changed the elif statement to print a debug message to test how often the condition is true. It fired only once in all 3 hours of scanning.
For both tests, the video I scanned was the entire Digital Circus pilot: https://www.youtube.com/watch?v=HwAPLk_sQ3w
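
For anyone who wants to reproduce this kind of measurement outside the program, here is a minimal sketch. It does not use the real filter: `fake_names`, `combined_regex`, `with_check` and `without_check` are hypothetical stand-ins, and the list size of 2,000 is an arbitrary assumption, so the numbers only illustrate the method, not the actual 30% figure.

```python
import re
import timeit

# Hypothetical stand-in for the real combined spam list (assumption: 2,000 entries).
fake_names = [f"spamuser{i}" for i in range(2000)]
combined_regex = re.compile("|".join(re.escape(name) for name in fake_names))

# Hypothetical comment text that matches none of the names.
comment = "Great video, the animation in this pilot is amazing!"

def with_check(text):
    # Mirrors the shape of the elif condition: search the lowercased text.
    return bool(combined_regex.search(text.lower()))

def without_check(text):
    # Same call shape, but the regex search is skipped entirely.
    return False

t_with = timeit.timeit(lambda: with_check(comment), number=200)
t_without = timeit.timeit(lambda: without_check(comment), number=200)
print(f"with regex check:    {t_with:.4f}s for 200 calls")
print(f"without regex check: {t_without:.4f}s for 200 calls")
```

Swapping in the real compiled filter and a sample of real comments would give a more meaningful comparison.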

What does this condition check?
Can this be removed?
Is there a faster way to check this condition?
Are there any conditions that guarantee this will come out as false, which could be checked beforehand to avoid running it in the first place?

If anyone other than ThioJoe knows what it does, information would be greatly appreciated.

Well, that particular filter is by far the largest, I believe. It basically combines all the hard-coded individual spam accounts and other entries from this repo: https://github.com/ThioJoe/YT-Spam-Lists

I've considered pruning the old entries from that list but haven't gotten around to it.

On a related note, are you using the latest 2.18.0-Beta3 version? That one saves the compiled regex filters after loading them the first time, so at least it will be faster to load the filters before the scan starts.

Yes, I am using the beta build. After testing so many modifications and waiting for the filters to load, I wish I had used it sooner.

Wait a minute: in line 408 of prepare_modes.py it adds the names from the list and converts them to upper case, but the slow line searches a lowercased string.

Yeah, there is a reason for that, but it's a bit hard to explain. The confusables library basically builds a regex that matches a string of characters, including anything that even looks like those letters. So even though the input is all upper case, the resulting pattern still matches lower case characters. I found that starting with the string in upper case instead of lower case gave better coverage of the confusable characters, for whatever reason.

It's not like it's trying to search for upper-case-only patterns in the lowercased comments.
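
To illustrate the idea, here is a rough sketch of how a look-alike pattern built from an upper case name can still match lowercased text. This is not the confusables library or its API: the `LOOKALIKES` table and the `lookalike_regex` helper are hypothetical and tiny, while the real confusable data is far larger.

```python
import re

# Hypothetical, hand-made look-alike table keyed by upper case letters.
# Each value lists characters that should count as "the same" letter.
LOOKALIKES = {
    "S": "Ss5$",
    "P": "Pp",
    "A": "Aa4@",
    "M": "Mm",
}

def lookalike_regex(name):
    """Build one character class per letter of the upper-cased name."""
    parts = []
    for ch in name.upper():
        # Fall back to the letter itself in both cases if it is not in the table.
        chars = LOOKALIKES.get(ch, ch + ch.lower())
        parts.append("[" + re.escape(chars) + "]")
    return re.compile("".join(parts))

pattern = lookalike_regex("SPAM")

# The pattern came from an upper case string, yet it matches lowercased text
# because every character class also contains the lower case forms.
print(bool(pattern.search("Buy SPAM now".lower())))
print(bool(pattern.search("buy sp4m now")))
print(bool(pattern.search("nothing to see here")))
```

The case of the starting string only decides which table entries get looked up, not which case the final pattern can match.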

I see