optimaize/language-detector

MAIL_REGEX should be limited

tballison opened this issue

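If you try to detect the language of a string of 50,000 'a' characters, the MAIL_REGEX in URLTextFilter takes a very long time. The input contains no '@', so at every start position the unbounded `+` consumes the rest of the string and then backtracks through all of it, which makes the scan quadratic in the input length.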

If you add reasonable limits to the quantifiers, the performance is much better. Change

    private static final Pattern MAIL_REGEX = Pattern.compile("[-_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+");

to

    private static final Pattern MAIL_REGEX = Pattern.compile("[-_.0-9A-Za-z]{1,250}@[-_0-9A-Za-z]{1,250}[-_.0-9A-Za-z]{1,250}");
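A minimal sketch for reproducing the difference outside the library. The class name `MailRegexDemo` and its helper methods are hypothetical, and the `replaceAll(" ")` call is an assumption about how the filter applies the pattern, not a copy of the URLTextFilter code. Both patterns are taken verbatim from this issue.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class MailRegexDemo {
    // Current pattern from the issue: unbounded quantifiers.
    static final Pattern UNBOUNDED =
            Pattern.compile("[-_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+");

    // Proposed pattern from the issue: each run is capped at 250 characters.
    // Note that it also admits '_' in the two domain character classes.
    static final Pattern BOUNDED =
            Pattern.compile("[-_.0-9A-Za-z]{1,250}@[-_0-9A-Za-z]{1,250}[-_.0-9A-Za-z]{1,250}");

    // Builds a string of n copies of c (avoids String.repeat so it runs on Java 8).
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    // Assumed usage: replace every match with a single space.
    static String strip(Pattern pattern, String input) {
        return pattern.matcher(input).replaceAll(" ");
    }

    // Wall-clock time of one strip() call, in milliseconds.
    static long timeMillis(Pattern pattern, String input) {
        long start = System.nanoTime();
        strip(pattern, input);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        String input = repeat('a', 50_000);
        System.out.println("bounded:   " + timeMillis(BOUNDED, input) + " ms");
        // The unbounded scan is quadratic in the input length, so this line is slow.
        System.out.println("unbounded: " + timeMillis(UNBOUNDED, input) + " ms");
    }
}
```

With the bounded pattern, each start position inspects at most 250 characters before giving up, so the work grows linearly with the input. One side effect to be aware of: an address whose local part is longer than 250 characters would only be partially matched.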