UrlTextFilter does not find all URLs, which results in wrong detection

Opened this issue 7 years ago · 0 comments

My documents contain URLs, so I added the UrlTextFilter to remove them and get good language detection. On some test data, however, the detected language was still wrong, or correct only with a very low probability.

With the UrlTextFilter, the test document (German text) gets a probability of 0.15 for German and 0.7 for Dutch (nl).

The URLs are rather complex and contain special characters (brackets and so on). When I remove the URLs with a more complex regular expression before sending the text to the language detector, the probability for the same text is 0.99 for German.
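
For illustration, a pre-filtering step of this kind could look like the sketch below. The class name `UrlPreFilter`, the method `stripUrls`, and the pattern itself are assumptions made for this example; they are neither the regular expression used in the test above nor part of the library. The idea is simply to treat everything from a scheme or `www.` prefix up to the next whitespace as part of the URL, so that brackets and other special characters do not end the match early.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: strips URLs from text before language detection. */
public final class UrlPreFilter {

    // Assumed pattern: a scheme (http, https, ftp) or a "www." prefix,
    // followed by every non-whitespace character up to the next space.
    // This deliberately swallows brackets, braces and trailing punctuation.
    private static final Pattern URL =
            Pattern.compile("(?i)\\b(?:https?://|ftp://|www\\.)\\S+");

    // Collapses the runs of whitespace left behind after removal.
    private static final Pattern SPACES = Pattern.compile("\\s{2,}");

    private UrlPreFilter() {
        // static helper, not meant to be instantiated
    }

    /** Returns the text with URLs removed and whitespace normalized. */
    public static String stripUrls(String text) {
        String withoutUrls = URL.matcher(text).replaceAll(" ");
        return SPACES.matcher(withoutUrls).replaceAll(" ").trim();
    }
}
```

The cleaned string would then be passed to the language detector in place of the raw text. Matching greedily up to the next whitespace can also remove a sentence-ending period attached to a URL, which should not matter for language detection.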

So I suggest improving the regular expressions used by the UrlTextFilter.

I'll try to provide a PR, but have to check this first...