URL regex
0x0ece opened this issue · 0 comments
0x0ece commented
I think there are a couple of problems with the URL regex.
I've found issues especially with "strange chars" at the end of URLs. For example, in http://t.co/iTyBIiBB)
the trailing ) should not be part of the URL.
I've checked the JS lib, but it seems Twitter has made significant updates there, so I couldn't easily work out which changes to apply.
As for the Python version, the previous example can be fixed with:
- REGEXEN['valid_url_path_ending_chars'] = re.compile(ur'[a-z0-9\)=#\/]', re.IGNORECASE)
+ REGEXEN['valid_url_path_ending_chars'] = re.compile(ur'[a-z0-9=#\/]', re.IGNORECASE)
...
REGEXEN['valid_url'] = re.compile(u'''
(%s)
(
(https?:\/\/|www\.)
(%s)
- (/%s*%s?)?
+ (/%s*%s)?
(\?%s*%s)?
)
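To make the effect of the two changes easier to check, here is a minimal, self-contained sketch. It does not use the library's real patterns: `PATH_CHARS`, the host pattern, and the helper `build_url_regex` are simplified stand-ins I made up for illustration, and I'm assuming the general path character class allows `)`. Only the two ending-character classes are taken from the diff above.

```python
import re

# Hypothetical, simplified stand-in for the general path character class.
# Assumption: it allows ')' (the real library pattern may differ).
PATH_CHARS = r"[a-z0-9!*';:=+,.$/%#\[\]\-_~()]"


def build_url_regex(ending_chars, ending_optional):
    """Build a toy URL regex shaped like (/%s*%s?)? or (/%s*%s)?."""
    suffix = "?" if ending_optional else ""
    path = r"(?:/%s*%s%s)?" % (PATH_CHARS, ending_chars, suffix)
    # Simplified host pattern, for illustration only.
    return re.compile(r"https?://[a-z0-9.\-]+" + path, re.IGNORECASE)


# Before the change: ')' is a valid ending char and the ending is optional.
OLD = build_url_regex(r"[a-z0-9)=#/]", ending_optional=True)
# Only the first change applied: ')' removed, ending still optional.
HALF = build_url_regex(r"[a-z0-9=#/]", ending_optional=True)
# Both changes applied: ')' removed and the ending char is required.
NEW = build_url_regex(r"[a-z0-9=#/]", ending_optional=False)

text = "see (http://t.co/iTyBIiBB) for details"

print(OLD.search(text).group(0))   # http://t.co/iTyBIiBB)
print(HALF.search(text).group(0))  # http://t.co/iTyBIiBB)
print(NEW.search(text).group(0))   # http://t.co/iTyBIiBB
```

Under these assumptions, removing `)` from the ending chars is not enough on its own: as long as the ending char stays optional, the greedy path part can still swallow the `)`. Making the ending char required forces the regex to backtrack to the last valid ending character.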
But of course this does not ensure that other bugs are fixed...
Best, E.