dryan/twitter-text-py

Url regex

0x0ece opened this issue · 0 comments

I think there are a couple of problems with the URL regex.
I've found issues especially with "strange chars" at the end of URLs, for example: http://t.co/iTyBIiBB)
where the trailing ) should not be part of the URL.

I've checked the JS lib, but it seems Twitter has made significant updates since, so I couldn't easily work out which changes to apply.

As for the Python version, the previous example can be fixed with:

- REGEXEN['valid_url_path_ending_chars'] = re.compile(ur'[a-z0-9\)=#\/]', re.IGNORECASE)
+ REGEXEN['valid_url_path_ending_chars'] = re.compile(ur'[a-z0-9=#\/]', re.IGNORECASE)

...

REGEXEN['valid_url'] = re.compile(u'''
    (%s)
    (
        (https?:\/\/|www\.)
        (%s)
-        (/%s*%s?)?
+        (/%s*%s)?
        (\?%s*%s)?
    )
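
To see why both changes seem to be needed together, here is a minimal, self-contained sketch. It does not use the library's real patterns: `build_url_regex`, `GENERAL_PATH_CHARS` and the domain pattern are simplified, hypothetical stand-ins, assuming only that the general path character class allows `)` (as it must, to support parentheses inside paths).

```python
import re

# Hypothetical, simplified stand-in for the library's general path chars.
# Assumption: ")" is allowed in the middle of a path.
GENERAL_PATH_CHARS = r"[a-z0-9\-_.,!()=#/]"


def build_url_regex(ending_chars, ending_optional):
    """Build a toy URL regex shaped like (/%s*%s?)? or (/%s*%s)?."""
    quantifier = "?" if ending_optional else ""
    path = r"(?:/%s*%s%s)?" % (GENERAL_PATH_CHARS, ending_chars, quantifier)
    return re.compile(r"https?://[a-z0-9.\-]+" + path, re.IGNORECASE)


def extract(regex, text):
    """Return the first URL-like match in text, or None."""
    match = regex.search(text)
    return match.group(0) if match else None


text = "see http://t.co/iTyBIiBB)"

# Original shape: ")" allowed as an ending char, ending char optional.
old = build_url_regex(r"[a-z0-9)=#/]", ending_optional=True)
# Only the first change: ")" removed, but the ending char is still optional,
# so the greedy general class can still swallow the ")".
half = build_url_regex(r"[a-z0-9=#/]", ending_optional=True)
# Both changes: ")" removed and the ending char required.
new = build_url_regex(r"[a-z0-9=#/]", ending_optional=False)

print(extract(old, text))   # http://t.co/iTyBIiBB)
print(extract(half, text))  # http://t.co/iTyBIiBB)
print(extract(new, text))   # http://t.co/iTyBIiBB
```

In this sketch, requiring the ending char forces the regex to backtrack off the trailing `)`. A side effect, at least in this toy version, is that a bare trailing slash (as in `http://t.co/`) is no longer included in the match, so the real patterns would need checking against more cases.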

But of course this does not ensure that other bugs are fixed...
Best, E.