URLs plugin - Better link extraction
At the moment, it finds links by searching for https?:// and going from there until it finds a space. A minor issue with this is that it matches even when that is glued to the end of a word (e.g. derphttp://), which is somewhat weird and could be a problem if issue #62 is implemented. The bigger issue is that links are sometimes wrapped in quotes, parentheses, etc., and those characters are currently treated as part of the link. Obviously, it could be a bit dodgy to try parsing something like Lorem ipsum (dolor sit amet http://example.com/), but simply (http://example.com/) or "http://example.com/" should be easy enough. Perhaps take a look at how popular IRC clients such as HexChat do it, and try to match that functionality.
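To make the two problems concrete, here is a minimal sketch. It assumes the current behaviour is roughly equivalent to the pattern https?://\S+ (the plugin's actual code isn't shown in this thread), and the names NAIVE and STRICT are made up for illustration.

```python
import re

# Assumption: roughly what the plugin does today, i.e. scan from "http(s)://"
# up to the next whitespace character.
NAIVE = re.compile(r"https?://\S+")
# Hypothetical stricter variant: the scheme must not directly follow a word character.
STRICT = re.compile(r"(?<!\w)https?://\S+")

text = 'derphttp://example.com/ and (http://example.com/) and "http://example.com/"'

naive_hits = NAIVE.findall(text)
strict_hits = STRICT.findall(text)

print(naive_hits)   # also picks up the link glued to "derp"
print(strict_hits)  # skips that one, but the trailing ) and " are still attached
```

The lookbehind only deals with the minor issue; the wrapping characters still end up in the match and need separate handling.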
Lel. I had some free time and was bored, so I came up with this.
(?P<bracket>\(|)(?P<prefix>[^a-zA-Z\n]|[\(\)]|)(?P<protocol>[a-zA-Z0-9]+)://(?P<domain>[^/:\n\s]+)(?P<port>:[0-9]+|)(?:(?(bracket)(?P<url>/[^\)\n\s]+))|)(?P<end>(?(prefix)(?P=prefix)))
Expanded (with the x flag, i.e. verbose mode), this looks like:
(?P<bracket>\(|)
(?P<prefix>[^a-zA-Z\n]|[\(\)]|)
(?P<protocol>[a-zA-Z0-9]+)
://
(?P<domain>[^/:\n\s]+)
(?P<port>:[0-9]+|)
(?:
(?(bracket)
(?P<url>/[^\)\n\s]+)
)|
)
(?P<end>
(?(prefix)
(?P=prefix)
)
)
I tested it against the inputs below, and it matches everything, with two problems:
- The port group also contains the ":", and I'm not sure how to fix that
- This doesn't support URLs with usernames/passwords in them for basic auth
Inputs:
1 | http://google.com
2 | http://google.com:22
3 | http://google.com/some/url.html
4 | "http://google.com/lerp/merp"
5 | (http://google.com/herp/derp.html)
6 | ftp://ivy.gserv.me
7 | steam://stuff/stuff/more_stuff+search%20
8 | sftp://ivy.gserv.me
9 | http://google.com"
10 | 'http://af.aewrw/arew'
11 | #irc://irc.esper.net:1234/archives#
Results:
1 | {"bracket": "", "prefix": "", "protocol": "http", "domain": "google.com"}
2 | {"bracket": "", "prefix": "", "protocol": "http", "domain": "google.com", "port": ":22"}
3 | {"bracket": "", "prefix": "", "protocol": "http", "domain": "google.com", "port": "", "url": "/some/url.html"}
4 | {"bracket": "", "prefix": "\"", "protocol": "http", "domain": "google.com", "port": "", "url": "/lerp/merp", "end": "\""}
5 | {"bracket": "(", "prefix": "", "protocol": "http", "domain": "google.com", "port": "", "url": "/herp/derp.html"}
6 | {"bracket": "", "prefix": "", "protocol": "ftp", "domain": "ivy.gserv.me"}
7 | {"bracket": "", "prefix": "", "protocol": "steam", "domain": "stuff", "port": "", "url": "/stuff/more_stuff+search%20"}
8 | {"bracket": "", "prefix": "", "protocol": "sftp", "domain": "ivy.gserv.me"}
9 | {"bracket": "", "prefix": "", "protocol": "http", "domain": "google.com\""}
10 | {"bracket": "", "prefix": "\'", "protocol": "http", "domain": "af.aewrw", "port": "", "url": "/arew", "end": "\'"}
11 | {"bracket": "", "prefix": "#", "protocol": "irc", "domain": "irc.esper.net", "port": ":1234", "url": "/archives", "end": "#"}
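On the first problem (the port group containing the ":"), one possible fix is to move the colon out of the named group and into an optional non-capturing group around it. A rough sketch, with the prefix/bracket handling left out to keep it short; PORT_DEMO and parts are hypothetical names, not something from the plugin:

```python
import re

# Sketch only: protocol, domain, port and path parts, without the
# prefix/bracket logic. The ":" sits outside the named port group.
PORT_DEMO = re.compile(
    r"""
    (?P<protocol>[a-zA-Z0-9]+)
    ://
    (?P<domain>[^/:\n\s]+)
    (?::(?P<port>[0-9]+))?      # ":" is consumed but not captured
    (?P<url>/[^\)\n\s]+)?
    """,
    re.VERBOSE,
)


def parts(text):
    """Return the named groups of the first URL-like match, or None."""
    match = PORT_DEMO.search(text)
    return match.groupdict() if match else None


print(parts("http://google.com:22"))
print(parts("irc://irc.esper.net:1234/archives"))
```

With this shape the port group is None when no port is present, rather than an empty string.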
I personally use the following regex plus Python code:
import re

# These regexes are meant to work together. Alone they might lead to mismatches.
# The entire regex looks like this:
# (?:([\w\-_]+?)(?::\/\/))?([\w\-_\.]+)\.([a-z]{2,16})(?::(\+?[1-9]\d{1,4}))?(\/[^ ]+)?
RE_PROTOCOL = r"(?:([\w\-_]+?)(?::\/\/))?"  # (1) match the protocol, strip :// e.g. https or ssh
RE_DOMAIN = r"([\w\-_\.]+)"  # (2) match the domain, and any subdomains. e.g. www.google, google, or maps.google
RE_TLD = r"\.([a-z]{2,16})"  # (3) The top level domain, does not account for unicode. e.g. com, info, org
RE_PORT = r"(?::(\+?[1-9]\d{1,4}))?"  # (4) The port. e.g. google.com:100 extracts 100
RE_PATH = r"(\/[^ ]+)?"  # (5) Get everything after the top level domain / port starting with a slash. e.g. /hi.html
URL_REGEX = re.compile(RE_PROTOCOL + RE_DOMAIN + RE_TLD + RE_PORT + RE_PATH, re.I)

# in class context
def _parse(self):
    """Parse a URL, filling all variables with their respective values."""
    match = URL_REGEX.match(self.url)
    if not match:
        return False
    groups = match.groups()
    # Group 2 holds everything before the TLD, e.g. "maps.google".
    subdom = groups[1].split('.')
    # The last label is the domain; whatever precedes it are subdomains.
    dom = subdom[-1] if subdom else None
    if dom:
        subdom.pop()  # drop the last label (remove(dom) would drop the first equal one)
    self.protocol = groups[0]
    self.subdomains = subdom
    self.domain = dom
    self.tld = groups[2]
    self.port = groups[3]
    self.path = groups[4]
    return True  # mirror the False returned when nothing matched
It doesn't do exactly what you're after, but that's how I went about detecting URLs.
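For anyone who wants to try that approach outside a class, here is a standalone sketch using the same five sub-patterns. The parse_url helper and its dict layout are hypothetical, just a convenient way to see what each group captures.

```python
import re

# Same five sub-patterns as in the post above, joined into one expression.
URL_REGEX = re.compile(
    r"(?:([\w\-_]+?)(?::\/\/))?"     # (1) protocol
    r"([\w\-_\.]+)"                  # (2) domain and subdomains
    r"\.([a-z]{2,16})"               # (3) top-level domain
    r"(?::(\+?[1-9]\d{1,4}))?"       # (4) port
    r"(\/[^ ]+)?",                   # (5) path
    re.I,
)


def parse_url(url):
    """Hypothetical standalone helper: return a dict of URL parts, or None."""
    match = URL_REGEX.match(url)
    if not match:
        return None
    protocol, host, tld, port, path = match.groups()
    # The last label before the TLD is the domain; the rest are subdomains.
    *subdomains, domain = host.split(".")
    return {
        "protocol": protocol,
        "subdomains": subdomains,
        "domain": domain,
        "tld": tld,
        "port": port,
        "path": path,
    }


print(parse_url("https://maps.google.com:8080/hi.html"))
print(parse_url("google.com"))
print(parse_url("(http://google.com)"))
```

Because match() anchors at the start of the string, a wrapped link such as (http://google.com) yields None here, so this pattern parses a URL that has already been isolated rather than finding one inside a chat message.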
Simplified what I had a bit.
(?P<prefix>[^\w\s\n]|)(?P<protocol>[\w]+)://(?P<domain>[^/:\n\s]+)(?P<port>:[0-9]+|)(?P<url>[^\s\n]+|)
Expanded:
(?P<prefix>[^\w\s\n]|)
(?P<protocol>[\w]+)
://
(?P<domain>[^/:\n\s]+)
(?P<port>:[0-9]+|)
(?P<url>[^\s\n]+|)
This seems fine. In code, we can check whether the matched prefix is an opening bracket and, if so, strip the paired closing bracket from the end of the link; otherwise, we strip the prefix character itself from the end.
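A minimal sketch of that post-processing step, using the simplified pattern in verbose mode. The PAIRS table and the extract helper are assumptions for illustration; which brackets should count as wrappers is a judgment call.

```python
import re

# The simplified pattern from above, compiled in verbose mode.
URL_RE = re.compile(
    r"""
    (?P<prefix>[^\w\s\n]|)
    (?P<protocol>[\w]+)
    ://
    (?P<domain>[^/:\n\s]+)
    (?P<port>:[0-9]+|)
    (?P<url>[^\s\n]+|)
    """,
    re.VERBOSE,
)

# Assumption: opening characters whose closing partner is a different character.
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}


def extract(text):
    """Return the first link in text with its wrapper removed, or None."""
    match = URL_RE.search(text)
    if not match:
        return None
    prefix = match.group("prefix")
    link = match.group(0)[len(prefix):]   # drop the leading wrapper, if any
    closer = PAIRS.get(prefix, prefix)    # quotes, "#", etc. close with themselves
    if closer and link.endswith(closer):
        link = link[: -len(closer)]       # drop the trailing wrapper
    return link


print(extract("(http://example.com/)"))
print(extract('"http://example.com/"'))
print(extract("see #irc://irc.esper.net:1234/archives# now"))
```

Only one trailing character is removed, so a link that legitimately ends in a closing bracket keeps it when the whole thing is wrapped in brackets as well.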
The only problem now is username:password@.
This one matches basic auth.
(?P<prefix>[^\w\s\n]|)
(?P<protocol>[\w]+)
://
(?P<basic>[\w]+:[\w]+|)(?:@|)
(?P<domain>[^/:\n\s]+)
(?P<port>:[0-9]+|)
(?P<url>/[^\s\n]+|)
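One thing to watch with the pattern above: because both the credentials and the "@" are optional on their own, a host made only of word characters followed by a port can be mistaken for credentials. The sketch below compares it with a hypothetical tweak that only accepts credentials when the "@" follows them. The prefix group is left out for brevity, and LOOSE and TIED are made-up names.

```python
import re

# The basic-auth pattern from above (without the prefix group): the credentials
# and the "@" are each optional independently of the other.
LOOSE = re.compile(
    r"(?P<protocol>[\w]+)://(?P<basic>[\w]+:[\w]+|)(?:@|)"
    r"(?P<domain>[^/:\n\s]+)(?P<port>:[0-9]+|)(?P<url>/[^\s\n]+|)"
)
# Hypothetical tweak: credentials only count when they are followed by "@".
TIED = re.compile(
    r"(?P<protocol>[\w]+)://(?:(?P<basic>[\w]+:[\w]+)@)?"
    r"(?P<domain>[^/:\n\s]+)(?P<port>:[0-9]+|)(?P<url>/[^\s\n]+|)"
)

with_auth = TIED.search("ftp://user:secret@ivy.gserv.me/pub").groupdict()
no_auth_loose = LOOSE.search("http://localhost:8080/status").groupdict()
no_auth_tied = TIED.search("http://localhost:8080/status").groupdict()

print(with_auth)
print(no_auth_loose)  # "localhost:808" ends up in the basic group
print(no_auth_tied)   # domain and port are split as expected
```

With the tied version, the basic group is None when there are no credentials instead of an empty string.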
The only thing left to think about now is Unicode.
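As a starting point for the Unicode question, a small sketch of how \w behaves, assuming Python 3, where str patterns are Unicode-aware by default (on Python 2 the re.UNICODE flag would be needed for the same effect). The example host name is made up.

```python
import re

# In Python 3, \w in a str pattern matches Unicode word characters by default;
# re.ASCII restricts it to [a-zA-Z0-9_].
UNICODE_WORD = re.compile(r"\w+")
ASCII_WORD = re.compile(r"\w+", re.ASCII)

host = "bücher"
unicode_match = UNICODE_WORD.match(host).group(0)
ascii_match = ASCII_WORD.match(host).group(0)

# The domain group is a negated class, so it accepts non-ASCII characters either way.
domain_match = re.match(r"[^/:\n\s]+", "bücher.example/x").group(0)

print(unicode_match)  # whole word
print(ascii_match)    # stops at the first non-ASCII letter
print(domain_match)
```

So the groups built on \w (protocol, basic auth) are the ones whose behaviour depends on the flag, while the domain and path groups already pass non-ASCII text through.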