v.1.3
A Regular Expression for catching URLs and extracting fragments out of them.
RegEx = /\(?(?:(http|https|ftp):\/\/)?(?:((?:[^\W\s]|\.|-|[:]{1})+)@{1})?((?:www.)?(?:[^\W\s]|\.(?!\.)|-)+[\.](?!\.)[^\W\s]{2,4}|localhost(?=\/)|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?::(\d*))?([\/]?[^\s\?]*[\/]{1})*(?:\/?([^\s\n\?\[\]\{\}\#]*(?:(?=\.)){1}|[^\s\n\?\[\]\{\}\.\#]*)?([\.]{1}[^\s\?\#]*)?)?(?:\?{1}([^\s\n\#\[\]]*))?([\#][^\s\n]*)?\)?/;
Match this RegEx against a chunk of text to catch a URL inside it. It should work in any programming language whose Regular Expression engine supports lookahead assertions, which the pattern relies on.
See the JavaScript example of its use.
It captures 9 groups (plus the whole match, which contains the entire URL). If a component is missing from the URL, its group is returned empty.
- $& - Entire URL - the full URL that was matched
- $1 - Protocol - http, https, ftp
- $2 - Userinfo - username:password
- $3 - Domain - www.mydomain.com, mydomain.com, 127.0.0.1, localhost...
- $4 - Port - 80
- $5 - Path / Folders - /folder/dir/
- $6 - Page / Filename - e.g. index
- $7 - File extension - .html, .php...
- $8 - Query - item=value&item2=value2
- $9 - Anchor - #home
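
The group list above can be sketched in JavaScript roughly as follows. This is only an illustration, not the author's own example: the helper name `parseUrl`, the property names, and the sample text are assumptions made here, and only the pattern itself is copied from above. Groups that did not take part in the match are normalized to an empty string.

```javascript
// The pattern from above, copied verbatim.
const URL_REGEX = /\(?(?:(http|https|ftp):\/\/)?(?:((?:[^\W\s]|\.|-|[:]{1})+)@{1})?((?:www.)?(?:[^\W\s]|\.(?!\.)|-)+[\.](?!\.)[^\W\s]{2,4}|localhost(?=\/)|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(?::(\d*))?([\/]?[^\s\?]*[\/]{1})*(?:\/?([^\s\n\?\[\]\{\}\#]*(?:(?=\.)){1}|[^\s\n\?\[\]\{\}\.\#]*)?([\.]{1}[^\s\?\#]*)?)?(?:\?{1}([^\s\n\#\[\]]*))?([\#][^\s\n]*)?\)?/;

// Hypothetical helper (not part of the original): finds the first URL in
// `text` and returns its groups under descriptive names, or null if no URL
// is found.
function parseUrl(text) {
  const m = URL_REGEX.exec(text);
  if (m === null) return null;

  // Groups that did not participate in the match are undefined in
  // JavaScript; turn them into empty strings.
  const group = (i) => (m[i] === undefined ? "" : m[i]);

  return {
    url: m[0],          // $& - entire URL
    protocol: group(1), // $1
    userinfo: group(2), // $2
    domain: group(3),   // $3
    port: group(4),     // $4
    path: group(5),     // $5
    page: group(6),     // $6
    extension: group(7),// $7 (includes the leading dot)
    query: group(8),    // $8 (without the leading "?")
    anchor: group(9),   // $9 (includes the leading "#")
  };
}

const parts = parseUrl(
  "Visit http://www.mydomain.com:80/folder/dir/index.html?item=value&item2=value2#home today"
);
console.log(parts);
```

For the sample text, `parts.domain` should come back as `www.mydomain.com`, `parts.path` as `/folder/dir/`, and `parts.userinfo` as an empty string, since the URL carries no username or password. The literal has no `g` flag, so only the first URL in the text is returned.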