Valid URLs failing validation - query and fragment parts
danherbriley opened this issue · 2 comments
Hi!
I have encountered a few weird urls that are not passing validation even though they are functional.
The issue is with the validation of the query and fragment parts of the URL.
Furthermore, the query and fragment validations in url.py
do not conform to their rather loose definitions in RFC3986.
Is that intended or not? If not, I'll open a PR to fix it.
Examples:
Hello,
i just was in the mid of creating a bug report as well (thanks for raising this 👍) for previously valid URLs which are now reported as invalid after the update from 0.20.0 to 0.21.2 (still seen also after the update to 0.22.0 to include the fix from #289 for #288).
These are the cases we're seeing as reported to be invalid while they are actually (to the best of my knowledge, haven't checked all of them) valid:
https://launchpad.support.sap.com/#/notes/2718993
https://forums.livezilla.net/index.php?/topic/10983-fg-vd-19-086-livezilla-server-is-vulnerable-to-sql-injection-ii/
http://www.brocade.com/en/backend-content/pdf-page.html?/content/dam/common/documents/content-types/security-bulletin/brocade-security-advisory-2016-168.pdf
https://groups.google.com/forum/?fromgroups#!topic/rubyonrails-security/8SA-M3as7A8
https://forum.bitdefender.com/index.php?/topic/75470-doubleagent/
https://www.smartftp.com/forums/index.php?/topic/16425-smartftp-client-40-change-log/
https://exchange.xforce.ibmcloud.com/#/vulnerabilities/100912
https://support.microsoft.com/en-us/lifecycle#gp/Microsoft-Internet-Explorer
https://www.watchguard.com/support/release-notes/fireware/12/en-US/EN_ReleaseNotes_Fireware_12_5_9/index.html#Fireware/en-US/resolved_issues.html
https://code.wireshark.org/review/#/c/25660/
https://groups.google.com/forum/#!topic/kubernetes-announce/yBrFf5nmvfI
https://issues.sonatype.org/plugins/servlet/mobile#issue/NEXUS-16870
http://forums.livezilla.net/index.php?/topic/163-livezilla-changelog/
https://www.watchguard.com/support/release-notes/fireware/11/en-US/EN_ReleaseNotes_Fireware_11_12_1/index.html#Fireware/en-US/resolved_issues.html%3FTocPath%3D_____13
https://review.typo3.org/#/c/37013
https://forums.malwarebytes.org/index.php?/topic/158251-malwarebytes-anti-exploit-hall-of-fame/
http://speedtouch.sourceforge.io/index.php?/news.en.html
https://support.k7computing.com/index.php?/Knowledgebase/Article/View/173/41/advisory-issued-on-6th-november-2017
http://support.novell.com/cgi-bin/search/searchtid.cgi?/10077872.htm
Here's the patch:
�diff --git a/src/validators/url.py b/src/validators/url.py
index 16698b1..16259e7 100644
--- a/src/validators/url.py
+++ b/src/validators/url.py
@@ -3,7 +3,7 @@
# standard
from functools import lru_cache
import re
-from urllib.parse import unquote, urlsplit
+from urllib.parse import parse_qs, unquote, urlsplit
# local
from .hostname import hostname
@@ -34,11 +34,6 @@ def _path_regex():
)
-@lru_cache
-def _query_regex():
- return re.compile(r"&?(\w+=?[^\s&]*)", re.IGNORECASE)
-
-
def _validate_scheme(value: str):
"""Validate scheme."""
# More schemes will be considered later.
@@ -108,16 +103,23 @@ def _validate_netloc(
) and _validate_auth_segment(basic_auth)
-def _validate_optionals(path: str, query: str, fragment: str):
+def _validate_optionals(
+ path: str,
+ query: str,
+ fragment: str,
+ strict_query: bool = False
+):
"""Validate path query and fragments."""
optional_segments = True
if path:
optional_segments &= bool(_path_regex().match(path))
- if query:
- optional_segments &= bool(_query_regex().match(query))
+ if query and parse_qs(query, strict_parsing=strict_query):
+ optional_segments &= True
if fragment:
fragment = fragment.lstrip("/") if fragment.startswith("/") else fragment
- optional_segments &= all(char_to_avoid not in fragment for char_to_avoid in ("/", "?"))
+ optional_segments &= all(
+ char_to_avoid not in fragment for char_to_avoid in ("?",)
+ )
return optional_segments
@@ -130,6 +132,7 @@ def url(
skip_ipv4_addr: bool = False,
may_have_port: bool = True,
simple_host: bool = False,
+ strict_query: bool = True,
rfc_1034: bool = False,
rfc_2782: bool = False,
):
@@ -167,6 +170,8 @@ def url(
URL string may contain port number.
simple_host:
URL string maybe only hyphens and alpha-numerals.
+ strict_query:
+ Fail validation on query string parsing error.
rfc_1034:
Allow trailing dot in domain/host name.
Ref: [RFC 1034](https://www.rfc-editor.org/rfc/rfc1034).
@@ -214,5 +219,5 @@ def url(
rfc_1034,
rfc_2782,
)
- and _validate_optionals(path, query, fragment)
+ and _validate_optionals(path, query, fragment, strict_query)
)
After applying, you'll have to pass strict_query = False
, to get those URLs validated.
PR is welcome.