python-validators/validators

Valid URLs failing validation - query and fragment parts

danherbriley opened this issue · 2 comments

Hello,

i just was in the mid of creating a bug report as well (thanks for raising this 👍) for previously valid URLs which are now reported as invalid after the update from 0.20.0 to 0.21.2 (still seen also after the update to 0.22.0 to include the fix from #289 for #288).

These are the cases we're seeing as reported to be invalid while they are actually (to the best of my knowledge, haven't checked all of them) valid:

https://launchpad.support.sap.com/#/notes/2718993
https://forums.livezilla.net/index.php?/topic/10983-fg-vd-19-086-livezilla-server-is-vulnerable-to-sql-injection-ii/
http://www.brocade.com/en/backend-content/pdf-page.html?/content/dam/common/documents/content-types/security-bulletin/brocade-security-advisory-2016-168.pdf
https://groups.google.com/forum/?fromgroups#!topic/rubyonrails-security/8SA-M3as7A8
https://forum.bitdefender.com/index.php?/topic/75470-doubleagent/
https://www.smartftp.com/forums/index.php?/topic/16425-smartftp-client-40-change-log/
https://exchange.xforce.ibmcloud.com/#/vulnerabilities/100912
https://support.microsoft.com/en-us/lifecycle#gp/Microsoft-Internet-Explorer
https://www.watchguard.com/support/release-notes/fireware/12/en-US/EN_ReleaseNotes_Fireware_12_5_9/index.html#Fireware/en-US/resolved_issues.html
https://code.wireshark.org/review/#/c/25660/
https://groups.google.com/forum/#!topic/kubernetes-announce/yBrFf5nmvfI
https://issues.sonatype.org/plugins/servlet/mobile#issue/NEXUS-16870
http://forums.livezilla.net/index.php?/topic/163-livezilla-changelog/
https://www.watchguard.com/support/release-notes/fireware/11/en-US/EN_ReleaseNotes_Fireware_11_12_1/index.html#Fireware/en-US/resolved_issues.html%3FTocPath%3D_____13
https://review.typo3.org/#/c/37013
https://forums.malwarebytes.org/index.php?/topic/158251-malwarebytes-anti-exploit-hall-of-fame/
http://speedtouch.sourceforge.io/index.php?/news.en.html
https://support.k7computing.com/index.php?/Knowledgebase/Article/View/173/41/advisory-issued-on-6th-november-2017
http://support.novell.com/cgi-bin/search/searchtid.cgi?/10077872.htm

Here's the patch:

�diff --git a/src/validators/url.py b/src/validators/url.py
index 16698b1..16259e7 100644
--- a/src/validators/url.py
+++ b/src/validators/url.py
@@ -3,7 +3,7 @@
 # standard
 from functools import lru_cache
 import re
-from urllib.parse import unquote, urlsplit
+from urllib.parse import parse_qs, unquote, urlsplit
 
 # local
 from .hostname import hostname
@@ -34,11 +34,6 @@ def _path_regex():
     )
 
 
-@lru_cache
-def _query_regex():
-    return re.compile(r"&?(\w+=?[^\s&]*)", re.IGNORECASE)
-
-
 def _validate_scheme(value: str):
     """Validate scheme."""
     # More schemes will be considered later.
@@ -108,16 +103,23 @@ def _validate_netloc(
     ) and _validate_auth_segment(basic_auth)
 
 
-def _validate_optionals(path: str, query: str, fragment: str):
+def _validate_optionals(
+    path: str,
+    query: str,
+    fragment: str,
+    strict_query: bool = False
+):
     """Validate path query and fragments."""
     optional_segments = True
     if path:
         optional_segments &= bool(_path_regex().match(path))
-    if query:
-        optional_segments &= bool(_query_regex().match(query))
+    if query and parse_qs(query, strict_parsing=strict_query):
+        optional_segments &= True
     if fragment:
         fragment = fragment.lstrip("/") if fragment.startswith("/") else fragment
-        optional_segments &= all(char_to_avoid not in fragment for char_to_avoid in ("/", "?"))
+        optional_segments &= all(
+            char_to_avoid not in fragment for char_to_avoid in ("?",)
+        )
     return optional_segments
 
 
@@ -130,6 +132,7 @@ def url(
     skip_ipv4_addr: bool = False,
     may_have_port: bool = True,
     simple_host: bool = False,
+    strict_query: bool = True,
     rfc_1034: bool = False,
     rfc_2782: bool = False,
 ):
@@ -167,6 +170,8 @@ def url(
             URL string may contain port number.
         simple_host:
             URL string maybe only hyphens and alpha-numerals.
+        strict_query:
+            Fail validation on query string parsing error.
         rfc_1034:
             Allow trailing dot in domain/host name.
             Ref: [RFC 1034](https://www.rfc-editor.org/rfc/rfc1034).
@@ -214,5 +219,5 @@ def url(
             rfc_1034,
             rfc_2782,
         )
-        and _validate_optionals(path, query, fragment)
+        and _validate_optionals(path, query, fragment, strict_query)
     )

After applying, you'll have to pass strict_query = False, to get those URLs validated.

PR is welcome.