google/robotstxt

Special characters * and $ not matched in URI

sebastian-nagel opened this issue · 0 comments

Section 2.2.3 (Special Characters) contains two examples of path matching for paths that contain the special characters * and $. In both examples the two characters are percent-encoded in the allow/disallow rule but left unencoded in the URL/URI to be matched. It looks like the robots.txt parser and matcher do not follow these examples from the RFC: they fail to match the percent-encoded characters in the rule against the unencoded ones in the URI. See the unit test below.

* and $ are among the reserved characters in URIs (RFC 3986, section 2.2) and therefore cannot be percent-encoded without potentially changing the semantics of the URI.
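
To make the mismatch concrete, here is a small standalone snippet (independent of the library) showing that the percent-encoded rule and the unencoded path differ at the byte level; a purely literal comparison of the two strings therefore cannot connect them, even though the RFC examples treat them as equivalent:

// Standalone illustration, not using the library: the encoded rule and the
// unencoded path differ exactly where the percent-encoding appears.
#include <cassert>
#include <string>

int main() {
  const std::string rule = "/path/file-with-a-%2A.html";  // as written in robots.txt
  const std::string path = "/path/file-with-a-*.html";    // as it appears in the URI
  assert(rule != path);  // '%' vs '*' at offset 18
  return 0;
}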

diff --git a/robots_test.cc b/robots_test.cc
index 35853de..3a37813 100644
--- a/robots_test.cc
+++ b/robots_test.cc
@@ -492,6 +492,19 @@ TEST(RobotsUnittest, ID_SpecialCharacters) {
     EXPECT_FALSE(
         IsUserAgentAllowed(robotstxt, "FooBot", "http://foo.bar/foo/quz"));
   }
+  {
+    const absl::string_view robotstxt =
+        "User-agent: FooBot\n"
+        "Disallow: /path/file-with-a-%2A.html\n"
+        "Disallow: /path/foo-%24\n"
+        "Allow: /\n";
+    EXPECT_FALSE(
+        IsUserAgentAllowed(robotstxt, "FooBot",
+                           "https://www.example.com/path/file-with-a-*.html"));
+    EXPECT_FALSE(
+        IsUserAgentAllowed(robotstxt, "FooBot",
+                           "https://www.example.com/path/foo-$"));
+  }
 }
 
 // Google-specific: "index.html" (and only that) at the end of a pattern is
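
For comparison, the normalization that the RFC examples seem to imply can be approximated on the caller side by percent-encoding the literal * and $ characters of the target URL before matching. The helper below is only a rough sketch under that assumption, not a proposed fix and not part of the library; the name EncodeSpecialChars is made up for illustration.

// Caller-side sketch only (hypothetical, not part of google/robotstxt):
// percent-encode literal '*' and '$' in the target URL so that rules written
// with %2A / %24, as in the RFC examples, compare equal byte for byte.
#include <cassert>
#include <string>

std::string EncodeSpecialChars(const std::string& url) {
  std::string encoded;
  encoded.reserve(url.size());
  for (char c : url) {
    if (c == '*') {
      encoded += "%2A";
    } else if (c == '$') {
      encoded += "%24";
    } else {
      encoded.push_back(c);
    }
  }
  return encoded;
}

int main() {
  // "/path/foo-$" becomes "/path/foo-%24" and now matches the
  // "Disallow: /path/foo-%24" rule from the test above literally.
  assert(EncodeSpecialChars("https://www.example.com/path/foo-$") ==
         "https://www.example.com/path/foo-%24");
  return 0;
}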