Special characters * and $ not matched in URI
sebastian-nagel opened this issue
Section 2.2.3 Special Characters contains two examples of path matching for paths containing the special characters * and $. In both examples the character is percent-encoded in the allow/disallow rule but left unencoded in the URL/URI to be matched. It looks like the robots.txt parser and matcher do not follow these examples in the RFC and fail to match the percent-encoded characters in the rule against the unencoded ones in the URI. See the reproduction sketch and the unit test below.

* and $ are among the reserved characters in URIs (RFC 3986, Section 2.2) and therefore cannot be percent-encoded without potentially changing the semantics of the URI.
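For reference, a minimal standalone reproduction outside the test suite might look roughly like the sketch below. It assumes the RobotsMatcher usage shown in the repository README (robots.h, googlebot::RobotsMatcher::OneAgentAllowedByRobots); per the RFC examples both calls should print false:

// Hypothetical standalone reproduction; assumes the RobotsMatcher usage
// documented in the repository README (robots.h, namespace googlebot).
#include <iostream>
#include <string>

#include "robots.h"

int main() {
  const std::string robotstxt =
      "User-agent: FooBot\n"
      "Disallow: /path/file-with-a-%2A.html\n"
      "Disallow: /path/foo-%24\n"
      "Allow: /\n";

  googlebot::RobotsMatcher matcher;
  // Both URLs fall under the Disallow rules in the RFC examples, so both
  // calls should return false (disallowed).
  std::cout << std::boolalpha
            << matcher.OneAgentAllowedByRobots(
                   robotstxt, "FooBot",
                   "https://www.example.com/path/file-with-a-*.html")
            << "\n"
            << matcher.OneAgentAllowedByRobots(
                   robotstxt, "FooBot", "https://www.example.com/path/foo-$")
            << "\n";
}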
diff --git a/robots_test.cc b/robots_test.cc
index 35853de..3a37813 100644
--- a/robots_test.cc
+++ b/robots_test.cc
@@ -492,6 +492,19 @@ TEST(RobotsUnittest, ID_SpecialCharacters) {
     EXPECT_FALSE(
         IsUserAgentAllowed(robotstxt, "FooBot", "http://foo.bar/foo/quz"));
   }
+  {
+    const absl::string_view robotstxt =
+        "User-agent: FooBot\n"
+        "Disallow: /path/file-with-a-%2A.html\n"
+        "Disallow: /path/foo-%24\n"
+        "Allow: /\n";
+    EXPECT_FALSE(
+        IsUserAgentAllowed(robotstxt, "FooBot",
+                           "https://www.example.com/path/file-with-a-*.html"));
+    EXPECT_FALSE(
+        IsUserAgentAllowed(robotstxt, "FooBot",
+                           "https://www.example.com/path/foo-$"));
+  }
 }
 
 // Google-specific: "index.html" (and only that) at the end of a pattern is
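For illustration only, here is a minimal sketch (not the library's implementation) of the matching semantics the RFC examples describe: an unescaped * stays a wildcard, a trailing unescaped $ anchors the end of the path, and %2A / %24 in the rule are compared as the literal characters * and $. The function name RuleMatches is hypothetical.

// Hypothetical sketch, not the library's matcher: robots.txt-style path
// matching where an unescaped '*' is a wildcard, a trailing unescaped '$'
// anchors the end of the path, and the escapes %2A / %24 in the rule are
// compared as the literal characters '*' and '$'.
#include <cassert>
#include <string>

bool RuleMatches(const std::string& pattern, const std::string& path,
                 size_t pi = 0, size_t si = 0) {
  while (pi < pattern.size()) {
    // A trailing unescaped '$' requires the path to end here.
    if (pattern[pi] == '$' && pi + 1 == pattern.size()) {
      return si == path.size();
    }
    if (pattern[pi] == '*') {
      // Wildcard: try every possible length for the matched span.
      for (size_t k = si; k <= path.size(); ++k) {
        if (RuleMatches(pattern, path, pi + 1, k)) return true;
      }
      return false;
    }
    // Literal character; %2A and %24 in the rule stand for '*' and '$'.
    char literal = pattern[pi];
    size_t consumed = 1;
    if (pattern[pi] == '%' && pi + 2 < pattern.size()) {
      const std::string hex = pattern.substr(pi + 1, 2);
      if (hex == "2A" || hex == "2a") { literal = '*'; consumed = 3; }
      else if (hex == "24") { literal = '$'; consumed = 3; }
    }
    if (si >= path.size() || path[si] != literal) return false;
    pi += consumed;
    ++si;
  }
  return true;  // Rules are prefix matches: an exhausted pattern matches.
}

int main() {
  // The two RFC examples from the report above.
  assert(RuleMatches("/path/file-with-a-%2A.html",
                     "/path/file-with-a-*.html"));
  assert(RuleMatches("/path/foo-%24", "/path/foo-$"));
  // Unescaped '*' and a trailing unescaped '$' keep their special meaning.
  assert(RuleMatches("/foo/*/bar", "/foo/x/y/bar"));
  assert(!RuleMatches("/foo/quz$", "/foo/quzz"));
}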