Handling URLs that end with *
anjackson opened this issue · 2 comments
anjackson commented
In a wide crawl, we appear to be hitting URLs that end with *, which leads to queries to OutbackCDX that look like:
/dc?limit=1&sort=reverse&url=https%3A%2F%2Fhips.hearstapps.com%2Ftoc.h-cdn.co%2Fassets%2F16%2F46%2F3200x1600%2Flandscape-1479498518-cindy-crawford-rande-gerber-house.jpg%3Fresize%3D1200%3A*
The * on the end forces the matchType to be PREFIX. This is true even if you specify a matchType parameter, and even if the * is encoded as %2A.
For now, I'll work around it, but I'd like to know how best to handle this situation in the future.
Thanks!
anjackson commented
👍
ato commented
Oops. Looks like that's a bit of a gotcha in the design of the CDX server API.
I've implemented the solution you alluded to. Specifying matchType=exact will now stop wildcards from being expanded.
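For anyone building these lookups from a client, here is a minimal sketch of a query that passes matchType=exact explicitly, so that a URL ending in * is looked up literally rather than as a prefix. The helper name `build_cdx_query` and the example URL are made up for illustration. The collection name `dc` and the `limit`, `sort` and `url` parameters are taken from the query above.

```python
from urllib.parse import urlencode


def build_cdx_query(collection: str, url: str, limit: int = 1) -> str:
    """Build a CDX query path that asks for an exact match on `url`.

    Hypothetical helper, not part of OutbackCDX. The parameter names
    mirror the query shown earlier in this issue.
    """
    params = {
        "limit": limit,
        "sort": "reverse",
        # Explicit exact match, so a trailing '*' is not expanded as a wildcard.
        "matchType": "exact",
        "url": url,
    }
    # urlencode percent-encodes reserved characters, including '*' as %2A.
    return f"/{collection}?{urlencode(params)}"


if __name__ == "__main__":
    # Example URL is an assumption; any URL ending in '*' behaves the same way.
    print(build_cdx_query("dc", "https://example.org/image.jpg?resize=1200:*"))
```

Note that percent-encoding the * alone was not enough before the fix, as described above. It is the matchType=exact parameter that makes the difference.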