nla/outbackcdx

Handling URLs that end with *

anjackson opened this issue · 2 comments

In a wide crawl, we appear to be hitting URLs that end with *, which leads to queries to OutbackCDX that look like:

/dc?limit=1&sort=reverse&url=https%3A%2F%2Fhips.hearstapps.com%2Ftoc.h-cdn.co%2Fassets%2F16%2F46%2F3200x1600%2Flandscape-1479498518-cindy-crawford-rande-gerber-house.jpg%3Fresize%3D1200%3A*

The trailing * forces the matchType to be PREFIX. This happens even if you specify a matchType parameter, and even if the * is encoded as %2A.
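A likely reason the %2A encoding makes no difference, sketched here in Python as an assumption about how query strings are usually parsed rather than a description of OutbackCDX's own code: the query string is percent-decoded before the application inspects the value, so both forms arrive as the same trailing *. The example.com URL is a placeholder.

```python
from urllib.parse import parse_qs

# Two query strings that differ only in how the trailing * is written.
literal = "url=http%3A%2F%2Fexample.com%2Fimg.jpg%3Fresize%3D1200%3A*"
encoded = "url=http%3A%2F%2Fexample.com%2Fimg.jpg%3Fresize%3D1200%3A%2A"

# parse_qs percent-decodes each value, so the distinction is lost.
literal_value = parse_qs(literal)["url"][0]
encoded_value = parse_qs(encoded)["url"][0]

print(literal_value == encoded_value)
print(literal_value.endswith("*"))
```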

For now, I'll work around it, but I'd like to know how best to handle this situation in the future.

Thanks!


ato commented

Oops. Looks like that's a bit of a gotcha in the design of the CDX server API.

I've implemented the solution you alluded to. Specifying matchType=exact will now stop wildcards from being expanded.
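For anyone building these queries from a client, a minimal Python sketch of a lookup that always sends matchType=exact might look like the following. The helper name build_exact_query, the localhost base address, and the example.com URL are all hypothetical; only the url, limit, sort and matchType parameters and the /dc path come from this thread.

```python
from urllib.parse import urlencode

def build_exact_query(base, url, limit=1):
    """Build a CDX query URL that asks for an exact match on `url`."""
    params = {
        "url": url,            # may legitimately end in *
        "matchType": "exact",  # asks the server not to expand wildcards
        "limit": limit,
        "sort": "reverse",
    }
    # urlencode percent-encodes the values, including the trailing *.
    return f"{base}?{urlencode(params)}"

# Hypothetical server address and crawled URL, for illustration only.
query = build_exact_query(
    "http://localhost:8080/dc",
    "http://example.com/img.jpg?resize=1200:*",
)
print(query)
```

This only constructs the request URL; it relies on the server honouring matchType=exact as described above.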