PuerkitoBio/gocrawl

redirects don't have a SourceURL value

jhorwit2 opened this issue · 3 comments

I ran into a scenario where http://foobar.com does a 301 to https://foobar.com. The default flags normalize to http which is fine for urls that are the same content (not redirects), so I didn't want to turn off that normalization flag. Instead I was going to check the scheme/path/host of the source url to the current url in Filter. The issue is https://foobar.com does not have a source url of http://foobar.com which to me seems like it should.

is there another preferred way to do what I'm checking that I'm not thinking of?

mna commented

Hello Josh,

I'm not sure I fully understand what part the normalization plays in this, so I can't really answer your last question, but I think the issue you raise makes sense - a redirected URL should have its source URL set to the original one.

That's a bit hard to do though, because of how convoluted the whole design of this package is (blame old me :). There are 2 ways to send new URLs in the crawler - via Worker.enqueue and Worker.push channels. URLs found while processing a web page are sent via the push channel, and in this case have their source URL correctly set. A redirect does not process the original web page, so in this case the redirect-to URL is sent via the enqueue channel. This one is "general purpose" and URLs sent through this one don't have a source URL.

The simplest fix that might work would be to send an *URLContext in the enqueue channel for a redirection, with the source URL already set, and adapt Crawler.toURLContext accordingly (to accept a raw parameter that is already an *URLContext, in which case it just returns it as-is). However, note that there is only a single source URL for a given URL, so if there are many redirections, the redirection N would have N-1 as parent, which would not be your original URL anymore if N > 1.

If you want to try that out and send a PR if it works and solves your issue, I'd be happy to merge that in and help out along the way if you have questions.

Thanks,
Martin

I also have an issue with the source URL not set for redirection.

@jhorwit2 Are you interested in creating a PR for this?

If not, is it possible for @PuerkitoBio to fix it?

I can technically submit a patch, but when I do that, our lawyer usually asks me to add my company name to your LICENSE file. Is that OK for you? If it's OK, I'll send you a PR.

Takashi

mna commented

Hi Takashi,

Yeah I don't mind either way (PR from Josh or you with the company added to license file) as long as the license stays the same (BSD 3-clause). I may find some time to look at it eventually, but I don't want to make any promise I can't keep. I think the approach I mentioned above would work without too much added complexity, though note the caveat (single source URL, not the whole history).

Another option of course is for you to provide your own http client, overriding the package's HttpClient variable, with a custom CheckRedirect behaviour (that function receives the whole history of redirected requests).

That being said it would still make sense to fix the source URL for the redirections. As mentioned to Josh previously, if you go ahead with a PR, I'll be happy to help out if you have any questions.

Thanks,
Martin