michaelrsweet/htmldoc

why does link mapping add the http server port?

Closed this issue · 2 comments

Hello again and again,

In the same debug report as in #507, this line

DEBUG: Mapping "../source/" to "http://distro.ibiblio.org:80/fatdog/web/../source/"...

makes me suspect that when htmldoc maps URLs, it replaces the input document's base URL (http://distro.ibiblio.org/fatdog/web/, derived from http://distro.ibiblio.org/fatdog/web/index.html in this case) with the URL it gets from the HTTP protocol response.
While the resulting mapped link still works, I think it would be prudent not to use the base URL from the HTTP protocol response, because 1) it includes the port number, which could be temporary, and 2) it could be the target of a temporary redirect. In both cases the mapped links in the output document would break should the server port or the temporary redirect change in the future.
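
To make the concern concrete, here is a minimal Python sketch of the kind of mapping being asked for. It is not htmldoc's actual code; the helper names `strip_default_port` and `map_link` are hypothetical, and the sketch assumes that only the scheme's default port (80 for http, 443 for https) should be dropped.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Default ports per scheme; an explicit port equal to the default is redundant.
DEFAULT_PORTS = {"http": 80, "https": 443}


def strip_default_port(url: str) -> str:
    """Return url without an explicit port when it is the scheme's default."""
    parts = urlsplit(url)
    if parts.port is not None and parts.port == DEFAULT_PORTS.get(parts.scheme):
        # A port is present, so the netloc ends in ":<port>"; cut it off.
        parts = parts._replace(netloc=parts.netloc.rsplit(":", 1)[0])
    return urlunsplit(parts)


def map_link(base_url: str, href: str) -> str:
    """Resolve href against base_url (hypothetical helper, not htmldoc's API)."""
    # urljoin also collapses "../" segments in the resolved path.
    return strip_default_port(urljoin(base_url, href))


print(map_link("http://distro.ibiblio.org:80/fatdog/web/index.html", "../source/"))
```

With this approach the mapped link carries neither the redundant ":80" nor the unresolved "web/../" segment, while a non-default port such as ":8080" is left untouched.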

What do you think?

So from the standpoint of "spidering" a web site, HTMLDOC is just taking a snapshot and I would expect the base URL to be stable for the duration of the run. Longer term it is possible for URLs to start breaking, but there really isn't anything HTMLDOC can do about that... Any links that end up resolving "locally" (to HTML files you include in the document) will be mapped to local links and not the base URL. Similarly, images are embedded or copied.

Thank you for your reply.

> Any links that end up resolving "locally" (to HTML files you include in the document) will be mapped to local links and not the base URL. Similarly, images are embedded or copied.

Hmm, I was talking about links that are external to the document, that is, links under the base URL for which I don't include a local file. Say, for example, that I want to create a PDF file of an online file index or FAQ index.