becheran/mlc

Allow specifying HTTP request parameters

Opened this issue · 10 comments

Is your feature request related to a problem? Please describe.
Some URLs require specific HTTP request parameters.
One example is the GitHub docs pages; for instance, this .md file will fail:

$ cat mdtest.md 
= Test =

[Github docs link](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/licensing-a-repository)

$ mlc

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+                                                          +
+            markup link checker - mlc v0.15.2             +
+                                                          +
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

[Err ] ./mdtest.md (3, 1) => https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/licensing-a-repository - 403 - Forbidden

Result (1 links):

OK       0
Skipped  0
Warnings 0
Errors   1


The following links could not be resolved:

./mdtest.md (3, 1) => https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/licensing-a-repository.

The reason is that the page requires specific HTTP headers:
community/community#14773

Describe the solution you'd like
It would be nice to have a way to specify HTTP request parameters such as headers, possibly per URL.

I like this idea. I just don't know how exactly one would pass all the possible header fields to mlc. Via a command-line argument?

Probably the best option would be a config file, otherwise it would be impractical to specify different headers for different URLs.

See for example:
https://github.com/orgs/github-community/discussions/14773#discussioncomment-2679987
https://github.com/tcort/markdown-link-check#config-file-format
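
To make the config-file idea concrete, here is a minimal sketch in Python. The JSON layout is borrowed from the `httpHeaders` section of markdown-link-check, where headers apply to every link that starts with one of the listed URLs. It is an assumption for illustration only: mlc has no such option in this thread, and `headers_for` is a hypothetical helper.

```python
import json

# Hypothetical config, modelled on markdown-link-check's "httpHeaders" section.
CONFIG_TEXT = """
{
  "httpHeaders": [
    {
      "urls": ["https://docs.github.com/"],
      "headers": {"Accept-Encoding": "zstd, br, gzip, deflate"}
    }
  ]
}
"""


def headers_for(url, config):
    """Collect the headers of every entry with a URL prefix matching `url`."""
    headers = {}
    for entry in config.get("httpHeaders", []):
        if any(url.startswith(prefix) for prefix in entry["urls"]):
            headers.update(entry["headers"])
    return headers


config = json.loads(CONFIG_TEXT)
print(headers_for("https://docs.github.com/en/actions", config))
print(headers_for("https://crates.io/crates/mlc", config))
```

Prefix matching keeps the file easy to read; links that match no entry simply get no extra headers.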

I think your pipeline has been hit by this bug:
https://github.com/becheran/mlc/actions/runs/3559864946/jobs/5979511630

[Err ] ./README.md (62, 22) => https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions - 403 - Forbidden
Error: https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions. 403 - Forbidden

@diegorondini fun fact: It does not fail when I run it locally. Does GitHub somehow prevent requests to GitHub.com from their own runners? You mentioned missing request parameters. What would those be in this case?

@becheran I think the first question is why the pipeline checks that link even though there's no such link in the README.md:

$ grep 'docs\.github' README.md

Returning to this bug, docs.github.com requires the Accept-Encoding: zstd, br, gzip, deflate header:

$ curl -i -X GET https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions
HTTP/2 403 
x-azure-ref: 0wn2EYwAAAACr4P2HgpUzTatC1/nj5XnyTU5aMjIxMDYwNjEzMDIxADU5NmQ3OGEyLWNhNWYtNDc5ZC1iY2RjLTA4MzU4MzMxNzRiMg==
accept-ranges: bytes
via: 1.1 varnish, 1.1 varnish
date: Mon, 28 Nov 2022 09:22:10 GMT
x-served-by: cache-iad-kiad7000135-IAD, cache-mrs10563-MRS
x-cache: MISS, MISS
x-cache-hits: 0, 0
x-timer: S1669627330.213655,VS0,VE92
strict-transport-security: max-age=31557600

$ curl -i -H "Accept-Encoding: zstd, br, gzip, deflate" -X GET https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions
HTTP/2 200 
cache-control: public, max-age=60
content-type: text/html; charset=utf-8
access-control-allow-origin: *
content-security-policy: default-src 'none';prefetch-src 'self';connect-src 'self';font-src 'self' data: githubdocs.azureedge.net;img-src 'self' github.com *.github.com *.githubusercontent.com *.githubassets.com data: githubdocs.azureedge.net placehold.it;object-src 'self';script-src 'self' data: githubdocs.azureedge.net;frame-src 'self' github.com *.github.com *.githubusercontent.com *.githubassets.com https://www.youtube-nocookie.com;frame-ancestors 'self' github.com *.github.com *.githubusercontent.com *.githubassets.com;style-src 'self' 'unsafe-inline' data: githubdocs.azureedge.net;child-src 'self';upgrade-insecure-requests;base-uri 'self';form-action 'self';script-src-attr 'none'
cross-origin-opener-policy: same-origin
cross-origin-resource-policy: same-origin
x-dns-prefetch-control: off
x-frame-options: SAMEORIGIN
x-download-options: noopen
x-content-type-options: nosniff
origin-agent-cluster: ?1
x-permitted-cross-domain-policies: none
referrer-policy: strict-origin-when-cross-origin
x-xss-protection: 0
x-powered-by: Next.js
x-azure-ref: 0hXyEYwAAAADMF8jkAx/XToTRxIg5u1m/UEhMMzBFREdFMDMxOQA1OTZkNzhhMi1jYTVmLTQ3OWQtYmNkYy0wODM1ODMzMTc0YjI=
content-encoding: br
via: 1.1 varnish, 1.1 varnish
accept-ranges: bytes
date: Mon, 28 Nov 2022 09:22:29 GMT
age: 335
x-served-by: cache-iad-kiad7000135-IAD, cache-mrs10583-MRS
x-cache: CONFIG_NOCACHE, HIT, HIT
x-cache-hits: 3, 1
x-timer: S1669627349.305248,VS0,VE1
vary: Accept-Encoding
strict-transport-security: max-age=31557600
content-length: 38324

Warning: Binary output can mess up your terminal. Use "--output -" to tell 
Warning: curl to output it to your terminal anyway, or consider "--output 
Warning: <FILE>" to save to a file.
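
For comparison, the second curl call can be approximated programmatically. The following is a rough Python sketch using only the standard library; it prepares the request without sending it, so it says nothing about how the server responds today. `build_request` is a hypothetical helper name, not part of mlc.

```python
import urllib.request

URL = "https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions"


def build_request(url, accept_encoding="zstd, br, gzip, deflate"):
    """Prepare (but do not send) a GET request with an explicit Accept-Encoding."""
    return urllib.request.Request(
        url,
        headers={"Accept-Encoding": accept_encoding},
        method="GET",
    )


req = build_request(URL)
print(req.get_method(), req.full_url)
# urllib stores header names via str.capitalize(), hence "Accept-encoding".
print(req.header_items())
# Sending it would be: urllib.request.urlopen(req).status
# urllib does not decompress the body, which is fine for a pure status check.
```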

Sorry, I just realized I should have checked out the github-action-output branch.
Now it fails for me as well with 0.15.4:

$ mlc ./README.md

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+                                                          +
+            markup link checker - mlc v0.15.4             +
+                                                          +
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

09:31:29 [WARN] Broken reference link: Borrowed("possible values: md, html")
09:31:29 [WARN] Strip everything after #. The chapter part '#ci-pipeline-integration' is not checked.
[ OK ] ./README.md (19, 8) => #ci-pipeline-integration - 
[ OK ] ./README.md (64, 1) => ./docs/FailingAnnotation.PNG - 
[ OK ] ./README.md (32, 28) => https://doc.rust-lang.org/cargo/ - 
[ OK ] ./README.md (4, 2) => https://badgen.net/crates/d/mlc?color=blue - 
[ OK ] ./README.md (46, 56) => https://github.com/marketplace/actions/markup-link-checker-mlc - 
[ OK ] ./README.md (20, 29) => https://rust-lang.github.io/async-book/ - 
[ OK ] ./README.md (3, 2) => https://img.shields.io/crates/v/mlc.svg?color=orange - 
[ OK ] ./README.md (9, 1) => https://asciinema.org/a/299100 - 
[ OK ] ./README.md (9, 2) => https://asciinema.org/a/299100.svg - 
[ OK ] ./README.md (6, 2) => https://img.shields.io/badge/License-MIT-yellow.svg - 
[ OK ] ./README.md (5, 2) => https://github.com/becheran/mlc/actions/workflows/rust.yml/badge.svg - 
[ OK ] ./README.md (7, 2) => https://img.shields.io/badge/PRs-welcome-brightgreen.svg - 
[ OK ] ./README.md (3, 1) => https://crates.io/crates/mlc - 
[ OK ] ./README.md (4, 1) => https://crates.io/crates/mlc - 
[ OK ] ./README.md (32, 92) => https://crates.io/crates/mlc - 
[Err ] ./README.md (62, 22) => https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions - 403 - Forbidden
[ OK ] ./README.md (144, 60) => https://github.com/becheran/mlc/blob/master/LICENSE - 
[ OK ] ./README.md (75, 32) => https://github.com/becheran/ntest/blob/master/.github/workflows/ci.yml - 
[ OK ] ./README.md (79, 37) => https://hub.docker.com/repository/docker/becheran/mlc - 
[ OK ] ./README.md (140, 14) => https://github.com/becheran/mlc/blob/master/CHANGELOG.md - 
[ OK ] ./README.md (6, 1) => https://opensource.org/licenses/MIT - 
[ OK ] ./README.md (112, 221) => https://github.com/becheran/wildmatch - 
[ OK ] ./README.md (40, 54) => https://github.com/becheran/mlc/releases - 
[ OK ] ./README.md (5, 1) => https://github.com/becheran/mlc/actions/workflows/rust.yml - 
[ OK ] ./README.md (7, 1) => https://github.com/becheran/mlc/blob/master/CONTRIBUTING.md - 

Result (25 links):

OK       24
Skipped  0
Warnings 0
Errors   1


The following links could not be resolved:

./README.md (62, 22) => https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions.

Ah, right. I made the same mistake and ran it on the wrong branch locally 🤦‍♂️

@diegorondini would 'Accept-Encoding: *' help in this case? It might be a sane default since we don't care about the content anyway right now.

To make it configurable, I think a map of links with wildcards and associated headers would make sense as a config parameter. Will think about it.
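
One way to picture such a map is the Python sketch below. It uses the standard `fnmatch` module as a stand-in for whatever wildcard matching mlc would use, and the rule table, the `example.com` pattern and the `match_headers` helper are all hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical mapping from wildcard patterns to extra request headers.
HEADER_RULES = {
    "https://docs.github.com/*": {"Accept-Encoding": "zstd, br, gzip, deflate"},
    "https://*.example.com/private/*": {"X-Example": "1"},
}


def match_headers(url, rules):
    """Merge the headers of all patterns matching `url`; later rules win."""
    merged = {}
    for pattern, headers in rules.items():
        # fnmatch also treats "?" and "[...]" as wildcards, so patterns that
        # contain a literal query string would need escaping.
        if fnmatchcase(url, pattern):
            merged.update(headers)
    return merged


print(match_headers("https://docs.github.com/en/actions", HEADER_RULES))
print(match_headers("https://github.com/becheran/mlc", HEADER_RULES))
```

Letting later rules override earlier ones means a broad pattern can set a default while a narrower pattern listed after it adjusts single headers.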

@becheran well, not literally:

$ curl -i -H "Accept-Encoding: *" -X GET https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions
HTTP/2 403
[...]

A wildcard such as Accept-Encoding: */* is sometimes suggested as the way to mean any encoding, but I don't know how well it works in practice.
https://stackoverflow.com/questions/25182888/does-in-an-http-accepts-encoding-header-mean-gzip-is-supported

The library you're using (reqwest?) may support accepting all encodings. Libcurl does that:
https://curl.se/libcurl/c/CURLOPT_ACCEPT_ENCODING.html

I'm not sure, though, whether servers that don't support compression / encoding gracefully ignore the "Accept-Encoding" header.

Yes, I am using reqwest. I turned on all supported encodings (brotli, gzip, deflate) and that did the trick for now. But I guess there are other cases where a custom request is still required. For example, if an authentication token is required for a specific link.
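
The combination of a global default and a per-link token could be layered roughly as in this Python sketch. Everything in it is an assumption for illustration: the host, the `EXAMPLE_TOKEN` variable name, the bearer scheme and the `request_headers` helper are hypothetical, and the default header merely mirrors the encodings named above.

```python
import os

# Defaults sent with every request, mirroring the encodings named above.
DEFAULT_HEADERS = {"Accept-Encoding": "br, gzip, deflate"}


def request_headers(url, token_env_by_prefix, environ=os.environ):
    """Return the defaults, plus a bearer token if one is configured for `url`."""
    headers = dict(DEFAULT_HEADERS)
    for prefix, env_name in token_env_by_prefix.items():
        token = environ.get(env_name)
        if url.startswith(prefix) and token:
            headers["Authorization"] = f"Bearer {token}"
    return headers


# Hypothetical host and variable name; the token is read from the environment
# so that it never has to be written into a config file.
rules = {"https://private.example.com/": "EXAMPLE_TOKEN"}
fake_env = {"EXAMPLE_TOKEN": "s3cr3t"}
print(request_headers("https://private.example.com/wiki", rules, fake_env))
print(request_headers("https://crates.io/crates/mlc", rules, fake_env))
```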