curl/trurl

`trurl --trim scheme`?

spekulatius opened this issue · 3 comments

Hello @bagder,

Thank you for trurl!

I was wondering if trurl allows removing the scheme from URLs (to dedup them later). I've tried this, among other commands:

$ cat test
http://a.example.com/test
https://a.example.com/test

$ cat test | trurl --url-file - --trim scheme
trurl error: Unsupported trim component: scheme
trurl error: Try trurl -h for help

Expected:

$ cat test
http://a.example.com/test
https://a.example.com/test

$ cat test | trurl --url-file - --trim scheme
a.example.com/test
a.example.com/test

Using --set with an empty value or a space didn't work either.

Is there a way to drop the protocols using trurl?

Cheers,
Peter

Unless you use -g or --json, trurl only outputs valid URLs, one per line of output.
--set, --redirect, --trim, --append, --iterate, and --sort-query only modify the URL in a way that keeps it valid and re-parsable by libcurl (with the current flags: --accept-space, --no-guess-scheme, etc.).

You cannot use a --trim command that outputs something without a scheme, because that is not a valid URL.

If your goal is actually to print only the {host} and {path} parts of the URL, you can use -g '{:host}{:path}':

$ cat test
http://a.example.com/test/foo/./bar/..
xyz.example.org
https://b.example.com:20/test?hi#hello
ftp://emanuele6@c.example.org/hey.txt
$ trurl -f - < ./test
http://a.example.com/test/foo
http://xyz.example.org/
https://b.example.com:20/test?hi#hello
ftp://emanuele6@c.example.org/hey.txt
$ trurl -f - -g '{:host}{:path}' < ./test
a.example.com/test/foo
xyz.example.org/
b.example.com:20/test
c.example.org/hey.txt

You may also use {:host}{:path}{:query}{:fragment}, since {query} and {fragment} expand with ?/# at the start. It gets tricky if you also want to include other parts like {user} and {pass}: with -g '{:user}:{:pass}@{:host}{:path}', trurl would output :@a.example.org/foo for http://a.example.org/foo, which is probably not what you want.

Maybe the -g command could be improved to allow printing a full URL with some parts omitted to satisfy your use case, but I don't see how that would be useful. Can you explain why you are doing this?

Anyway, as a workaround, in the specific case of removing a scheme, if you really want to remove the scheme and nothing else from a full URL for some reason, I guess you can use something like this:

$ trurl -f - < ./test | sed -n 's@^[^:]*://@@p'
a.example.com/test/foo
xyz.example.org/
b.example.com:20/test?hi#hello
emanuele6@c.example.org/hey.txt
$ # or to only print http/https URLs, without the scheme
$ trurl -f - < ./test | sed -n 's@^https\{0,1\}://@@p'
a.example.com/test/foo
xyz.example.org/
b.example.com:20/test?hi#hello
$ # notice that trurl guessed the scheme for xyz.example.org as http://
$ # so it is printed.

This should be fine since trurl only outputs lines that contain one full valid URL and discards invalid URLs in the input. You can therefore assume that the scheme will not contain colons, so removing everything before the first ":" together with the "://" that follows will only remove the scheme.
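For the dedup use case, the sed workaround can be combined with sort -u. Here is a minimal sketch, assuming the input lines are already normalized URLs like the ones trurl prints. The helper name strip_scheme is hypothetical, and the sample lines are hard-coded stand-ins for trurl output:

```shell
#!/bin/sh
# strip_scheme: hypothetical helper, not part of trurl.
# Reads URLs on stdin and removes a leading "scheme://".
# Lines without "://" are dropped, because sed -n only prints on a match.
strip_scheme() {
    sed -n 's@^[^:]*://@@p'
}

# Sample input standing in for the output of `trurl -f -`.
printf '%s\n' \
    'http://a.example.com/test' \
    'https://a.example.com/test' |
    strip_scheme | sort -u
```

Both sample lines reduce to a.example.com/test, so sort -u prints it once. Note that the result is no longer a list of valid URLs.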

> I was wondering if trurl allows to remove the scheme (to dedup them later)

Oh, duh. Sorry, your example also had URLs that were identical except for the scheme, so I don't know how I missed that. :p

Still, I don't understand why you are trying to only remove the scheme.

In that case, you can simply set the scheme to the desired value, e.g. http://, and then pipe to sort -u or awk '!seen[$0]++', no?

$ trurl -f - -s 'scheme=http' < ./test | sort -u
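If the original input order matters, the awk variant mentioned above keeps the first occurrence of each line instead of sorting. A small sketch, assuming the scheme has already been forced to http. The helper name dedup_keep_order is hypothetical, and the sample lines are hard-coded rather than produced by trurl:

```shell
#!/bin/sh
# dedup_keep_order: hypothetical helper, not part of trurl.
# Prints each distinct input line once, in first-seen order.
dedup_keep_order() {
    awk '!seen[$0]++'
}

# Sample lines standing in for `trurl -f - -s 'scheme=http' < ./test`.
printf '%s\n' \
    'http://b.example.com/x' \
    'http://a.example.com/test' \
    'http://b.example.com/x' |
    dedup_keep_order
```

This prints http://b.example.com/x first and http://a.example.com/test second, whereas sort -u would reorder them.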

If you want to do something more complex like discarding non-http/https URLs, and keeping https:// if both http:// and https:// are specified, you can use jq:

$ trurl --json -f - < ./test | jq -r 'group_by(del(.url, .scheme, .raw_port))[] | first(("https", "http") as $s | .[] | select(.scheme == $s).url)'
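If jq is not available, a rough approximation of the same idea can be written with awk on the plain URL output. This is only a sketch under assumptions: it compares the text after the scheme instead of the parsed JSON parts, so it is not equivalent to the jq command in every case (for example, it does not account for ports). The helper name prefer_https is hypothetical, and the sample lines stand in for trurl output:

```shell
#!/bin/sh
# prefer_https: hypothetical helper, not part of trurl.
# Keeps only http/https URLs; when the same URL appears with both
# schemes, the https variant wins. Output is in first-seen order.
prefer_https() {
    awk '
        match($0, /^https?:\/\//) {
            key = substr($0, RLENGTH + 1)          # URL without the scheme
            if (!(key in best)) order[++n] = key   # remember first-seen order
            # Store the first variant seen, then let https override it.
            if (!(key in best) || $0 ~ /^https:/) best[key] = $0
        }
        END { for (i = 1; i <= n; i++) print best[order[i]] }
    '
}

# Sample input standing in for the output of `trurl -f -`.
printf '%s\n' \
    'http://a.example.com/test' \
    'https://a.example.com/test' \
    'ftp://c.example.org/hey.txt' \
    'http://xyz.example.org/' |
    prefer_https
```

With the sample input, the ftp:// line is discarded and a.example.com/test is kept once, as https://.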

I'm with @emanuele6. You can do this already with a few very simple workarounds: either decide to use -g and output all parts except the scheme, or just set a fixed scheme before you compare. I think "trurl only outputs valid URLs" is a good idea to stick to.