Porting `extract_href` tool into archivertools

Question

Opened this issue 7 years ago · 2 comments

@b5 mentioned that the extract_href tool would be a good fit within archivertools, and I agree. The tool scans an HTML page for links and writes them to a file. It makes sense for us to run it automatically in the constructor of Archiver and call Archiver.addUrl() on each link it finds.

It is currently implemented in Go, so we will need to port it to Python.
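
As a rough starting point for the port, here is a minimal sketch using only the Python standard library. It is not a translation of the Go tool: the names `HrefParser` and `extract_href`, the `base_url` parameter, and the choice to resolve relative links and keep duplicates are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefParser(HTMLParser):
    """Collect the href of every <a> tag, in document order (hypothetical helper)."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL (assumed behavior).
                self.links.append(urljoin(self.base_url, value))


def extract_href(html, base_url=""):
    """Return all link targets found in `html`; duplicates are kept."""
    parser = HrefParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


page = '<a href="/about">About</a> <a href="https://www.epa.gov/">EPA</a>'
links = extract_href(page, "https://example.org/index.html")
```

Inside the Archiver constructor, each entry of `links` could then be passed to Archiver.addUrl(), assuming that method accepts a single URL string.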

Answer 1 · 2017-09-14T00:15:34.000Z

Here's my attempt at this:
https://gist.github.com/ebenp/900cea9b3f3c3b1c747667e831303555#file-extract_href

Answer 2 · 2017-09-27T23:25:42.000Z

Updated the gist to extract href URLs. I don't get the same number of duplicates as the Go script, and my output is missing the https://www.epa.gov/ URLs, of which there are 3.
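
To pin down where the two outputs diverge, one option is to compare them as multisets so that duplicate counts are taken into account. This is only a sketch: `compare_url_lists` is a hypothetical helper, and it assumes each script's output has already been read into a list of URL strings.

```python
from collections import Counter


def compare_url_lists(expected, actual):
    """Return (missing, extra) as dicts mapping URL -> count difference."""
    expected_counts = Counter(expected)
    actual_counts = Counter(actual)
    # Counter subtraction keeps only positive counts.
    missing = expected_counts - actual_counts  # in the Go output, not in ours
    extra = actual_counts - expected_counts    # in ours, not in the Go output
    return dict(missing), dict(extra)


# Illustrative data only, not the real output of either script.
go_urls = ["https://www.epa.gov/"] * 3 + ["https://example.org/a"]
py_urls = ["https://example.org/a", "https://example.org/a"]
missing, extra = compare_url_lists(go_urls, py_urls)
```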