Porting `extract_href` tool into archivertools
Opened this issue · 2 comments
jeffreyliu commented
@b5 mentioned that the `extract_href` tool would be a good fit within archivertools, and I agree. The tool scans an HTML page for links and outputs them to a file, so it makes sense for us to run it automatically in the constructor of `Archiver` and call `Archiver.addUrl()` on each URL it returns.
It is currently implemented in Go, so we will need to port it to Python.
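As a rough starting point for the port, here is a minimal sketch using only the Python standard library. The names `HrefCollector` and `extract_href(html, base_url)` are hypothetical, and resolving relative links against a base URL is an assumption on my part; the Go tool may behave differently. Duplicates are kept on purpose so the output can be compared with the Go version.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper, not the Go tool's logic)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors without an href or with an empty one.
            if name == "href" and value:
                # Assumption: relative links are resolved against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_href(html, base_url):
    """Return all link targets found in `html`, in document order, duplicates included."""
    parser = HrefCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

In the `Archiver` constructor this could then be used along the lines of `for url in extract_href(page_html, page_url): self.addUrl(url)`, where `page_html` and `page_url` are placeholders for however the page is fetched.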
ebenp commented
Here's my attempt at this:
https://gist.github.com/ebenp/900cea9b3f3c3b1c747667e831303555#file-extract_href
ebenp commented
Updated the gist to extract href URLs. I don't get the same number of duplicates as the Go script, and I'm missing the https://www.epa.gov/ URLs, of which there are 3.
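One way to narrow down the mismatch is to compare per-URL counts from the two outputs. This is only a sketch: `diff_url_counts` is a hypothetical helper, and it assumes both scripts' outputs have already been loaded as lists of URL strings.

```python
from collections import Counter


def diff_url_counts(go_urls, py_urls):
    """Map each URL whose count differs to a (go_count, py_count) pair."""
    go_counts, py_counts = Counter(go_urls), Counter(py_urls)
    return {
        url: (go_counts[url], py_counts[url])
        for url in sorted(go_counts.keys() | py_counts.keys())
        if go_counts[url] != py_counts[url]
    }
```

A URL that the Go script emits three times and the Python port never emits would show up with the pair `(3, 0)`.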