datatogether/archivertools

Porting `extract_href` tool into archivertools

Opened this issue · 2 comments

@b5 mentioned that the `extract_href` tool would be a good fit within archivertools, and I agree. The tool scans an HTML page for links and writes them to a file. It makes sense for us to run it automatically in the constructor of `Archiver` and call `Archiver.addUrl()` on each link it returns.

It is currently implemented in Go, so we will need to port it to Python.
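
As a starting point for the port, here is a minimal sketch using only the Python standard library. It is not a translation of the Go code: the names `HrefExtractor` and `extract_hrefs` are hypothetical, and I'm assuming we want relative links resolved against the page URL and duplicates kept in document order.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links; absolute URLs pass through unchanged.
                self.links.append(urljoin(self.base_url, value))


def extract_hrefs(html, base_url=""):
    """Return all link targets found in html, duplicates included."""
    parser = HrefExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical use inside the Archiver constructor:
#     for url in extract_hrefs(page_html, base_url=page_url):
#         self.addUrl(url)
```

Whether duplicates should be kept or dropped before calling `Archiver.addUrl()` is an open question; this sketch keeps them so the output can be compared against the Go tool.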

ebenp commented

I updated the gist to extract href URLs. I don't get the same number of duplicates as the Go script, and I'm missing the URLs for https://www.epa.gov/, of which there are 3.
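
One way to pin down the difference might be to compare per-URL counts from the two outputs. This is only a sketch: `compare_url_counts` is a hypothetical helper, and it assumes each tool's output has already been read into a list of URL strings.

```python
from collections import Counter


def compare_url_counts(python_urls, go_urls):
    """Map each URL whose count differs to (python_count, go_count)."""
    py_counts = Counter(python_urls)
    go_counts = Counter(go_urls)
    diffs = {}
    # A Counter returns 0 for missing keys, so URLs found by only one
    # tool show up with a 0 on the other side.
    for url in py_counts.keys() | go_counts.keys():
        if py_counts[url] != go_counts[url]:
            diffs[url] = (py_counts[url], go_counts[url])
    return diffs
```

A URL that appears only in the Go output would be reported with a Python count of 0, which should make the missing entries easy to spot.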