Python script to automate web archiving with wget, wpull or httrack.
Clone the repository, create a virtual environment (currently Python 3.5+), and run pip install -r requirements.txt.
To use the HTTrack engine, the httrack package needs to be installed. Many distributions have it in their official repositories, e.g. Debian (apt-get install httrack). The application will try to convert the downloaded files to WARC automatically, but this conversion will fail if Java is not installed on the machine.
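Since the WARC conversion depends on Java, it can help to check for a Java runtime before starting an HTTrack job. The following is a minimal sketch and not part of the project: the function names are hypothetical, and it only assumes that the runtime is reachable as java on the PATH.

```python
import shutil


def java_available(executable="java"):
    """Return True if the given executable can be found on PATH."""
    return shutil.which(executable) is not None


def can_convert_httrack_to_warc():
    """Hypothetical pre-flight check before starting an HTTrack job."""
    if not java_available():
        # Without Java, the mirrored files stay unconverted.
        print("Java not found: HTTrack mirrors will not be converted to WARC.")
        return False
    return True
```

Calling can_convert_httrack_to_warc() before a run with --engine httrack gives an early warning instead of a failed conversion at the end of the download.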
warc-webarchiving.py is the main script, config\default.conf is the default configuration file, and config\example.xlsx is the default table.
usage: warc-webarchiving [-h] [--engine {httrack,wget,wpull}] [-c FILE] [-v] [-d]
                         {run,import,export}

Script to automate webarchiving with wget.

positional arguments:
  {run,import,export}   run warc-webarchiving or work with database

optional arguments:
  -h, --help            show this help message and exit
  --engine {httrack,wget,wpull}
                        choose engine for archving (default: wget); overridden
                        by database
  -c FILE, --config FILE
                        custom path to user config file (default:
                        ./config/default.conf)
  -v, --verbose         enable verbose mode and get verbose info
  -d, --debug           enable debug mode and get debug info
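For readers who want to see how such a command line can be declared, here is a sketch of an argparse parser that accepts the same arguments as the help text above. It is reconstructed from the usage output only and is not the project's actual source; the function name build_parser and the destination name command are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented usage (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="warc-webarchiving",
        description="Script to automate webarchiving with wget.",
    )
    # Positional sub-command: run the archiving or work with the database.
    parser.add_argument(
        "command",
        choices=["run", "import", "export"],
        help="run warc-webarchiving or work with database",
    )
    parser.add_argument(
        "--engine",
        choices=["httrack", "wget", "wpull"],
        default="wget",
        help="choose engine for archiving (default: wget); overridden by database",
    )
    parser.add_argument(
        "-c", "--config",
        metavar="FILE",
        default="./config/default.conf",
        help="custom path to user config file (default: ./config/default.conf)",
    )
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="enable verbose mode and get verbose info")
    parser.add_argument("-d", "--debug", action="store_true",
                        help="enable debug mode and get debug info")
    return parser
```

For example, build_parser().parse_args(["--engine", "wpull", "run"]) yields a namespace with engine set to "wpull" and command set to "run".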
Please note that sites mirrored using the HTTrack engine are put in the subdirectory ./src/httrack to match the behavior of the other two engines. If the conversion using the NLA's httrack2warc fails (or is not to be used), the original files will remain available only inside this directory.
In this example, the table example.xlsx (referred to in default.conf) is imported and a download with the default engine (wget) is started. Afterwards, the SQLite database is written into another sheet inside example.xlsx.
python warc-webarchiving.py import
python warc-webarchiving.py run
python warc-webarchiving.py export
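The three steps above can also be chained from a small wrapper script. This is only a sketch under the assumption that it is started from the repository root, next to warc-webarchiving.py; the helper names build_commands and run_all are hypothetical and not part of the project.

```python
import subprocess
import sys

SCRIPT = "warc-webarchiving.py"
STEPS = ("import", "run", "export")


def build_commands(engine=None, config=None):
    """Return one command line per step, in the documented order."""
    commands = []
    for step in STEPS:
        cmd = [sys.executable, SCRIPT]
        if engine:
            cmd += ["--engine", engine]
        if config:
            cmd += ["--config", config]
        cmd.append(step)
        commands.append(cmd)
    return commands


def run_all(engine=None, config=None):
    """Run import, run and export in sequence, stopping at the first failure."""
    for cmd in build_commands(engine, config):
        # check=True raises CalledProcessError if a step exits non-zero.
        subprocess.run(cmd, check=True)
```

Calling run_all() reproduces the example with the default engine; run_all(engine="wpull") passes --engine wpull to each step.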
This project can be run independently of local Python versions by using Docker and Docker Compose. Both import and run are executed when started with docker-compose up --build.
Configuration is still a bit patchy when using Docker. Inside the respective .conf, both downloaddir = ./WARC and excelfile = ./config/ must not be changed, although the specific Excel file has to be named. Instead, most configuration is handled by Docker's .env file: DOWNLOAD_VOLUME is mapped into downloaddir inside the .conf. The arguments --config and --engine are also handled by Docker.
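The two constraints on the .conf can be checked before starting the containers. The sketch below is an illustration, not part of the project: it assumes the file consists of plain key = value lines (as in downloaddir = ./WARC), ignores any section headers and comment lines, and uses the hypothetical helper names parse_conf and check_docker_conf.

```python
def parse_conf(text):
    """Collect key = value pairs, skipping blanks, comments and section headers."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";", "[")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            values[key.strip()] = value.strip()
    return values


def check_docker_conf(text):
    """Return a list of problems that would break the Docker setup."""
    values = parse_conf(text)
    problems = []
    if values.get("downloaddir") != "./WARC":
        problems.append("downloaddir must stay ./WARC")
    excelfile = values.get("excelfile", "")
    if not excelfile.startswith("./config/"):
        problems.append("excelfile must stay inside ./config/")
    elif excelfile == "./config/":
        problems.append("excelfile must name a specific Excel file")
    return problems
```

An empty list means both settings are compatible with the Docker setup described above; for instance, a file containing downloaddir = ./WARC and excelfile = ./config/example.xlsx passes the check.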
Copyright (c) 2019 Rasmus Kriest. The code in this project is licensed under the MIT license.