Some scripts to download the Wikipedia dumps (available at https://dumps.wikimedia.org/).
./scripts/wikidump-download.sh https://dumps.wikimedia.org/eswiki/20210201/
Results are in the folder data/eswiki/20210201.
wikidump-download.sh runs the steps described below. As an example, we download the pages-meta-history dump for the Spanish Wikipedia (eswiki) made on 2021-02-01, which is available at https://dumps.wikimedia.org/eswiki/20210201/.
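The results folder appears to mirror the last two path components of the dump URL (wiki name and dump date). The following is a minimal sketch of that mapping, not the code used by the scripts; the function name dump_dir is hypothetical.

```shell
# Hypothetical helper (not part of the repository): map a dump URL such as
# https://dumps.wikimedia.org/eswiki/20210201/ to data/eswiki/20210201.
# The body runs in a subshell so the temporary variables stay local.
dump_dir() (
    url="${1%/}"          # drop one trailing slash, if present
    date="${url##*/}"     # last path component, e.g. 20210201
    rest="${url%/*}"      # everything before the last component
    wiki="${rest##*/}"    # wiki name, e.g. eswiki
    printf 'data/%s/%s\n' "$wiki" "$date"
)

dump_dir "https://dumps.wikimedia.org/eswiki/20210201/"
```

The sketch assumes the URL always ends with the wiki name followed by the dump date, as in the example above.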
To download the file sizes, we execute the following command:
./sizes/download_sizes.sh https://dumps.wikimedia.org/eswiki/20210201/
Results are contained in ./sizes/2021-02-01.eswiki.pages-meta-history.txt:
$ head -n5 ./sizes/2021-02-01.eswiki.pages-meta-history.txt
eswiki-20210201-pages-meta-history1.xml-p1p1094.7z 163.8 MB
eswiki-20210201-pages-meta-history1.xml-p1095p2904.7z 174.6 MB
eswiki-20210201-pages-meta-history1.xml-p2905p4657.7z 170.4 MB
eswiki-20210201-pages-meta-history1.xml-p4658p6270.7z 159.9 MB
eswiki-20210201-pages-meta-history1.xml-p6271p8248.7z 181.0 MB
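Since each line of the sizes file has the form "<file> <number> <unit>", it can be used to estimate the disk space needed before downloading. The sketch below is an illustration only; the function name total_mb is hypothetical, and it assumes the units are KB, MB or GB with decimal (1000-based) conversion.

```shell
# Hypothetical helper: sum the sizes listed in a sizes file and print the
# total in MB. Lines with an unrecognised unit are ignored.
total_mb() {
    LC_ALL=C awk '
        $3 == "KB" { sum += $2 / 1000 }
        $3 == "MB" { sum += $2 }
        $3 == "GB" { sum += $2 * 1000 }
        END { printf "%.1f\n", sum }
    ' "$1"
}

# Usage:
#   total_mb ./sizes/2021-02-01.eswiki.pages-meta-history.txt
```

For the five lines shown above, this would print 849.7.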
To build the download list from the sizes file, we execute the following command:
./downloadlists/make_lists.sh sizes/2021-02-01.eswiki.pages-meta-history.txt
Results are contained in ./downloadlists/2021-02-01.eswiki.pages-meta-history.txt:
$ head -n5 ./downloadlists/2021-02-01.eswiki.pages-meta-history.txt
https://dumps.wikimedia.org/eswiki/20210201/eswiki-20210201-pages-meta-history1.xml-p1p1094.7z http://dumps.wikimedia.your.org/eswiki/20210201/eswiki-20210201-pages-meta-history1.xml-p1p1094.7z
https://dumps.wikimedia.org/eswiki/20210201/eswiki-20210201-pages-meta-history1.xml-p1095p2904.7z http://dumps.wikimedia.your.org/eswiki/20210201/eswiki-20210201-pages-meta-history1.xml-p1095p2904.7z
https://dumps.wikimedia.org/eswiki/20210201/eswiki-20210201-pages-meta-history1.xml-p2905p4657.7z http://dumps.wikimedia.your.org/eswiki/20210201/eswiki-20210201-pages-meta-history1.xml-p2905p4657.7z
https://dumps.wikimedia.org/eswiki/20210201/eswiki-20210201-pages-meta-history1.xml-p4658p6270.7z http://dumps.wikimedia.your.org/eswiki/20210201/eswiki-20210201-pages-meta-history1.xml-p4658p6270.7z
https://dumps.wikimedia.org/eswiki/20210201/eswiki-20210201-pages-meta-history1.xml-p6271p8248.7z http://dumps.wikimedia.your.org/eswiki/20210201/eswiki-20210201-pages-meta-history1.xml-p6271p8248.7z
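Each line of the download list pairs the primary URL of a file with the URL of the same file on a mirror. A rough sketch of how such lines could be produced from a sizes file follows. It is not the implementation of make_lists.sh; the function name make_list_lines is hypothetical, and it separates the two URLs with a tab on the assumption that the list is fed to aria2, whose input-file format expects tab-separated URIs for a single file.

```shell
# Hypothetical helper: read a sizes file on stdin and print one line per
# file with "<primary URL><TAB><mirror URL>".
#   $1: primary base URL, $2: mirror base URL
make_list_lines() (
    base="${1%/}"
    mirror="${2%/}"
    awk -v base="$base" -v mirror="$mirror" '
        NF { printf "%s/%s\t%s/%s\n", base, $1, mirror, $1 }
    '
)

# Usage:
#   make_list_lines https://dumps.wikimedia.org/eswiki/20210201/ \
#       http://dumps.wikimedia.your.org/eswiki/20210201/ \
#       < ./sizes/2021-02-01.eswiki.pages-meta-history.txt
```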
To download the files in the list, we execute the following command:
./scripts/download.sh -d ./downloadlists/2021-02-01.eswiki.pages-meta-history.txt
The files will be saved in the directory ./data/eswiki/20210201:
$ find data/eswiki/20210201 -type f | sort -V | head -n5
data/eswiki/20210201/eswiki-20210201-pages-meta-history1.xml-p1p1094.7z
data/eswiki/20210201/eswiki-20210201-pages-meta-history1.xml-p1095p2904.7z
data/eswiki/20210201/eswiki-20210201-pages-meta-history1.xml-p2905p4657.7z
data/eswiki/20210201/eswiki-20210201-pages-meta-history1.xml-p4658p6270.7z
data/eswiki/20210201/eswiki-20210201-pages-meta-history1.xml-p6271p8248.7z
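After a download it can be useful to confirm that every file in the download list is present in the data directory. The following is a minimal sketch under that assumption; the function name missing_files is hypothetical and it only checks that each file exists, not that it is complete.

```shell
# Hypothetical helper: print the name of every file in the download list
# that is not present in the data directory.
#   $1: download list, $2: data directory
missing_files() (
    list="$1"
    dir="${2%/}"
    # Only the first URL on each line is needed to get the file name.
    while read -r url _rest || [ -n "$url" ]; do
        [ -n "$url" ] || continue
        name="${url##*/}"
        [ -f "$dir/$name" ] || printf '%s\n' "$name"
    done < "$list"
)

# Usage:
#   missing_files ./downloadlists/2021-02-01.eswiki.pages-meta-history.txt \
#       ./data/eswiki/20210201
```

An empty output means that all listed files were found.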
Usage: download_sizes.sh [options] <dump_url>
       download_sizes.sh ( -h | --help )
       download_sizes.sh --version
Arguments:
  <dump_url>                 The wikidump base url
                             e.g. https://dumps.wikimedia.org/enwiki/20180301/
Options:
  -d, --debug                Enable debug mode.
  -f, --ext <ext>            Extension of the files to filter [default: .7z].
  -t, --filetype <filetype>  Type of files to filter
                             [default: pages-meta-history].
  -h, --help                 Show this help message and exits.
  --version                  Print version and copyright information.
Usage: make_lists.sh [options] <sizefile>
Options:
  -d, --debug    Enable debug mode.
  -h, --help     Show this help message and exits.
  --version      Print version and copyright information.
Usage: download.sh [options] <downloadlist>
       download.sh --kill
       download.sh ( -h | --help )
       download.sh --version
Arguments:
  <downloadlist>   File listing the URLs to download
                   (e.g. ./downloadlists/2021-02-01.eswiki.pages-meta-history.txt)
Options:
  -c, --continue   Continue the previous download.
  -d, --debug      Enable debug mode (incompatible with --quiet).
  -k, --kill       Kill connection.
  -q, --quiet      Suppress output (incompatible with --debug).
  -h, --help       Show this help message and exits.
  --version        Print version and copyright information.
This script requires Bash version 4.
The following software is needed to run the wikidump-download scripts:
- realpath
- aria2: a powerful command-line downloader
- docopts: a command-line interface tool for the shell
- GNU parallel: a shell tool for executing jobs in parallel
- xidel: a command-line tool to download and extract data from HTML/XML pages
Typically, you can find realpath and parallel packaged in your distro.
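Before running the scripts, one way to check that the required tools are installed is to look up each command on the PATH. The snippet below is a sketch, not part of the repository; the function name missing_deps is hypothetical, and the command names assume the usual executables (aria2 installs aria2c, GNU parallel installs parallel).

```shell
# Hypothetical helper: print every command from the argument list that
# cannot be found on the PATH.
missing_deps() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Lists the required tools that are not installed (prints nothing if all are).
missing_deps realpath aria2c docopts parallel xidel
```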