Pycndl is a CLI that downloads many web resources concurrently using multiple threads.
- downloads a large number of files quickly
- auto-retry
- log errors, stats, and progress
- easy to retry from a log
$ pip install git+https://github.com/hrsma2i/pycndl.git
$ docker pull ghcr.io/hrsma2i/pycndl
Add the following to your ~/.bashrc:
alias cndl='docker run --rm -v $HOME:$HOME -w $(pwd) -e GOOGLE_APPLICATION_CREDENTIALS=$HOME/.config/gcloud/application_default_credentials.json -e GOOGLE_CLOUD_PROJECT=$(gcloud config get-value project) ghcr.io/hrsma2i/pycndl cndl'
$ source ~/.bashrc
$ cndl input.json ./downloaded/ 2>&1 | tee log-`date +%Y%m%d%H%M%S`.jsonlines
The contents at the URLs listed in input.json are downloaded to ./downloaded/.
Each entry in input.json must have a url field, like:
[
{
"url": "http://example_1.jpg"
},
{
"url": "http://example_2.png"
},
...
]
Entries may contain fields other than url and filename; those extra fields are ignored.
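If the URLs come from another program, the input file can be generated rather than written by hand. The following is a minimal sketch, assuming only the format described above (a JSON array of objects with a url field). The function name build_input and the example URLs are hypothetical and are not part of pycndl.

```python
import json


def build_input(urls):
    """Return JSON text for an input file: a list of {"url": ...} objects."""
    return json.dumps([{"url": u} for u in urls], indent=2)


# Hypothetical URLs, for illustration only.
urls = ["http://example.com/a.jpg", "http://example.com/b.png"]
text = build_input(urls)
print(text)
```

Writing the returned text to input.json gives a file that can be passed to cndl as shown earlier.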
JSON Lines (newline-delimited JSON) is also supported.
Give the input file the suffix .jsonl or .jsonlines:
$ cndl input.jsonlines ./downloaded/
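An existing list of entries can be converted to JSON Lines by writing one compact JSON object per line. This is a small sketch of that conversion. The function name to_json_lines and the sample entries are hypothetical.

```python
import json


def to_json_lines(entries):
    """Serialize a list of dicts as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(entry) + "\n" for entry in entries)


# Hypothetical entries, for illustration only.
entries = [
    {"url": "http://example.com/a.jpg"},
    {"url": "http://example.com/b.png"},
]
print(to_json_lines(entries), end="")
```

Saving the result with a .jsonl or .jsonlines suffix makes it usable as an input file.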
CSV is also supported.
$ cndl input.csv ./downloaded/
More details:
$ cndl --help
You can rename downloaded files by setting the filename field:
[
{
"url": "http://example_1.jpg",
"filename": "foo_1.jpg"
},
...
]
http://example_1.jpg will be downloaded as ./downloaded/foo_1.jpg.
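When many URLs need explicit names, the filename field can be filled in programmatically. The sketch below derives a name from the last path segment of each URL and prefixes it with an index so that two URLs ending in the same segment do not collide. The naming scheme, the fallback name "download", and the function name with_filenames are all assumptions made for illustration; pycndl does not prescribe them.

```python
import posixpath
from urllib.parse import urlsplit


def with_filenames(urls):
    """Build entries with a url field and an index-prefixed filename field."""
    entries = []
    for i, url in enumerate(urls):
        # Last path segment, ignoring the query string; may be empty.
        base = posixpath.basename(urlsplit(url).path) or "download"
        entries.append({"url": url, "filename": f"{i:04d}_{base}"})
    return entries


# Hypothetical URLs, for illustration only.
for entry in with_filenames([
    "http://example.com/img/cat.jpg?size=large",
    "http://example.com/",
]):
    print(entry)
```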
The output directory can also be a Google Cloud Storage path:
$ cndl input.json gs://bucket/downloaded
Failed URLs are automatically retried up to --max-retry times.
You can also retry using the log file as the next input:
$ grep 'failed to retry downloading' log-YYYYmmddHHMMSS.jsonlines | jq -c '{url: .url}' > input-`date +%Y%m%d%H%M%S`.jsonlines
$ cndl input-YYYYmmddHHMMSS.jsonlines ./downloaded/ 2>&1 | tee log-`date +%Y%m%d%H%M%S`.jsonlines
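The same filtering can be done without grep and jq. The sketch below mirrors the pipeline above: it keeps log lines containing the text 'failed to retry downloading', parses them as JSON, and emits one object with only the url field per line. It assumes, as the jq command does, that each matching log line is a JSON object with a url field. Unlike the jq command, it skips lines that are not valid JSON or lack url. The function name failed_entries and the sample log lines (including the message field name) are hypothetical.

```python
import json

MESSAGE = "failed to retry downloading"


def failed_entries(log_lines):
    """Return [{"url": ...}] for each log line that reports a final failure."""
    entries = []
    for line in log_lines:
        if MESSAGE not in line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(record, dict) and "url" in record:
            entries.append({"url": record["url"]})
    return entries


# Hypothetical log lines; the real log may contain other fields.
sample_log = [
    '{"message": "downloaded", "url": "http://example.com/a.jpg"}',
    '{"message": "failed to retry downloading", "url": "http://example.com/b.png"}',
    "not json: failed to retry downloading",
]
for entry in failed_entries(sample_log):
    print(json.dumps(entry))
```

With a real log, the lines would come from an open file object, and the printed output would be redirected to the next input file.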