/PixolutionImageDownloader

Lightweight bulk image url downloader written in Python3

Primary LanguagePythonApache License 2.0Apache-2.0

Pixolution Image Downloader

Lightweight bulk image url list downloader written in Python3 by pixolution.org

It provides the following features:

  • RateLimiter with throttling max downloads per interval using a simple token bucket algorithm without queue
  • Multithreaded downloads
  • Preserves the context path of the images (http://foo.bar/imgs/abs/img.jpg is stored into img/abs/img.jpg)
  • Creates a file img_list_name.txt_errors.log containing failed images
  • Can store images into download folder tree or directly into a tar file
  • Low memory usage even with huge url lists by using BoundedExecutor that create threads in chunks
  • Download progress bar with downloads/second (using tqdm, but big memory footprint)

Develop

Install the project into your local system as symlinked source:

python3 setup.py develop

virtual environment

You should use venv when working on the project.

Run once in the project folder:

python3 -mvenv .
source bin/activate
pip3 install -r requirements.txt

Run before working:

source bin/activate

[ DO YOUR DEVELOPMENT WORK ]

deactivate

Install

Install requirements:

sudo apt install python3-setuptools python3-pip

Install the project into your local system:

cd PixolutionImageDownloader/
python3 setup.py install

After install it is available as pxl_downloader in your systems CLI. Use it like this:

pxl_downloader --threads=8 download --tarfile --ratelimit-interval=2 --ratelimit-downloads=50 samples.csv downloads/

Deinstall it with:

python3 setup.py uninstall

Tests

To run a single test use:

python3 -m unittest tests/test_download_filetree.py

To run all available tests use:

python3 -m unittest discover tests/

Use it via run.sh script in project root or with pxl_downloader command after install

user@pixolution:~$ pxl_downloader --help
usage: pxl_downloader [-h] [--threads THREADS] [--verbose]
                      {download,status} ... image_list_file download_folder

Lightweight mass image downloader written in Python3.

positional arguments:
  {download,status}  available commands
    download         Download a list of images
    status           Check the download folder and the given image list file
                     and print some stats about that
  image_list_file    A file with urls defered by newlines
  download_folder    A folder to download the images to.

optional arguments:
  -h, --help         show this help message and exit
  --threads THREADS  Number of threads to download or status check in parallel
  --verbose          Show each image url to download in stdout instead of
                     default progress bar

♥ Crafted with love in Berlin by pixolution.org ♥

Download options:

user@pixolution:~$ pxl_downloader download --help
usage: pxl_downloader download [-h] [--tarfile] [--progressbar]
                               [--ratelimit-interval RATELIMIT_INTERVAL]
                               [--ratelimit-downloads RATELIMIT_DOWNLOADS]

optional arguments:
  -h, --help            show this help message and exit
  --tarfile             Store downloaded images directly into tarfile instead
                        of file structure
  --progressbar         Show a tqdm progress bar. This needs more RAM because
                        we need to put the image file list into RAM before we
                        can start.
  --ratelimit-interval RATELIMIT_INTERVAL
                        Interval in seconds (minimum 1.0) for the rate
                        limiter. Default is 1.0 seconds.
  --ratelimit-downloads RATELIMIT_DOWNLOADS
                        Number of downloads per interval (default interval 1
                        second). If negative no rate limit is applied. Default
                        is -1