MPI-based distributed downloading tool for retrieving data from diverse domains.
This MPI-based distributed downloader was initially designed to download all images from the
monthly GBIF occurrence snapshot. The overall setup is general enough to be useful beyond our
original use case: it should work on any list of URLs. We chose to
build this tool instead of using something like img2dataset to better avoid
overloading source servers (GBIF documents approximately 200M images across 545 servers) and to have more control over the
final dataset construction and metadata management (e.g., using HDF5 as discussed
in issue #1).
- Install Miniconda
- Create a new conda environment:
  ```bash
  conda env create -f environment.yaml --solver=libmamba -y
  ```
- Install Python 3.10 or higher
- Install MPI (any MPI implementation should work; tested with OpenMPI and IntelMPI). Installation instructions can be found on the official websites.
- Install the required package:
  - For general use:
    ```bash
    pip install git+https://github.com/Imageomics/distributed-downloader
    ```
  - For development:
    ```bash
    pip install -e .[dev]
    ```
distributed-downloader utilizes multiple nodes on a High Performance Computing (HPC) system (specifically, an HPC
system with the Slurm workload manager) to download a collection of images specified in a given tab-delimited text file.
There is one manual step to get the downloader running as designed:
you need to call the function `download_images` from the package `distributed_downloader` with the `config_path` as an argument.
This will initialize the file structure in the output folder, partition the input file, profile the servers for their
possible download speed, and start downloading images. If downloading didn't finish, you can call the same function with
the same `config_path` argument to resume downloading.
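For example, a minimal driver script might look like the sketch below (the config file path is a placeholder for your own setup; the exact function signature should be checked against the package):

```python
from distributed_downloader import download_images

# Path to your downloader config file (placeholder; adjust to your setup).
config_path = "config/example_config.yaml"

# Initializes the output file structure, partitions the input file,
# profiles the servers, and starts downloading images.
# Calling it again with the same config_path resumes an unfinished download.
download_images(config_path)
```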
The downloader has two logging profiles:

- `INFO` - logs only the most important information, for example when a batch is started and finished. It also logs any error that occurred during download, image decoding, or writing a batch to the filesystem.
- `DEBUG` - logs all information, for example the start and finish of each downloaded image.
After downloading is finished, you can use the `tools` package to perform various operations on the downloaded images.
To do this, call the function `apply_tools` from the package `distributed_downloader` with the `config_path`
and `tool_name` as arguments (see the example after the list below).
The following tools are available:
- `resize` - resizes images to a new size
- `image_verification` - verifies images by checking if they are corrupted
- `duplication_based` - removes duplicate images
- `size_based` - removes images that are too small
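As a sketch, applying the `resize` tool could look like this (the import path and argument order are assumptions based on the description above; check the package for the exact API):

```python
from distributed_downloader import apply_tools

# Path to the same config file used for downloading (placeholder; adjust to your setup).
config_path = "config/example_config.yaml"

# Resize all downloaded images to the new size configured for the tool.
apply_tools(config_path, "resize")
```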
You can also add your own tool; the instructions are in the section below.
You can add your own tool by creating 3 classes and registering them with the respective decorators.
- Each tool's output will be saved in a separate folder in `{config.output_structure.tools_folder}/{tool_name}`.
- There are 3 steps in the tool pipeline: `filter`, `scheduler`, and `runner`:
  - `filter` - filters the images that should be processed by the tool and creates CSV files with them
  - `scheduler` - creates a schedule for processing the images for MPI
  - `runner` - processes the images using MPI
- Each step should be implemented in a separate class.
- The tool name should be the same across all classes.
- Each tool should inherit from the `ToolsBase` class.
- Each tool should have a `run` method that will be called by the main script.
- Each tool should be registered with a decorator from the respective package (`FilterRegister` from `filters`, etc.).
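A rough sketch of such a tool is below. Only `ToolsBase`, `FilterRegister`, and the `run` method are named above; the other decorator names, the import paths, and the registration style are assumptions for illustration, so check the package source for the exact API.

```python
# Hypothetical "grayscale_check" tool; import paths and the scheduler/runner
# register decorators are assumptions, not the documented API.
from distributed_downloader.tools import ToolsBase, FilterRegister
from distributed_downloader.tools import SchedulerRegister, RunnerRegister  # assumed names


@FilterRegister("grayscale_check")
class GrayscaleCheckFilter(ToolsBase):
    def run(self):
        # Select the images this tool should process and write them out as CSV files.
        ...


@SchedulerRegister("grayscale_check")  # assumed decorator name
class GrayscaleCheckScheduler(ToolsBase):
    def run(self):
        # Build the MPI processing schedule from the filter's CSV output.
        ...


@RunnerRegister("grayscale_check")  # assumed decorator name
class GrayscaleCheckRunner(ToolsBase):
    def run(self):
        # Process the scheduled images in parallel with MPI.
        ...
```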
All scripts can expect the following custom environment variables to be set; tool-specific variables are only initialized when the respective tool is called:
- General parameters:
  - `CONFIG_PATH`
  - `ACCOUNT`
  - `PATH_TO_INPUT`
  - `PATH_TO_OUTPUT`
  - `OUTPUT_URLS_FOLDER`
  - `OUTPUT_LOGS_FOLDER`
  - `OUTPUT_IMAGES_FOLDER`
  - `OUTPUT_SCHEDULES_FOLDER`
  - `OUTPUT_PROFILES_TABLE`
  - `OUTPUT_IGNORED_TABLE`
  - `OUTPUT_INNER_CHECKPOINT_FILE`
  - `OUTPUT_TOOLS_FOLDER`
- Specific to the downloader:
  - `DOWNLOADER_NUM_DOWNLOADS`
  - `DOWNLOADER_MAX_NODES`
  - `DOWNLOADER_WORKERS_PER_NODE`
  - `DOWNLOADER_CPU_PER_WORKER`
  - `DOWNLOADER_HEADER`
  - `DOWNLOADER_IMAGE_SIZE`
  - `DOWNLOADER_LOGGER_LEVEL`
  - `DOWNLOADER_BATCH_SIZE`
  - `DOWNLOADER_RATE_MULTIPLIER`
  - `DOWNLOADER_DEFAULT_RATE_LIMIT`
- Specific to the tools:
  - `TOOLS_NUM_WORKERS`
  - `TOOLS_MAX_NODES`
  - `TOOLS_WORKERS_PER_NODE`
  - `TOOLS_CPU_PER_WORKER`
  - `TOOLS_THRESHOLD_SIZE`
  - `TOOLS_NEW_RESIZE_SIZE`
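For instance, a script in a tool pipeline might read these variables like this (a minimal sketch; the fallback value is an arbitrary placeholder, not a documented default):

```python
import os

# General parameters are always set for scripts launched by the pipeline.
config_path = os.environ["CONFIG_PATH"]
images_folder = os.environ["OUTPUT_IMAGES_FOLDER"]

# Tool-specific variables are only initialized when the respective tool is called,
# so guard against them being absent. The fallback value here is a placeholder.
new_size = int(os.environ.get("TOOLS_NEW_RESIZE_SIZE", "224"))
```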