Welcome to the Dask Experiment repository! This project aims to explore and demonstrate the benefits (and potential drawbacks) of using Dask for parallel computing in Python, particularly in the context of processing large datasets.
- Overview
- Project Structure
- Getting Started
- Usage
- Performance Observations
- Optimization and Future Work
- Contributing
- License
This repository contains a set of scripts and Docker configurations designed to test and compare the performance of computational tasks when executed using Dask versus standard Python execution. The primary objective is to measure the performance impact of using Dask for parallel processing, especially for tasks involving large datasets.
dask-experiment/
│
├── Dockerfile # Docker configuration for running Dask containers
├── docker-compose.yml # Docker Compose file for managing multiple Dask services
├── main.py # Main script for running the experiment
├── no_dask.py # Script demonstrating the task without Dask
├── with_dask.py # Script demonstrating the task with Dask
└── README.md # This README file
Before you begin, ensure you have the following installed on your machine:
- Docker: Docker Installation Guide
- Docker Compose: Docker Compose Installation Guide
Clone this repository to your local machine:
git clone https://github.com/yourusername/dask-experiment.git
cd dask-experiment
The experiment can be run using Docker. Follow the steps below:
- Build and Start the Docker Containers:
  docker-compose up --build
  This will start the Dask scheduler, worker, and client services.
- Access the Dask Client Container:
  Once the containers are running, open a bash shell in the client container:
  docker exec -it dask-experiment-client-1 /bin/bash
- Navigate to the Project Directory:
  Inside the container, navigate to the /app directory:
  cd /app
- Run the Experiment:
  You can run the main experiment script with or without Dask:
  - With Dask:
    python main.py --dask
  - Without Dask:
    python main.py
The main.py script accepts a command-line argument, --dask, that switches between running the task through Dask and running it in plain Python. This makes it easy to compare the performance of the two approaches.
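As an illustration of how such a switch could be wired up, here is a minimal sketch. It is not the repository's actual main.py: the workload (a sum of squares), the function names run_plain and run_with_dask, and the chunking are all assumptions chosen to keep the example short.

```python
import argparse
import time


def build_parser() -> argparse.ArgumentParser:
    """Create a parser with a single boolean --dask flag."""
    parser = argparse.ArgumentParser(
        description="Run the experiment with or without Dask."
    )
    parser.add_argument(
        "--dask",
        action="store_true",
        help="Run the task through Dask instead of plain Python.",
    )
    return parser


def run_plain(n: int) -> int:
    # Hypothetical stand-in workload: sum of squares in plain Python.
    return sum(i * i for i in range(n))


def run_with_dask(n: int) -> int:
    # Imported lazily so the plain path works even if Dask is not installed.
    import dask.array as da

    # Split the range into roughly four chunks (an arbitrary choice).
    x = da.arange(n, chunks=max(1, n // 4))
    return int((x * x).sum().compute())


def main(argv=None) -> None:
    args = build_parser().parse_args(argv)
    runner = run_with_dask if args.dask else run_plain

    start = time.perf_counter()
    result = runner(100_000)
    elapsed = time.perf_counter() - start

    mode = "dask" if args.dask else "plain"
    print(f"mode={mode} result={result} elapsed={elapsed:.3f}s")
```

Calling main([]) exercises the plain path and main(["--dask"]) the Dask path. Adding the usual `if __name__ == "__main__": main()` guard would make the sketch runnable as a script.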
Based on the initial experiments:
- Without Dask: The task completed in approximately 14 seconds.
- With Dask: The task completed in approximately 38 seconds (roughly 2.7 times slower), with a warning about chunk size adjustments.
Key Insights:
- Dask's scheduling and communication overhead can outweigh the gains from parallelism for smaller tasks or simpler operations.
- Dask's true benefits are more likely to be realized with larger datasets or more complex workflows.
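Single-run timings like the ones above can be noisy, so repeating each measurement may give a fairer comparison. The helper below is a minimal sketch using only the standard library. The names time_runs and summarize are hypothetical and not part of this repository.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(func: Callable[[], object], repeats: int = 5) -> List[float]:
    """Call func `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Reduce a list of durations to min, median, and max."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


# Example with a placeholder workload; swap in the real task to compare modes.
stats = summarize(time_runs(lambda: sum(range(100_000)), repeats=3))
print({name: f"{value:.4f}s" for name, value in stats.items()})
```

Reporting the median alongside the minimum and maximum makes it easier to tell a consistent slowdown from a one-off startup cost.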
Future improvements and potential areas for exploration include:
- Chunk Size Optimization: Experiment with different chunk sizes to improve Dask performance.
- Task Granularity: Evaluate how breaking down tasks into smaller units affects performance.
- Resource Allocation: Fine-tune CPU and memory limits for containers to better manage resources.
- Profiling: Use Dask's built-in profiling tools to gain insights into task execution and identify bottlenecks.
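For the chunk size and task granularity items, it can help to estimate how many chunks a given setting produces before running anything, since each chunk becomes at least one task for the scheduler. The sketch below is plain arithmetic, not a Dask API. The array shape, the candidate chunk shapes, and the 8-byte item size are assumptions for illustration only.

```python
import math
from typing import Sequence


def chunk_count(shape: Sequence[int], chunks: Sequence[int]) -> int:
    """Number of chunks when each dimension is split into blocks of the given size."""
    return math.prod(math.ceil(dim / size) for dim, size in zip(shape, chunks))


def chunk_nbytes(chunks: Sequence[int], itemsize: int = 8) -> int:
    """Size in bytes of one full chunk, assuming `itemsize` bytes per element."""
    return math.prod(chunks) * itemsize


# Hypothetical 2-D array; the repository's real data shape is not documented here.
shape = (10_000, 10_000)
for chunks in [(100, 100), (1_000, 1_000), (5_000, 5_000)]:
    count = chunk_count(shape, chunks)
    size_mb = chunk_nbytes(chunks) / 1e6
    print(f"chunks={chunks}: {count} chunks of about {size_mb:.2f} MB each")
```

Very small chunks multiply the number of tasks and with it the scheduling overhead, while very large chunks limit parallelism and raise per-worker memory use, so the useful range usually lies somewhere in between.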
Contributions are welcome! If you have suggestions or improvements, feel free to open an issue or submit a pull request. Please ensure your changes are well-documented and tested.
This project is licensed under the MIT License - see the LICENSE file for details.