Dask Experiment Repository

Welcome to the Dask Experiment repository! This project aims to explore and demonstrate the benefits (and potential drawbacks) of using Dask for parallel computing in Python, particularly in the context of processing large datasets.

Overview

This repository contains a set of scripts and Docker configurations designed to test and compare the performance of computational tasks when executed using Dask versus standard Python execution. The primary objective is to measure the performance impact of using Dask for parallel processing, especially for tasks involving large datasets.

Project Structure

dask-experiment/
│
├── Dockerfile                    # Docker configuration for running Dask containers
├── docker-compose.yml            # Docker Compose file for managing multiple Dask services
├── main.py                       # Main script for running the experiment
├── no_dask.py                    # Script demonstrating the task without Dask
├── with_dask.py                  # Script demonstrating the task with Dask
└── README.md                     # This README file

Getting Started

Prerequisites

Before you begin, ensure you have the following installed on your machine:

  • Git, to clone the repository
  • Docker
  • Docker Compose

Installation

Clone this repository to your local machine:

git clone https://github.com/yourusername/dask-experiment.git
cd dask-experiment

Usage

Running the Experiment

The experiment can be run using Docker. Follow the steps below:

  1. Build and Start the Docker Containers:

    docker-compose up --build

    This will start the Dask scheduler, worker, and client services.

  2. Access the Dask Client Container:

    Once the containers are running, open a bash shell in the client container. The exact container name can vary with the Compose version; run docker ps to confirm it.

    docker exec -it dask-experiment-client-1 /bin/bash

  3. Navigate to the Project Directory:

    Inside the container, navigate to the /app directory:

    cd /app

  4. Run the Experiment:

    You can run the main experiment script with or without Dask:

    • With Dask:

      python main.py --dask
    • Without Dask:

      python main.py

Switching Between Dask and Non-Dask Execution

The main.py script accepts a --dask command-line flag. With the flag, the task runs through Dask for parallel processing; without it, the same task runs in plain Python. This makes it easy to compare the performance of the two approaches.
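
The flag handling could be sketched as follows. This is a minimal illustration, not the actual contents of main.py: only the --dask flag comes from this README, while the function names (parse_args, run_without_dask, run_with_dask, main) and the sum-of-squares workload are hypothetical stand-ins. The Dask branch uses dask.delayed with Dask's default local scheduler rather than the scheduler container.

```python
import argparse
import time


def parse_args(argv=None):
    """Parse the command line; --dask is the only flag taken from the README."""
    parser = argparse.ArgumentParser(
        description="Run the experiment with or without Dask."
    )
    parser.add_argument(
        "--dask", action="store_true", help="use Dask for parallel processing"
    )
    return parser.parse_args(argv)


def partial_sum(lo, hi):
    """Hypothetical workload unit: sum of squares over [lo, hi)."""
    return sum(i * i for i in range(lo, hi))


def run_without_dask(n):
    """Run the whole workload in a single plain-Python call."""
    return partial_sum(0, n)


def run_with_dask(n, parts=4):
    """Split the same workload into `parts` delayed tasks and combine them."""
    import dask  # imported lazily so the plain path works without Dask

    step = -(-n // parts)  # ceiling division
    tasks = [
        dask.delayed(partial_sum)(lo, min(lo + step, n))
        for lo in range(0, n, step)
    ]
    return dask.delayed(sum)(tasks).compute()


def main(argv=None):
    """Pick the runner from the flag, time it, and report the result."""
    args = parse_args(argv)
    runner = run_with_dask if args.dask else run_without_dask
    start = time.perf_counter()
    result = runner(1_000_000)
    elapsed = time.perf_counter() - start
    mode = "with Dask" if args.dask else "without Dask"
    print(f"{mode}: result={result}, elapsed={elapsed:.2f}s")
    return result
```

Calling main([]) exercises the plain path and main(["--dask"]) the Dask path; in a script, main() would normally be called under an if __name__ == "__main__": guard. Keeping both runners behind one entry point means the timing code is shared, so any difference in elapsed time comes from the execution strategy alone.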

Performance Observations

Based on the initial experiments:

  • Without Dask: The task completed in approximately 14 seconds.
  • With Dask: The task completed in approximately 38 seconds, with a warning about chunk size adjustments.

Key Insights:

  • For smaller tasks or simpler operations, the overhead Dask introduces (task scheduling, chunking, and moving data between processes) can outweigh the gains from parallelism.
  • Dask's true benefits are more likely to be realized with larger datasets or more complex workflows.
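
To put the two timings in proportion, the sketch below shows one way such a comparison could be measured and summarized. The helper names time_call and slowdown are hypothetical and are not part of this repository; the 14 and 38 second figures are the ones reported above.

```python
import time


def time_call(func, *args, repeats=3, **kwargs):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best


def slowdown(baseline_seconds, candidate_seconds):
    """How many times slower the candidate is than the baseline."""
    return candidate_seconds / baseline_seconds


# Figures reported above: about 14 s without Dask, about 38 s with Dask.
ratio = slowdown(14, 38)
print(f"Dask run was about {ratio:.1f}x slower")  # prints: Dask run was about 2.7x slower
```

Taking the best of several repeats reduces noise from container start-up and caching, which matters when the difference being measured is mostly fixed overhead.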

Optimization and Future Work

Future improvements and potential areas for exploration include:

  • Chunk Size Optimization: Experiment with different chunk sizes to improve Dask performance.
  • Task Granularity: Evaluate how breaking down tasks into smaller units affects performance.
  • Resource Allocation: Fine-tune CPU and memory limits for containers to better manage resources.
  • Profiling: Use Dask's built-in profiling tools to gain insights into task execution and identify bottlenecks.
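
As a starting point for the chunk size experiments, the sketch below times a simple reduction over a Dask array for several chunk sizes. It is an illustration under stated assumptions: the workload (summing a random one-dimensional array) and the helper names count_chunks, time_chunked_sum, and sweep are hypothetical, and the experiment's real task may behave differently.

```python
import math
import time


def count_chunks(n_elements, chunk_size):
    """Number of chunks a 1-D array of n_elements is split into."""
    return math.ceil(n_elements / chunk_size)


def time_chunked_sum(n_elements, chunk_size):
    """Time a sum over a random Dask array built with the given chunk size."""
    import dask.array as da  # lazy import: only needed when actually timing

    x = da.random.random((n_elements,), chunks=(chunk_size,))
    start = time.perf_counter()
    x.sum().compute()
    return time.perf_counter() - start


def sweep(n_elements, chunk_sizes):
    """Yield (chunk_size, n_chunks, seconds) for each candidate chunk size."""
    for chunk_size in chunk_sizes:
        seconds = time_chunked_sum(n_elements, chunk_size)
        yield chunk_size, count_chunks(n_elements, chunk_size), seconds
```

For example, iterating over sweep(10_000_000, [10_000, 100_000, 1_000_000]) gives one timing per chunk size. Very small chunks produce many tasks, so scheduling overhead dominates; very large chunks leave workers idle, so the useful range usually lies in between.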

Contributing

Contributions are welcome! If you have suggestions or improvements, feel free to open an issue or submit a pull request. Please ensure your changes are well-documented and tested.

License

This project is licensed under the MIT License - see the LICENSE file for details.