Rust Tensor Libraries Benchmark

Overview

This project benchmarks three Rust tensor libraries (Burn, Candle, and NDArray) across a range of tensor operations and sizes. Each library lives in its own Cargo project to avoid dependency conflicts.

Project Structure

tensor-benchmark/
├── candle-benchmark/     # Candle library benchmarks
├── burn-benchmark/       # Burn library benchmarks
├── ndarray-benchmark/    # NDArray library benchmarks
├── readme.md
└── .gitignore

Libraries Being Tested

Candle (v0.8.4) - Minimalist ML framework for Rust
Burn (v0.17.1) - Pure Rust deep learning framework with NdArray backend
NDArray (v0.16) - Pure Rust n-dimensional arrays (baseline)

How to Run Benchmarks

Prerequisites

Rust 1.70+ installed
Cargo package manager

Running Individual Benchmarks

Each library can be benchmarked independently. Run each pair of commands from the repository root:

# Candle benchmarks
cd candle-benchmark
cargo bench

# Burn benchmarks
cd burn-benchmark
cargo bench

# NDArray benchmarks
cd ndarray-benchmark
cargo bench

Running All Benchmarks

# From the root directory
./scripts/run_all_benchmarks.sh  # If script exists
# Or manually:
cd candle-benchmark && cargo bench && cd ../burn-benchmark && cargo bench && cd ../ndarray-benchmark && cargo bench

What's Being Tested

Current Benchmark Operations

Tensor Creation
- Random tensor generation (various sizes)
- Benchmarks throughput in elements/second
Matrix Multiplication
- Square matrix multiplication (64x64 to 512x512)
- Measures FLOPS (floating point operations per second)
Element-wise Operations
- Addition and multiplication of tensors
- Tests vectorization efficiency
Reduction Operations
- Sum and mean calculations
- Benchmarks aggregation performance
Vector Operations (Burn & NDArray only)
- Dot product on 1D vectors
- Various vector sizes (1K to 1M elements)

Tensor Sizes Tested

Small: 64x64, 128x128, 256x256
Medium: 512x512, 1024x1024
Large: 2048x2048 (matrix multiplication only)
Vectors: 1K, 10K, 100K, 1M elements
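
To relate these sizes to memory footprint, note that a dense f32 tensor takes 4 bytes per element. The helper below is a small illustrative sketch, not part of the benchmark code; the name `f32_tensor_bytes` is hypothetical.

```rust
use std::mem::size_of;

/// Bytes needed to store one dense f32 tensor of the given shape.
/// (Hypothetical helper, for illustration only.)
pub fn f32_tensor_bytes(shape: &[usize]) -> usize {
    let elements: usize = shape.iter().product();
    elements * size_of::<f32>()
}

/// Same figure expressed in MiB, convenient for printing.
pub fn f32_tensor_mib(shape: &[usize]) -> f64 {
    f32_tensor_bytes(shape) as f64 / (1024.0 * 1024.0)
}
```

For example, `f32_tensor_bytes(&[2048, 2048])` is 16,777,216 bytes (16 MiB), so the two operands and the result of the largest matrix multiplication occupy 48 MiB in total.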

Performance Metrics

Throughput: Elements processed per second
FLOPS: Floating point operations per second (for matrix multiplication)
Latency: Mean execution time with standard deviation
Memory efficiency: Not measured directly; only inferred from the throughput measurements
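
The metrics above can all be derived from raw timings. The following is a minimal sketch using only the standard library. It assumes the common convention of counting 2n^3 floating point operations for an n x n matrix product. All function names are hypothetical, and Criterion performs its own, more elaborate statistical analysis that this sketch does not reproduce.

```rust
use std::time::{Duration, Instant};

/// FLOP count for multiplying two n x n matrices:
/// n^3 multiplications plus n^3 additions (the 2*n^3 convention).
pub fn matmul_flops(n: u64) -> u64 {
    2 * n * n * n
}

/// Rate in units per second, e.g. elements/s or FLOP/s.
pub fn per_second(units: u64, elapsed: Duration) -> f64 {
    units as f64 / elapsed.as_secs_f64()
}

/// Mean and sample standard deviation of a set of timings, in seconds.
/// Returns (0.0, 0.0) for an empty slice and a deviation of 0.0 for one sample.
pub fn mean_and_std(samples: &[Duration]) -> (f64, f64) {
    if samples.is_empty() {
        return (0.0, 0.0);
    }
    let n = samples.len() as f64;
    let mean = samples.iter().map(Duration::as_secs_f64).sum::<f64>() / n;
    if samples.len() < 2 {
        return (mean, 0.0);
    }
    let var = samples
        .iter()
        .map(|d| (d.as_secs_f64() - mean).powi(2))
        .sum::<f64>()
        / (n - 1.0);
    (mean, var.sqrt())
}

/// Runs `f` `runs` times and returns one wall-clock duration per run.
pub fn time_runs<F: FnMut()>(runs: usize, mut f: F) -> Vec<Duration> {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed()
        })
        .collect()
}
```

As a usage example, dividing `matmul_flops(512)` by the mean time of a 512x512 multiplication via `per_second` gives a FLOP/s figure that can be compared across libraries.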

Benchmark Results

Results are generated using the Criterion benchmarking framework and include:

HTML Reports: Detailed performance graphs and statistics
Statistical Analysis: Mean, standard deviation, and confidence intervals
Regression Detection: Identifies performance regressions between runs
Comparison Plots: Visual comparison between libraries

Viewing Results

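After running benchmarks, Criterion writes HTML reports under each project's target/criterion/ directory, with the top-level index at target/criterion/report/index.html:

# Example: View Candle benchmark results
# (open is the macOS command; use xdg-open on Linux)
open candle-benchmark/target/criterion/report/index.html

# Or for all projects
open */target/criterion/report/index.html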

Implementation Details

Common Tensor Operations Interface

Each benchmark project implements the same TensorBenchmark trait for its library:

pub trait TensorBenchmark {
    type Tensor;

    fn create_random_tensor(&self, shape: &[usize]) -> Self::Tensor;
    fn add(&self, a: &Self::Tensor, b: &Self::Tensor) -> Self::Tensor;
    fn multiply(&self, a: &Self::Tensor, b: &Self::Tensor) -> Self::Tensor;
    fn matmul(&self, a: &Self::Tensor, b: &Self::Tensor) -> Self::Tensor;
    fn sum(&self, tensor: &Self::Tensor) -> f32;
    fn mean(&self, tensor: &Self::Tensor) -> f32;
    // ... other operations
}
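
To show what an implementation of this trait can look like, here is a minimal sketch that uses only the standard library. It is not taken from any of the three benchmark projects: `NaiveTensor` and `NaiveBackend` are hypothetical names, the elided "other operations" are left out, and the random generator is a simple xorshift32 chosen so the example needs no external crates.

```rust
use std::cell::Cell;

// The trait from this README, without the elided "other operations".
pub trait TensorBenchmark {
    type Tensor;
    fn create_random_tensor(&self, shape: &[usize]) -> Self::Tensor;
    fn add(&self, a: &Self::Tensor, b: &Self::Tensor) -> Self::Tensor;
    fn multiply(&self, a: &Self::Tensor, b: &Self::Tensor) -> Self::Tensor;
    fn matmul(&self, a: &Self::Tensor, b: &Self::Tensor) -> Self::Tensor;
    fn sum(&self, tensor: &Self::Tensor) -> f32;
    fn mean(&self, tensor: &Self::Tensor) -> f32;
}

/// Hypothetical dense row-major tensor, used only in this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct NaiveTensor {
    pub shape: Vec<usize>,
    pub data: Vec<f32>,
}

/// Hypothetical backend. The xorshift32 state sits in a Cell because the
/// trait methods take &self.
pub struct NaiveBackend {
    state: Cell<u32>,
}

impl NaiveBackend {
    pub fn new(seed: u32) -> Self {
        // xorshift must not start from zero.
        Self { state: Cell::new(seed.max(1)) }
    }

    /// Next pseudo-random value in [0, 1).
    fn next_f32(&self) -> f32 {
        let mut x = self.state.get();
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        self.state.set(x);
        (x >> 8) as f32 / (1u32 << 24) as f32
    }

    /// Applies `op` element by element to two tensors of equal shape.
    fn zip_with(a: &NaiveTensor, b: &NaiveTensor, op: impl Fn(f32, f32) -> f32) -> NaiveTensor {
        assert_eq!(a.shape, b.shape, "shapes must match");
        let data = a.data.iter().zip(&b.data).map(|(x, y)| op(*x, *y)).collect();
        NaiveTensor { shape: a.shape.clone(), data }
    }
}

impl TensorBenchmark for NaiveBackend {
    type Tensor = NaiveTensor;

    fn create_random_tensor(&self, shape: &[usize]) -> NaiveTensor {
        let len: usize = shape.iter().product();
        let data = (0..len).map(|_| self.next_f32()).collect();
        NaiveTensor { shape: shape.to_vec(), data }
    }

    fn add(&self, a: &NaiveTensor, b: &NaiveTensor) -> NaiveTensor {
        Self::zip_with(a, b, |x, y| x + y)
    }

    fn multiply(&self, a: &NaiveTensor, b: &NaiveTensor) -> NaiveTensor {
        Self::zip_with(a, b, |x, y| x * y)
    }

    /// 2-D matrix product with an i-p-j loop order.
    fn matmul(&self, a: &NaiveTensor, b: &NaiveTensor) -> NaiveTensor {
        let (m, k, n) = (a.shape[0], a.shape[1], b.shape[1]);
        assert_eq!(k, b.shape[0], "inner dimensions must match");
        let mut out = vec![0.0f32; m * n];
        for i in 0..m {
            for p in 0..k {
                let a_ip = a.data[i * k + p];
                for j in 0..n {
                    out[i * n + j] += a_ip * b.data[p * n + j];
                }
            }
        }
        NaiveTensor { shape: vec![m, n], data: out }
    }

    fn sum(&self, tensor: &NaiveTensor) -> f32 {
        tensor.data.iter().sum()
    }

    fn mean(&self, tensor: &NaiveTensor) -> f32 {
        self.sum(tensor) / tensor.data.len() as f32
    }
}
```

A naive implementation like this is far slower than the libraries under test, but it can serve as a correctness reference: each library's result for a small input can be checked against it before timing anything.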

Hardware Configuration

CPU: Multi-threaded execution (uses all available cores)
Data Type: f32 (32-bit floating point)
Memory: System RAM (no GPU acceleration currently)
Optimization: Release mode with LTO enabled
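
Because results depend on how many cores are available, it can help to record the run context next to the numbers. The sketch below is one possible way to do that with the standard library only. The function names and the output format are hypothetical and not part of the benchmark projects.

```rust
use std::thread;

/// Number of hardware threads the OS reports, falling back to 1 if the
/// value cannot be determined.
pub fn available_cores() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

/// One-line description of the run, suitable for logging with results.
/// `debug_assertions` is normally false in release and bench builds.
pub fn run_context() -> String {
    format!(
        "threads={} dtype=f32 debug_assertions={}",
        available_cores(),
        cfg!(debug_assertions)
    )
}
```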

Future Enhancements

Planned additions for more comprehensive benchmarking:

GPU acceleration support (CUDA, Metal, ROCm)
Additional tensor operations (convolution, pooling, etc.)
Mixed precision benchmarking (f16, bf16)
Memory usage profiling
Batch processing optimizations
Cross-platform performance comparison

MathisWellmann/tensor-ops-benchmarks