openmp-matrix-optimization

Comparison of parallel matrix multiplication methods using OpenMP, focusing on cache efficiency, runtime, and performance analysis with Intel VTune.

Primary language: Python · License: GNU General Public License v3.0 (GPL-3.0)

Matrix Multiplication Optimization Project

A compact yet powerful demonstration of matrix multiplication optimizations using cache blocking, memory alignment, loop unrolling, and multi-threading (OpenMP).

Highlights

  • Naive vs. Optimized
    Compare a simple triple-nested loop (matmul_naive.c) against optimized approaches (cache-blocked, aligned, unrolled).

  • Multi-threading
    All methods support OpenMP for parallel execution and improved CPU utilization.

  • Analysis
    Profiling with Intel VTune plus custom scripts yields metrics on:

    • Execution Time
    • Speedup
    • L1/LLC Cache Miss Rates
    • CPU Utilization
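The naive-vs-optimized contrast can be sketched in a few lines of C. Function names follow the source files, but the signatures, the BLOCK size, and the OpenMP clauses here are assumptions for illustration, not the repo's exact code:

```c
#include <stddef.h>
#include <omp.h>

#define BLOCK 64  /* tile edge; tune so three tiles fit in L1/L2 */

static inline size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Naive: C = A * B, row-major n x n. The k-strided walk over B
 * touches a new cache line on almost every iteration. */
void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
}

/* Blocked: work on BLOCK x BLOCK tiles so each tile stays cache-resident,
 * with the tile grid parallelized across threads via OpenMP.
 * Note: C must be zero-initialized by the caller (the kk loop accumulates). */
void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    #pragma omp parallel for collapse(2) schedule(static)
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t jj = 0; jj < n; jj += BLOCK)
            for (size_t kk = 0; kk < n; kk += BLOCK)
                for (size_t i = ii; i < min_sz(ii + BLOCK, n); i++)
                    for (size_t j = jj; j < min_sz(jj + BLOCK, n); j++) {
                        double sum = C[i*n + j];
                        for (size_t k = kk; k < min_sz(kk + BLOCK, n); k++)
                            sum += A[i*n + k] * B[k*n + j];
                        C[i*n + j] = sum;
                    }
}
```

Because each (ii, jj) tile is owned by exactly one thread, no two threads write the same element of C, so no atomics or reductions are needed.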

Directory Overview

  • src/

    • Core implementations (matmul_naive.c, matmul_blocked.c, etc.)
    • test_matmul.c for validation and performance checks
  • logs/

    • Recorded performance data (cache miss rates, CPU usage)
  • graphs/

    • Plots illustrating key performance metrics (shown below)
  • scripts/

    • Automation and visualization scripts (e.g., cache_analysis_draw.py, compare_threading.py)
  • report/

    • Methodology, results, and conclusions in a concise PDF/Markdown document
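The memory-aligned variant presumably rests on cache-line-aligned buffers. A minimal sketch using C11 `aligned_alloc` (the helper name and the 64-byte choice are illustrative, not taken from the repo's sources):

```c
#include <stdlib.h>

/* Allocate an n x n matrix of doubles aligned to 64 bytes
 * (one cache line on typical x86, and wide enough for AVX-512 loads).
 * C11 aligned_alloc requires the size to be a multiple of the
 * alignment, so round the byte count up first. */
double *alloc_matrix_aligned(size_t n)
{
    size_t bytes = n * n * sizeof(double);
    bytes = (bytes + 63) & ~(size_t)63;  /* round up to a multiple of 64 */
    return aligned_alloc(64, bytes);     /* free() releases it as usual */
}
```

Aligned rows let vectorized loads stay within cache-line boundaries, which is one of the levers behind the miss-rate differences plotted below.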

Detailed Graphs

Below are the generated plots, grouped by metric and matrix size (PNG files in graphs/).

1) CPU Utilization

  • CPU Utilization (Matrix Size = 1024)
  • CPU Utilization (Matrix Size = 2048)
  • CPU Utilization (Matrix Size = 4096)

2) Execution Time

  • Execution Time (Matrix Size = 1024)
  • Execution Time (Matrix Size = 2048)
  • Execution Time (Matrix Size = 4096)

3) L1-dcache Miss Percentage

  • L1 dcache Miss % (Matrix Size = 1024)
  • L1 dcache Miss % (Matrix Size = 2048)
  • L1 dcache Miss % (Matrix Size = 4096)

4) LLC-load Miss Percentage

  • LLC Miss % (Matrix Size = 1024)
  • LLC Miss % (Matrix Size = 2048)
  • LLC Miss % (Matrix Size = 4096)

5) Speedup

  • Speedup (Matrix Size = 1024)
  • Speedup (Matrix Size = 2048)
  • Speedup (Matrix Size = 4096)


Conclusion

By combining cache blocking, memory alignment, loop unrolling, and multi-threading, we significantly reduce cache misses and improve CPU utilization. See logs/ for the raw data, graphs/ for the visual summaries, and report/ for a comprehensive discussion of the results.