A curated list of awesome high performance computing resources.
- El Capitan - 2023, AMD-based, ~1.5 exaflops
- Tianhe-3 - 2022, ~700 petaflops (LINPACK)
- History of Supercomputing (Wikipedia)
- History of Parallel Computing (Wikipedia)
- History of the Top500 (Wikipedia)
- History of LLNL Computing
- The Supermen: The Story of Seymour Cray ... (1997)
- Unmatched - 50 Years of Supercomputing (2023)
- alpaka: The alpaka library is a header-only C++17 abstraction library for accelerator development
- async-rdma: A framework for writing RDMA applications with high-level abstraction and asynchronous APIs
- CAF: An Open Source Implementation of the Actor Model in C++
- Chapel: A Programming Language for Productive Parallel Computing on Large-scale Systems
- Charm++: Parallel Programming with Migratable Objects
- Cilk Plus: C/C++ Extension for Data and Task Parallelism
- Codon: high-performance Python compiler that compiles Python code to native machine code without any runtime overhead
- CUDA: High performance NVIDIA GPU acceleration
- dask: Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love
- DeepSpeed: an easy-to-use deep learning optimization software suite that enables unprecedented scale and speed for deep learning training and inference
- DeterminedAI: Distributed deep learning
- FastFlow: High-performance Parallel Patterns in C++
- Galois: A C++ Library to Ease Parallel Programming with Irregular Parallelism
- Halide: a language for fast, portable computation on images and tensors
- Heteroflow: Concurrent CPU-GPU Task Programming using Modern C++
- highway: performance portable SIMD intrinsics
- HIP: a C++ runtime API and kernel language for AMD/NVIDIA GPUs
- HPC-X: Nvidia implementation of MPI
- HPX: A C++ Standard Library for Concurrency and Parallelism
- Horovod: distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet
- Implicit SPMD Program Compiler (ISPC): An open-source compiler for high-performance SIMD programming on the CPU and GPU
- Intel TBB: Threading Building Blocks
- joblib: data-flow programming for performance (python)
- Kompute: The general purpose GPU compute framework for cross vendor graphics cards (AMD, Qualcomm, NVIDIA & friends)
- Kokkos: A C++ Programming Model for Writing Performance Portable Applications on HPC platforms
- Kubeflow MPI Operator
- Legate: Nvidia framework for drop-in GPU-accelerated replacements for Python libraries (e.g. NumPy), built on Legion
- Legion: Distributed heterogeneous programming library
- MAGMA: next-generation GPU-accelerated linear algebra libraries
- Microsoft MPI
- MOGSLib: User defined schedulers
- mpi4jax: zero-copy mpi for jax arrays
- mpi4py: python bindings for MPI
- MPI: Message passing interface; OpenMPI implementation
- MPI: Message passing interface; MPICH implementation
- MPI Standardization Forum
- MVAPICH: Implementation of MPI
- NCCL: The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and Networking
- NVIDIA cuNumeric: GPU drop-in for numpy
- NVIDIA stdpar: GPU accelerated C++
- numba: Numba is an open source JIT compiler that translates a subset of Python into fast machine code.
- oneAPI: open, cross-industry, standards-based, unified, multiarchitecture, multi-vendor programming model
- OpenACC: "OpenMP for GPUs"
- OpenCilk: MIT continuation of Cilk Plus
- OpenMP: Multi-platform Shared-memory Parallel Programming in C/C++ and Fortran
- PVM: Parallel Virtual Machine: A predecessor to MPI for distributed computing
- PMIX
- Pollux: Message Passing Cloud orchestrator
- Pyfi: distributed flow and computation system
- RAJA: architecture and programming model portability for HPC applications
- RaftLib: A C++ Library for Enabling Stream and Dataflow Parallel Computation
- ray: scale AI and Python workloads — from reinforcement learning to deep learning
- ROCm: first open-source software development platform for HPC/Hyperscale-class GPU computing
- rsmpi: Rust bindings for MPI
- Scalix: data parallel computing framework
- Simgrid: simulate cluster/HPC environments
- SkelCL: A Skeleton Library for Heterogeneous Systems
- STAPL: Standard Template Adaptive Parallel Programming Library in C++
- STLab: High-level Constructs for Implementing Multicore Algorithms with Minimized Contention
- SYCL: C++ Abstraction layer for heterogeneous devices
- Taichi: parallel programming language for high-performance numerical computations (embedded in Python with JIT support)
- Taskflow: A Modern C++ Parallel Task Programming Library
- The Open Community Runtime: Specification for Asynchronous Many Task systems
- Transwarp: A Header-only C++ Library for Task Concurrency
- Tuplex: Blazing fast python data science
- UCX: optimized, production-proven communication framework
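Most of the frameworks above expose some variant of the SPMD/data-parallel pattern: split the work, run the same code on every worker, then combine the partial results. As a minimal, framework-free sketch of that pattern using only the Python standard library (the mpi4py, Chapel, and Charm++ entries above each offer far richer versions of the same idea):

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker runs the same function on its own slice of the data
    # (the SPMD pattern used by MPI, Chapel, Charm++, etc.).
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, workers=4):
    # Split the input into roughly equal strided chunks, one per worker.
    chunks = [data[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.map(partial_sum, chunks)
    # Combine the partial results (a "reduction" in MPI terms).
    return sum(partials)

if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1000))))  # 332833500
```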
- Likwid - performance tool suite: node topology, thread/core affinity, and hardware performance counters
- LIKWID.jl - Julia wrapper for LIKWID
- cpuid
- cpuid instruction note
- cpufetch
- gpufetch
- intel cpuinfo
- openmpi hwloc
- PRK - Parallel Research Kernels
- Flux framework
- Bright Cluster Manager
- E4S - The Extreme Scale HPC Scientific Stack
- RADIUSS - Rapid Application Development via an Institutional Universal Software Stack
- OpenHPC
- Slurm
- SGE
- Portable Batch System & OpenPBS
- Lustre Parallel File System
- GPFS
- Spack package manager for HPC/supercomputers
- Guix package manager for HPC/supercomputers
- Easybuild package manager for HPC/supercomputers
- Lmod
- Ruse
- xCAT
- Warewulf
- Bluebanquise
- OpenXdMod
- LSF
- BeeGFS
- fpsync - fast parallel data transfer using fpart and rsync
- moosefs - distributed file system
- rocks - open-source Linux cluster distribution
- sstack - a tool to install multiple software stacks, such as Spack, EasyBuild, and Conda
- DeepOps - Infrastructure automation tools for Kubernetes and Slurm clusters with NVIDIA GPUs
- OpenOnDemand - Access your organization’s supercomputers through the web to compute from anywhere, on any device.
- XDMoD - open source tool to facilitate the management of high performance computing resources
- Globus Connect - Fast transfer of data/files between supercomputers
- Apptainer (formerly Singularity) - "the docker of HPC"
- Docker
- Kubernetes
- slurm docker cluster
- Vaex - high performance dataframes in python
- HTCondor
- grpc - high performance modern remote procedure call framework
- Charliecloud
- Jacamar-ci
- Prefect
- Apache Airflow
- HPC Rocket - submit slurm jobs in CI
- stui - Slurm dashboard for the terminal
- Slurmvision - Slurm dashboard
- genv - GPU Environment Management
- snakemake - a framework for reproducible data analysis
- ruptime - batch job monitoring
- remora - batch job monitoring
- perun - energy monitor
- Prometheus - Monitoring metrics
- Grafana - Monitoring metrics
- redun - yet another redundant workflow engine
- arbiter2 - monitors and protects interactive nodes with cgroups
- nextflow - Data-driven computational pipelines
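Several of the tools above (Slurm, HPC Rocket, the Slurm dashboards, snakemake/nextflow pipelines) ultimately revolve around batch scripts. A minimal, generic sbatch script might look like the following — the resource values and the commented-out module name are placeholders, not recommendations:

```shell
#!/bin/bash
#SBATCH --job-name=hello-hpc     # job name shown by squeue and dashboards
#SBATCH --nodes=1                # number of nodes (placeholder value)
#SBATCH --ntasks=4               # total tasks/ranks (placeholder value)
#SBATCH --time=00:10:00          # wall-clock limit
#SBATCH --output=%x-%j.out       # stdout file: <job-name>-<job-id>.out

# Load the site's MPI module (the name varies per cluster; placeholder):
# module load openmpi

# Inside a job, Slurm exports SLURM_JOB_ID, SLURM_NTASKS, etc.;
# the defaults below let the script also run outside Slurm for testing.
echo "Job ${SLURM_JOB_ID:-local} running with ${SLURM_NTASKS:-4} tasks"
```

Submit with `sbatch script.sh`; the `#SBATCH` lines are comments to bash but directives to Slurm.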
- Summary of C/C++ debugging tools
- ddt
- totalview
- marmot MPI checker
- python debugging tools
- seer modern gui for gdb
- Summary of profiling tools
- Summary of code performance analysis tools
- papi
- scalasca
- tau
- scalene
- vampir
- kerncraft
- NAS Parallel Benchmarks (NPB)
- The Bandwidth Benchmark
- Google benchmark
- demonspawn
- HPL benchmark
- stress-ng
- IOR
- bytehound memory profiler
- Flamegraphs
- fio
- IBM Spectrum Scale Key Performance Indicators (KPI)
- Hotspot - the Linux perf GUI for performance analysis
- mixbench - benchmarks for CPUs and GPUs
- pmu-tools (toplev) performance tools for modern Intel CPUs
- SPEC CPU Benchmark
- STREAM Memory Bandwidth Benchmark
- Intel MPI benchmarks
- Ohio state MPI benchmarks
- hpctoolkit - performance analysis toolkit
- core-to-core-latency
- speedscope - flamegraph profiler for many languages
- Differential Flamegraphs
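Many of the benchmarks above (STREAM, The Bandwidth Benchmark, mixbench) boil down to timing a memory-bound kernel and dividing bytes moved by elapsed time. A toy, pure-Python illustration of that methodology — real benchmarks use compiled kernels, pinned threads, and careful warm-up, and the buffer size here is an arbitrary choice:

```python
import time

def measure_copy_bandwidth(n_bytes=64 * 1024 * 1024, repeats=5):
    """Time a buffer copy and report the best GB/s over several runs."""
    src = bytearray(n_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst = bytes(src)          # one read + one write of n_bytes
        best = min(best, time.perf_counter() - start)
    # STREAM-style accounting: bytes read + bytes written, over best time.
    return 2 * n_bytes / best / 1e9

if __name__ == "__main__":
    print(f"copy bandwidth: {measure_copy_bandwidth():.1f} GB/s")
```

Taking the best of several repeats (rather than the mean) is the usual convention, since it best approximates the hardware limit with the least interference.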
- petsc
- ginkgo
- GSL
- Scalapack
- rapids.ai - collection of libraries for executing end-to-end data science pipelines completely in the GPU
- trilinos
- tnl project
- mimalloc memory allocator
- jemalloc memory allocator
- tcmalloc memory allocator
- Hoard memory allocator
- Software utilization at UK National Supercomputing Service, ARCHER2
- Ethernet
- Infiniband
- Network topologies
- Battle of the infinibands - Omnipath vs Infiniband
- Mellanox infiniband cluster config
- RoCE - RDMA Over Converged Ethernet
- Slingshot interconnect
- CXL - Compute Express Link
- Infiniband Essentials
- Wikichip
- Microarchitecture of Intel/AMD CPUs
- Apple M1
- Apple M2
- Apple M2 Teardown
- Apple M1/M2 AMX
- Apple M3
- List of Intel processors
- List of Intel micro architectures
- Comparison of Intel processors
- Comparison of Apple processors
- List of AMD processors
- List of AMD CPU micro architectures
- Comparison of AMD architectures
- GPU Architecture Analysis
- A trip through the Graphics Pipeline
- A100 Whitepaper
- MIG
- Gentle Intro to GPU Inner Workings
- AMD Instinct GPUs
- AMD GPU ROCm Support and OS Compatibility
- List of AMD GPUs
- Comparison of CUDA architectures
- Tales of the M1 GPU
- List of Intel GPUs
- Performance of DGX Cluster
- AWS HPC
- Azure HPC
- rescale
- vast.ai
- vultr - cheap bare metal CPU, GPU, DGX servers
- hetzner - cheap servers incl. 80-core ARM
- Ampere ARM cloud-native processors
- Scaleway
- Chameleon Cloud
- Lambda Labs
- Runpod
- The use of Microsoft Azure for high performance cloud computing – A case study
- AWS Cluster in the cloud
- AWS Parallel Cluster
- An Empirical Study of Containerized MPI and GUI Application on HPC in the Cloud
- Supercomputing Conference Student Opportunities
- SCC Student cluster competition
- Winter Classic Invitational
- Linux Cluster Institute
- Supercomputer
- Supercomputer architecture
- Computer cluster
- Comparison of Intel processors
- Comparison of Apple processors
- Comparison of AMD architectures
- Comparison of CUDA architectures
- Cache
- Google TPU
- IPMI
- FRU
- Disk Arrays
- RAID
- Cray
- Jack Dongarra - 2021 Turing Award - LINPACK, BLAS, LAPACK, MPI
- Bill Gropp - 2010 IEEE TCSC Medal for Excellence in Scalable Computing
- David Bader - built the first Linux supercomputer
- Thomas Sterling - Inventor of Beowulf cluster, ParalleX/HPX
- Seymour Cray - Inventor of the Cray Supercomputer
- Larry Smarr - HPC Application Pioneer
- Free Modern HPC Books by Victor Eijkhout
- High Performance Parallel Runtimes
- The OpenMP Common Core: Making OpenMP Simple Again
- Parallel and High Performance Computing
- Algorithms for Modern Hardware
- High Performance Computing: Modern Systems and Practices - Thomas Sterling, Maciej Brodowicz, Matthew Anderson 2017
- Introduction to High Performance Computing for Scientists and Engineers - Hager 2010
- Computer Organization and Design
- Optimizing HPC Applications with Intel Cluster Tools: Hunting Petaflops
- Introduction to High Performance Scientific Computing - Victor Eijkhout 2021
- Parallel Programming for Science and Engineering - Victor Eijkhout 2021
- Parallel Programming for Science and Engineering - HTML Version
- C++ High Performance
- Data Parallel C++ Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL
- High Performance Python
- C++ Concurrency in Action: Practical Multithreading - Anthony Williams 2012
- The Art of Multiprocessor Programming - Maurice Herlihy 2012
- Parallel Computing: Theory and Practice - Umut A. Acar 2016
- Introduction to Parallel Computing - Zbigniew J. Czech
- Practical guide to bare metal C++
- Optimizing software in C++
- Optimizing subroutines in assembly code
- Microarchitecture of Intel/AMD CPUs
- Parallel Programming with MPI
- HPC, Big Data, AI Convergence Towards Exascale: Challenge and Vision
- Introduction to parallel computing - Ananth Grama
- The Student Supercomputer Challenge Guide
- The Rust Performance Book
- E-Zines on Bash, Linux, Perf, etc - Julia Evans
- The Art of Writing Efficient Programs: An Advanced Programmer's Guide to Efficient Hardware Utilization and Compiler Optimizations Using C++ Examples
- OpenMP Examples - openmp.org
- Latest books on OpenMP - openmp.org
- Programming Massively Parallel Processors 4th Edition 2023
- Software Optimization Cookbook
- Power and Performance: Software Analysis and Optimization
- Berkeley: Applications of Parallel Computers - Detailed course on HPC
- CS6290 High-performance Computer Architecture - Milos Prvulovic and Catherine Gamboa at Georgia Tech
- Udacity High Performance Computing
- Parallel Numerical Algorithms
- Vanderbilt - Intro to HPC
- Illinois - Intro to HPC - by the creator of PyCUDA
- Archer1 Courses
- TACC tutorials
- Livermore training materials
- XSEDE training materials
- Parallel Computation Math
- Introduction to High-Performance and Parallel Computing - Coursera
- Foundations of HPC 2020/2021
- Principles of Distributed Computing
- High Performance Visualization
- Temple course on building/maintaining a cluster
- Nvidia Deep Learning Course
- Coursera GPU Programming Specialization
- Coursera Fundamentals of Parallelism on Intel Architecture
- Coursera Introduction to High Performance Computing
- Archer2 Shared Memory Programming with OpenMP
- Archer2 Message-Passing Programming with MPI
- HetSys 2022 Course
- Edukamu Introduction to Supercomputing
- Heterogeneous Parallel Programming by S K
- NCSA HPC Training Moodle
- Supercomputing in Plain English
- Cornell workshop
- Carpentries Incubator HPC Intro
- UL HPC School
- Introduction to High-Performance Parallel Distributed Computing using Chapel, UPC++ and Coarray Fortran
- Performance Engineering of Software Systems (MIT-OCW)
- Introduction to Parallel Computing (CMSC 498X/818X)
- Infiniband Essentials
- MpiTutorial - A fantastic mpi tutorial
- Beginners Guide to HPC
- Rookie HPC Guide
- RedHat High Performance Computing 101
- Parallel Computing Training Tutorials - Lawrence Livermore National Laboratory
- Foundations of Multithreaded, Parallel, and Distributed Programming
- Building pipelines using slurm dependencies
- Writing slurm scripts in python,r and bash
- XSEDE new user tutorials
- Improving Performance with SIMD intrinsics
- Want speed? Pass by value
- Introduction to low level bit hacks
- How to write fast numerical code: An Introduction
- Lecture notes on Loop optimizations
- A practical approach to code optimization
- Software optimization manuals
- Guide into OpenMP: Easy multithreading programming for C++
- An Introduction to the Partitioned Global Address Space (PGAS) Programming Model
- Jax in 2022
- C++ Benchmarking for beginners
- Mapping MPI ranks to multiple cuda GPU
- Oak Ridge National Lab Tutorials
- How to perform large scale data processing in bioinformatics
- Step by step SGEMM in OpenCL
- Frontier User Guide
- Allocating large blocks of memory in bare-metal C programming
- Hashmap benchmarks 2022
- LLNL HPC Tutorials
- High Performance Computing: A Bird's Eye View
- The dirty secret of high performance computing
- Multiple GPUs with pytorch
- Brendan Gregg on Linux Performance
- Automatic Slurm build scripts
- Fastest unordered_map implementation / benchmarks
- Memory bandwidth NapkinMath
- Avoiding Instruction Cache Misses
- Multi-GPU Programming with Standard Parallel C++
- EuroCC National Competence Center Sweden (ENCCS) HPC tutorials
- python.org Python Performance Tips
- HPC toolset tutorial (cluster management)
- OpenMP tutorials
- CUDA best practices guide
- Understanding CPU Architecture And Performance Using LIKWID
- 32 OpenMP Traps For C++ Developers
- The Landscape of Exascale Research: A Data-Driven Literature Analysis (2020)
- The Landscape of Parallel Computing Research: A View from Berkeley
- Extreme Heterogeneity 2018: Productive Computational Science in the Era of Extreme Heterogeneity
- Programming for Exascale Computers - Will Gropp, Marc Snir
- On the Memory Underutilization: Exploring Disaggregated Memory on HPC Systems (2020)
- Advances in Parallel & Distributed Processing, and Applications (conference proceedings)
- Designing Heterogeneous Systems: Large Scale Architectural Exploration Via Simulation
- Reinventing High Performance Computing: Challenges and Opportunities (2022)
- Challenges in Heterogeneous HPC White Paper (2022)
- An Evolutionary Technical & Conceptual Review on High Performance Computing Systems (Dec 2021)
- New Horizons for High-Performance Computing (2022)
- Confidential High-Performance Computing in the Public Cloud
- Containerisation for High Performance Computing Systems: Survey and Prospects
- Heterogeneous Computing Systems (2023)
- Myths and Legends in High-Performance Computing
- Energy-Aware Scheduling for High-Performance Computing Systems: A Survey
- Ultimate Physical limits to computation - Seth Lloyd
- Abstract Machine Models and Proxy Architectures for Exascale Computing, 2014, Sandia National Laboratories and Lawrence Berkeley National Laboratory
- Some thoughts on the environmental impact of High Performance Computing
- A Research Retrospective on AMD's Exascale Computing Journey
- Argonne supercomputer tour
- Containers in HPC - what they fix and what they break
- HPC Tech Shorts
- CppCon
- Create a clustering server
- Argonne national lab
- Oak Ridge National Lab
- Concurrency in C++20 and Beyond - A. Williams
- Is Parallel Programming still Hard? - P. McKenney, M. Michael, and M. Wong at CppCon 2017
- The Speed of Concurrency: Is Lock-free Faster? - Fedor G Pikus in CppCon 2016
- Expressing Parallelism in C++ with Threading Building Blocks - Mike Voss at Intel Webinar 2018
- A Work-stealing Runtime for Rust - Aaron Todd in Air Mozilla 2017
- C++11/14/17 atomics and memory model: Before the story consumes you - Michael Wong in CppCon 2015
- The C++ Memory Model - Valentin Ziegler at C++ Meeting 2014
- Sharcnet HPC
- Low Latency C++ for fun and profit
- scalene Python profiler
- Kokkos lectures
- EasyBuild Tech Talk I - The ABCs of Open MPI, part 1 (by Jeff Squyres & Ralph Castain)
- The Spack 2022 Roadmap
- A Not So Simple Matter of Software | Talk by Turing Award Winner Prof. Jack Dongarra
- Vectorization/SIMD intrinsics
- New Silicon for Supercomputers: A Guide for Software Engineers
- TechTechPotato Channel
- How to write the perfect hash table
- Task based Parallelism and why it's awesome - Pedro Gonnet
- Tuning Slurm Scheduling for Optimal Responsiveness and Utilization
- Parallel Programming Models Overview (2020)
- Comparative Analysis of Kokkos and Sycl (Jeff Hammond)
- Hybrid OpenMP/MPI Programming
- Designs, Lessons and Advice from Building Large Distributed Systems - Jeff Dean (Google)
- Practical Debugging and Performance Engineering
- Resources for learning about HPC networks and storage r/HPC
- Slurm for dummies guide
- Build a cluster under 50k
- Build a Beowulf cluster
- Build a Raspberry Pi Cluster
- Puget Systems
- Lambda Systems
- Titan computers
- Temple course on building/maintaining a cluster
- Detailed reddit discussion on setting up a small cluster
- Tiny titan - build a really cool pi supercomputer
- Building an Intel HPC cluster with OpenHPC
- Reddit r/HPC post on building clusters
- Build a virtual cluster with PelicanHPC
- Building a High-performance Computing Cluster Using FreeBSD
- Supermicro GPU racks
- HPC University Careers search
- HPC wire career site
- HPC certification
- HPC SysAdmin Jobs (reddit)
- The United States Research Software Engineer Association
- NCSA Internship
- AI and Future HPC Job Prospect
- HPC sys admin career (reddit)
- 1024 Cores - Dmitry Vyukov
- The Black Art of Concurrency - Internal Pointers
- Cluster Monkey
- Jonathan Dursi
- Arm Vendor HPC blog
- HPC Notes
- Brendan Gregg Performance Blog
- Performance engineering blog
- Concurrency Freaks
- IEEE Transactions on Parallel and Distributed Systems (TPDS)
- Journal of Parallel and Distributed Computing
- ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP)
- ACM Symposium on Parallel Algorithms and Architectures (SPAA)
- SC conference (SC)
- IEEE International Parallel and Distributed Processing Symposium (IPDPS)
- International Conference on Parallel Processing (ICPP)
- IEEE High Performance Extreme Computing Conference (HPEC)
- PRACE
- XSEDE
- Compute Canada
- RIKEN CCS
- Pawsey
- International Data Corporation
- List of Federally funded research and development centers
- Amdahl's Law
- HPC Wiki
- FLOPS
- Computational complexity of math operations
- Many Task Computing
- High Throughput Computing
- Parallel Virtual Machine
- OSI Model
- Workflow management
- Compute Canada Documentation
- Network Interface Controller (NIC)
- Just in time compilation
- List of distributed computing projects
- Computer cluster
- Quasi-opportunistic supercomputing
- Limits of Computation
- Bremermann's Limit
- Concurrency patterns
- Parallel Computing
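Amdahl's Law, listed above, is worth internalizing numerically: the serial fraction of a program caps its achievable speedup no matter how many processors are added. A small calculator for the standard formula S(n) = 1 / ((1 - p) + p/n), where p is the parallelizable fraction:

```python
def amdahl_speedup(p, n):
    """Speedup on n processors when fraction p of the work parallelizes."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% of the work parallelized, speedup saturates:
# as n grows, S(n) approaches 1 / (1 - p) = 20x.
for n in (8, 64, 1024):
    print(f"{n:>5} processors -> {amdahl_speedup(0.95, n):.2f}x")
```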
- Advanced Parallel Programming in C++
- Tools for scientific computing
- Quantum Computing for High Performance Computing
- Benchmarking data science: Twelve ways to lie with statistics and performance on parallel computers.
- Establishing the IO500 Benchmark
- NVIDIA High Performance Computing articles
- Let's write a superoptimizer
- Why I think C++ is still a desirable coding platform compared to Rust
- The State of Fortran (arxiv paper 2022)
- Build a Beowulf cluster
- libsc - Supercomputing library
- xbyak jit assembler
- cpufetch - pretty cpu info fetcher
- RRZE-HPC
- Argonne Github
- Argonne Leadership Computing Facility
- Oak Ridge National Lab Github
- Compute Canada
- HPCInfo by Jeff Hammond
- Texas Advanced Computing Center (TACC) Github
- LANL HPC Github
- Rust in HPC
- University of Buffalo - Center for Computational Research
- Center for High Performance Computing - University of Utah
- Exascale Project
- Pocket HPC Survival Guide
- HPC Summer school
- Overview of all linear algebra packages
- Latency numbers
- Nvidia HPC benchmarks
- Intel Intrinsics Guide
- AWS Cloud calculator
- Quickly benchmark C++ functions
- LLNL Software repository
- Boinc - volunteer computing projects
- Prace Training Events
- Nice discussion on FlameGraph profiling
- Nice discussion on parts of a supercomputer on reddit
- Technical Report on C++ performance
- BOINC Compute for science
- Count prime numbers using MPI
This repo started from the great curated list at https://github.com/taskflow/awesome-parallel-computing