- Introduction to HPC
- Designing a cluster
- Introduction to HPC Storage
- Parallel Filesystems
- Cluster Stack Basics
- Provisioning
- Configuration Management
- Scheduling and Resource Management
- Introduction to Slurm
- Monitoring HPC systems and infrastructure components
- HPC user support
- High-speed networks
- Account Management
- Lmod
- User software management
- Node Health Check
- Spack
- Using IPMI for out-of-band (OOB) management of servers
- Problems in Scalability
- Process pinning
- Benchmarking
- Developing acceptance tests
- Using compliance testing to verify environments
- Stateless provisioning
- Debugging tools and when to use them
To meet the demands of high performance computing (HPC) researchers, large-scale computational and storage machines require many staff members to design, install, and maintain them. These HPC systems professionals include system engineers, system administrators, network administrators, storage administrators, and operations staff, all of whom face problems that are specific to high performance systems.
The ACM SIGHPC SYSPROS chapter intends to be a platform for discussing the unique challenges that come from supporting large-scale, high performance systems. We speak directly to the state of the practice of standing up and operating high performance systems with an emphasis on solutions that can be implemented by systems staff at other institutions.