Software Stack
Interacting with a supercomputer like Hunter differs from typical computing environments and requires familiarity with batch-oriented workflows and resource scheduling systems. Users generally do not run applications interactively or through graphical interfaces. Instead, computational jobs are submitted via schedulers such as SLURM or PBS, specifying node count, runtime limits, memory requirements, and software modules. These batch jobs are then dispatched across many nodes at once, often in tightly coupled parallel configurations.
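For illustration, a minimal SLURM batch script of this kind might look like the sketch below. The job name, resource values, partition, account, and module names are placeholders rather than Hunter's actual configuration (the modules follow common Cray Programming Environment naming conventions), and PBS expresses the same requests with #PBS directives.

    #!/bin/bash
    #SBATCH --job-name=cfd_run           # name shown in the queue (placeholder)
    #SBATCH --nodes=4                    # number of compute nodes requested
    #SBATCH --ntasks-per-node=24         # MPI ranks to start on each node
    #SBATCH --mem=128G                   # memory per node
    #SBATCH --time=02:00:00              # wall-clock limit (hh:mm:ss)
    #SBATCH --partition=normal           # placeholder partition name
    #SBATCH --account=project01          # placeholder project account

    module load PrgEnv-cray              # Cray compilers and wrappers
    module load cray-mpich               # MPI library from the programming environment

    srun ./solver input.cfg              # launch the application on all allocated ranks

The script is submitted with sbatch and waits in the queue until the scheduler can allocate the requested resources.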
Hunter runs a custom Linux distribution tuned specifically for HPC. The operating system kernel is configured for NUMA-aware scheduling, low-latency interconnect support, and secure multi-user job execution. Applications are developed primarily in languages like Fortran, C++, and Python, using parallelization models such as MPI, OpenMP, and HIP. Commercial software is uncommon on systems like Hunter due to licensing constraints and poor scalability across many-core nodes. Instead, the majority of workloads rely on domain-specific, custom-developed research code.
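As a sketch of the hybrid programming style these models enable, the following C++ fragment combines MPI for communication between processes with OpenMP for threading within a node; offload to the GPU compute units of the MI300A would additionally use HIP kernels or OpenMP target directives, which are omitted here for brevity. The program is generic and not tied to any particular Hunter workload.

    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);                    // one MPI process per rank, several ranks per node
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Each rank computes a partial sum over its share of the index range,
        // using OpenMP threads within the node.
        const long n = 1'000'000;
        double local = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (long i = rank; i < n; i += size)
            local += 1.0 / (i + 1);

        // Combine the per-rank results across the whole job.
        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) std::printf("sum = %f across %d ranks\n", global, size);
        MPI_Finalize();
        return 0;
    }

A program like this would be compiled with the system's MPI-aware compiler wrapper and launched with srun from a batch script like the one shown above.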
The software environment includes the HPE Cray Supercomputing Programming Environment (CPE)—a suite of compilers, debuggers, libraries, and tuning tools designed for scaling applications efficiently across heterogeneous architectures. Users can perform in-depth profiling, vectorization analysis, and memory access optimization using Cray’s integrated toolchain, in conjunction with AMD ROCm tools and third-party utilities. Performance tuning is not optional: even marginal inefficiencies can translate to thousands of wasted core-hours at scale.
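To make the kind of finding these tools report more concrete, the sketch below contrasts a strided, cache-unfriendly traversal of a row-major C++ array with the unit-stride version that a vectorization or memory-access report would steer a developer toward. It is a generic illustration rather than code or output from any Cray tool, and the names are hypothetical.

    #include <vector>
    #include <cstddef>

    // Row-major 2D array stored in a flat std::vector.
    struct Matrix {
        std::size_t rows, cols;
        std::vector<double> data;
        Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
        double& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    };

    // Strided access: consecutive inner iterations are 'cols' elements apart,
    // which wastes cache lines and blocks vectorization of the inner loop.
    void scale_column_wise(Matrix& m, double s) {
        for (std::size_t j = 0; j < m.cols; ++j)
            for (std::size_t i = 0; i < m.rows; ++i)
                m.at(i, j) *= s;
    }

    // Unit-stride access: the inner loop walks contiguous memory, so each
    // cache line is fully used and the compiler can vectorize the loop.
    void scale_row_wise(Matrix& m, double s) {
        for (std::size_t i = 0; i < m.rows; ++i)
            for (std::size_t j = 0; j < m.cols; ++j)
                m.at(i, j) *= s;
    }

    int main() {
        Matrix m(4096, 4096);
        scale_column_wise(m, 2.0);   // the slow pattern a profile would flag
        scale_row_wise(m, 0.5);      // the rewrite it would suggest
        return 0;
    }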
To manage the system, HLRS uses HPE Performance Cluster Manager (PCM), a high-resolution monitoring and orchestration framework. PCM provides real-time insights into system health, thermal status, workload distribution, and power usage across nodes. It also supports dynamic job scheduling, failure detection, and workload migration to maintain high availability and resource utilization.
A key innovation in Hunter is the integration of dynamic power capping, a software-based energy management feature developed by HPE in close collaboration with HLRS. This system continuously analyzes the power profiles of running applications and dynamically redistributes power budgets across nodes and jobs so that total consumption stays within a global ceiling. The aim is to maximize throughput per watt without violating system-wide power limits.
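Conceptually, the redistribution can be pictured as scaling per-node power budgets back whenever their sum would exceed the global ceiling. The sketch below illustrates only that invariant; it is not HPE's actual algorithm, and the data structure and function names are hypothetical.

    #include <vector>
    #include <numeric>
    #include <cstdio>

    // Hypothetical per-node power request derived from recent telemetry, in watts.
    struct NodePower {
        int node_id;
        double requested_watts;
        double capped_watts;
    };

    // Conceptual sketch: scale every node's budget so the total never exceeds
    // the global ceiling, leaving requests untouched when there is headroom.
    void apply_global_cap(std::vector<NodePower>& nodes, double ceiling_watts) {
        double total = std::accumulate(nodes.begin(), nodes.end(), 0.0,
            [](double acc, const NodePower& n) { return acc + n.requested_watts; });
        double scale = (total > ceiling_watts) ? ceiling_watts / total : 1.0;
        for (auto& n : nodes)
            n.capped_watts = n.requested_watts * scale;
    }

    int main() {
        std::vector<NodePower> nodes = {{0, 520.0, 0.0}, {1, 480.0, 0.0}, {2, 610.0, 0.0}};
        apply_global_cap(nodes, 1500.0);   // illustrative 1.5 kW ceiling for three nodes
        for (const auto& n : nodes)
            std::printf("node %d: request %.0f W -> cap %.0f W\n",
                        n.node_id, n.requested_watts, n.capped_watts);
        return 0;
    }

A production controller would of course weight jobs by priority and react to telemetry continuously, but the invariant is the same: the per-node caps always sum to no more than the global budget.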
This approach was previously piloted on HLRS’s Hawk system, where it yielded approximately 20% energy savings across workloads, with minimal performance degradation. On Hunter, dynamic power capping will be implemented at a deeper level, taking advantage of tighter CPU-GPU coupling and more granular telemetry from the Instinct MI300A APUs. The result is a system that not only delivers high sustained performance, but does so with a strong focus on energy efficiency—a critical metric as HPC approaches exascale and environmental constraints tighten.