High-Performance Computing
This section contains the specifications for Cobre’s hybrid MPI+OpenMP parallelization strategy. Where the Architecture section defines how the solver is structured at the software level, the HPC specs define how that architecture maps onto distributed and shared-memory hardware: the process topology that places one MPI rank per NUMA domain, the OpenMP threading layer that parallelizes LP solves within each rank, the communication patterns that synchronize cuts and bounds across ranks, the memory layout that eliminates false sharing and minimizes NUMA penalties, and the deployment recipes that translate these design choices into SLURM job scripts.
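To make the rank/thread layout concrete, the sketch below computes the topology described above: one MPI rank per NUMA domain, with each rank's OpenMP team spanning that domain's cores. This is an illustrative model only — the class name, fields, and example machine shape (4 dual-socket nodes, 24 cores per socket) are assumptions for the example, not values from Cobre's configuration.

```python
# Hypothetical sketch of the hybrid layout: one MPI rank per NUMA domain,
# OpenMP threads filling the cores of that domain. Illustrative only.
from dataclasses import dataclass

@dataclass
class Topology:
    nodes: int
    numa_domains_per_node: int
    cores_per_numa_domain: int

    @property
    def mpi_ranks(self) -> int:
        # one rank per NUMA domain across the whole allocation
        return self.nodes * self.numa_domains_per_node

    @property
    def omp_threads_per_rank(self) -> int:
        # each rank's OpenMP team spans exactly its NUMA domain
        return self.cores_per_numa_domain

# e.g. 4 dual-socket nodes with 24 cores per socket
topo = Topology(nodes=4, numa_domains_per_node=2, cores_per_numa_domain=24)
print(topo.mpi_ranks, topo.omp_threads_per_rank)  # 8 ranks, 24 threads each
```

In a SLURM script this layout would typically be expressed as `--ntasks-per-node` equal to the number of NUMA domains and `--cpus-per-task` equal to the cores per domain; the SLURM Deployment spec gives the actual recipes.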
Together, these 8 specs fully describe the parallel execution model of the Cobre solver. They are written at the behavioral level – specifying distribution strategies, synchronization contracts, memory placement policies, and communication protocols – without prescribing MPI implementation details or OpenMP pragma syntax. A developer implementing the HPC layer should be able to read these specs alongside the referenced architecture and math specs and produce a correct parallel implementation without ambiguity. Where performance-critical design decisions exist (e.g., static contiguous block distribution over dynamic dispatch, sequential opening evaluation over parallel, thread-trajectory affinity), each spec documents the rationale and the constraints that drove the choice.
The specs assume familiarity with the SDDP training loop as described in Training Loop, the stage LP structure from LP Formulation, and the solver workspace infrastructure from Solver Workspaces. Readers new to the parallel execution model should start with Hybrid Parallelism, which establishes the two-level ferrompi + OpenMP architecture, before diving into the individual specs.
Reading Order
The specs have cross-references, so reading order matters. The following sequence builds concepts from the parallelization model outward to deployment:
- Work Distribution – How forward pass scenarios and backward pass trial points are distributed across MPI ranks and OpenMP threads: static contiguous block assignment, thread-trajectory affinity, and the load balancing strategy.
- Hybrid Parallelism – The two-level parallelization model: ferrompi as the backbone for process-level parallelism and shared memory, OpenMP via C FFI for intra-rank threading, design rationale, configuration, initialization sequence, and build integration.
- Communication Patterns – MPI communication patterns used during the SDDP training loop: `MPI_Allreduce` for bound aggregation, `MPI_Allgatherv` for cut synchronization, and shared memory windows for intra-node data sharing.
- Memory Architecture – Memory layout and NUMA-aware allocation: per-rank memory budget, shared memory region sizing, first-touch initialization, and false sharing avoidance.
- Shared Memory Aggregation – Hierarchical cut aggregation within MPI ranks sharing a physical node: node-local aggregation before inter-node communication to reduce message volume.
- Synchronization – Synchronization barriers and coordination points in the SDDP training loop: per-stage backward pass barriers, forward pass completion, and thread synchronization within ranks.
- Checkpointing – Checkpoint and restart for fault tolerance: what state is saved, when checkpoints are taken, how the solver resumes from a checkpoint across MPI ranks.
- SLURM Deployment – SLURM job submission and configuration: job scripts for single-node and multi-node runs, resource allocation, environment variable setup, and performance monitoring.
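The static contiguous block assignment and thread-trajectory affinity mentioned under Work Distribution can be sketched as follows. This is a minimal illustration under assumed semantics (contiguous slices differing by at most one, and a round-robin stride within each rank's block), not Cobre's implementation; see the Work Distribution spec for the actual contract.

```python
# Sketch (assumed, not Cobre's code) of static contiguous block assignment:
# rank r owns a contiguous slice of scenario indices, with the remainder
# spread over the first ranks so slice sizes differ by at most one.
def scenario_block(n_scenarios: int, n_ranks: int, rank: int) -> range:
    base, rem = divmod(n_scenarios, n_ranks)
    start = rank * base + min(rank, rem)
    size = base + (1 if rank < rem else 0)
    return range(start, start + size)

# Thread-trajectory affinity: within a rank, each trajectory stays pinned to
# one thread across all stages, here modeled as a stride over the local block.
def thread_trajectories(block: range, n_threads: int, thread: int) -> list:
    return list(block)[thread::n_threads]

# 10 scenarios over 3 ranks: [0..3], [4..6], [7..9]
blocks = [list(scenario_block(10, 3, r)) for r in range(3)]
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

The static assignment is computed identically on every rank from `(n_scenarios, n_ranks)` alone, so no coordination message is needed to establish ownership — one of the properties that motivates static blocks over dynamic dispatch.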
Spec Index
| Spec | Description | Architecture Reference |
|---|---|---|
| Work Distribution | Static block distribution, thread-trajectory affinity, load balancing | Training Loop, SDDP Algorithm |
| Hybrid Parallelism | ferrompi + OpenMP architecture, design rationale, configuration, initialization | Training Loop, Solver Workspaces |
| Communication Patterns | MPI collectives, shared memory windows, cut synchronization protocols | Cut Management Implementation, Convergence Monitoring |
| Memory Architecture | NUMA-aware allocation, memory budget, shared region sizing, false sharing avoidance | Solver Workspaces |
| Shared Memory Aggregation | Node-local cut aggregation, hierarchical reduction, shared memory scenarios | Cut Management Implementation |
| Synchronization | Per-stage barriers, forward pass completion, thread coordination | Training Loop |
| Checkpointing | Checkpoint/restart, fault tolerance, state serialization across ranks | Cut Management Implementation, CLI and Lifecycle |
| SLURM Deployment | Job scripts, resource allocation, environment setup, multi-node deployment | CLI and Lifecycle |
Conventions
All specs in this section describe parallel execution behavior and resource placement rather than sequential algorithmic logic. Where a spec references mathematical quantities (cut coefficients, dual variables, trial points), it uses the notation from Notation Conventions and links to the relevant math spec for the full derivation. Where a spec references architectural components (training loop, solver workspaces, cut management), it links to the relevant architecture spec for the behavioral contract. The HPC specs complement rather than duplicate the architecture specs – the architecture specs define what happens at each algorithmic step, while the HPC specs define how that step is distributed, synchronized, and placed in memory.