High-Performance Computing

This section contains the specifications for Cobre’s hybrid MPI+OpenMP parallelization strategy. Where the Architecture section defines how the solver is structured at the software level, the HPC specs define how that architecture maps onto distributed and shared-memory hardware: the process topology that places one MPI rank per NUMA domain, the OpenMP threading layer that parallelizes LP solves within each rank, the communication patterns that synchronize cuts and bounds across ranks, the memory layout that eliminates false sharing and minimizes NUMA penalties, and the deployment recipes that translate these design choices into SLURM job scripts.

Together, these 8 specs fully describe the parallel execution model of the Cobre solver. They are written at the behavioral level – specifying distribution strategies, synchronization contracts, memory placement policies, and communication protocols – without prescribing MPI implementation details or OpenMP pragma syntax. A developer implementing the HPC layer should be able to read these specs alongside the referenced architecture and math specs and produce a correct parallel implementation without ambiguity. Where performance-critical design decisions exist (e.g., static contiguous block distribution over dynamic dispatch, sequential opening evaluation over parallel, thread-trajectory affinity), each spec documents the rationale and the constraints that drove the choice.

The specs assume familiarity with the SDDP training loop as described in Training Loop, the stage LP structure from LP Formulation, and the solver workspace infrastructure from Solver Workspaces. Readers new to the parallel execution model should start with Hybrid Parallelism, which establishes the two-level ferrompi+OpenMP architecture, before diving into the individual specs.

Reading Order

The specs have cross-references, so reading order matters. The following sequence builds concepts from the parallelization model outward to deployment:

  1. Hybrid Parallelism – The two-level parallelization model: ferrompi as the backbone for process-level parallelism and shared memory, OpenMP via C FFI for intra-rank threading, design rationale, configuration, initialization sequence, and build integration.
  2. Work Distribution – How forward pass scenarios and backward pass trial points are distributed across MPI ranks and OpenMP threads: static contiguous block assignment, thread-trajectory affinity, and the load balancing strategy.
  3. Communication Patterns – MPI communication patterns used during the SDDP training loop: MPI_Allreduce for bound aggregation, MPI_Allgatherv for cut synchronization, and shared memory windows for intra-node data sharing.
  4. Memory Architecture – Memory layout and NUMA-aware allocation: per-rank memory budget, shared memory region sizing, first-touch initialization, and false sharing avoidance.
  5. Shared Memory Aggregation – Hierarchical cut aggregation within MPI ranks sharing a physical node: node-local aggregation before inter-node communication to reduce message volume.
  6. Synchronization – Synchronization barriers and coordination points in the SDDP training loop: per-stage backward pass barriers, forward pass completion, and thread synchronization within ranks.
  7. Checkpointing – Checkpoint and restart for fault tolerance: what state is saved, when checkpoints are taken, how the solver resumes from a checkpoint across MPI ranks.
  8. SLURM Deployment – SLURM job submission and configuration: job scripts for single-node and multi-node runs, resource allocation, environment variable setup, and performance monitoring.

Spec Index

| Spec | Description | Architecture Reference |
|---|---|---|
| Work Distribution | Static block distribution, thread-trajectory affinity, load balancing | Training Loop, SDDP Algorithm |
| Hybrid Parallelism | ferrompi + OpenMP architecture, design rationale, configuration, initialization | Training Loop, Solver Workspaces |
| Communication Patterns | MPI collectives, shared memory windows, cut synchronization protocols | Cut Management Implementation, Convergence Monitoring |
| Memory Architecture | NUMA-aware allocation, memory budget, shared region sizing, false sharing avoidance | Solver Workspaces |
| Shared Memory Aggregation | Node-local cut aggregation, hierarchical reduction, shared memory scenarios | Cut Management Implementation |
| Synchronization | Per-stage barriers, forward pass completion, thread coordination | Training Loop |
| Checkpointing | Checkpoint/restart, fault tolerance, state serialization across ranks | Cut Management Implementation, CLI and Lifecycle |
| SLURM Deployment | Job scripts, resource allocation, environment setup, multi-node deployment | CLI and Lifecycle |

Conventions

All specs in this section describe parallel execution behavior and resource placement rather than sequential algorithmic logic. Where a spec references mathematical quantities (cut coefficients, dual variables, trial points), it uses the notation from Notation Conventions and links to the relevant math spec for the full derivation. Where a spec references architectural components (training loop, solver workspaces, cut management), it links to the relevant architecture spec for the behavioral contract. The HPC specs complement rather than duplicate the architecture specs – the architecture specs define what happens at each algorithmic step, while the HPC specs define how that step is distributed, synchronized, and placed in memory.