High-Performance Computing
This section contains the specifications for Cobre’s hybrid MPI+OpenMP parallelization strategy. Where the Architecture section defines how the solver is structured at the software level, the HPC specs define how that architecture maps onto distributed and shared-memory hardware: the process topology that places one MPI rank per NUMA domain, the OpenMP threading layer that parallelizes LP solves within each rank, the communication patterns that synchronize cuts and bounds across ranks, the memory layout that eliminates false sharing and minimizes NUMA penalties, and the deployment recipes that translate these design choices into SLURM job scripts.
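To make the rank/thread layout concrete, the sketch below computes the topology described above: one MPI rank per NUMA domain, with each rank's OpenMP team spanning that domain's cores. This is an illustrative model only — the class name, fields, and example machine shape (4 dual-socket nodes, 24 cores per socket) are assumptions for the example, not values from Cobre's configuration.

```python
# Hypothetical sketch of the hybrid layout: one MPI rank per NUMA domain,
# OpenMP threads filling the cores of that domain. Illustrative only.
from dataclasses import dataclass

@dataclass
class Topology:
    nodes: int
    numa_domains_per_node: int
    cores_per_numa_domain: int

    @property
    def mpi_ranks(self) -> int:
        # one rank per NUMA domain across the whole allocation
        return self.nodes * self.numa_domains_per_node

    @property
    def omp_threads_per_rank(self) -> int:
        # each rank's OpenMP team spans exactly its NUMA domain
        return self.cores_per_numa_domain

# e.g. 4 dual-socket nodes with 24 cores per socket
topo = Topology(nodes=4, numa_domains_per_node=2, cores_per_numa_domain=24)
print(topo.mpi_ranks, topo.omp_threads_per_rank)  # 8 ranks, 24 threads each
```

In a SLURM script this layout would typically be expressed as `--ntasks-per-node` equal to the number of NUMA domains and `--cpus-per-task` equal to the cores per domain; the SLURM Deployment spec gives the actual recipes.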
Together, these 8 specs fully describe the parallel execution model of the Cobre solver. They are written at the behavioral level – specifying distribution strategies, synchronization contracts, memory placement policies, and communication protocols – without prescribing MPI implementation details or OpenMP pragma syntax. A developer implementing the HPC layer should be able to read these specs alongside the referenced architecture and math specs and produce a correct parallel implementation without ambiguity. Where performance-critical design decisions exist (e.g., static contiguous block distribution over dynamic dispatch, sequential opening evaluation over parallel, thread-trajectory affinity), each spec documents the rationale and the constraints that drove the choice.
The specs assume familiarity with the SDDP training loop as described in Training Loop, the stage LP structure from LP Formulation, and the solver workspace infrastructure from Solver Workspaces. Readers new to the parallel execution model should start with Hybrid Parallelism, which establishes the two-level ferrompi + OpenMP architecture, before diving into the individual specs.
Reading Order
The specs have cross-references, so reading order matters. The following sequence builds concepts from the parallelization model outward to deployment:
- Work Distribution – How forward pass scenarios and backward pass trial points are distributed across MPI ranks and OpenMP threads: static contiguous block assignment, thread-trajectory affinity, and the load balancing strategy.
- Hybrid Parallelism – The two-level parallelization model: ferrompi as the backbone for process-level parallelism and shared memory, OpenMP via C FFI for intra-rank threading, design rationale, configuration, initialization sequence, and build integration.
- Communication Patterns – MPI communication patterns used during the SDDP training loop: `MPI_Allreduce` for bound aggregation, `MPI_Allgatherv` for cut synchronization, and shared memory windows for intra-node data sharing.
- Memory Architecture – Memory layout and NUMA-aware allocation: per-rank memory budget, shared memory region sizing, first-touch initialization, and false sharing avoidance.
- Shared Memory Aggregation – Hierarchical cut aggregation within MPI ranks sharing a physical node: node-local aggregation before inter-node communication to reduce message volume.
- Synchronization – Synchronization barriers and coordination points in the SDDP training loop: per-stage backward pass barriers, forward pass completion, and thread synchronization within ranks.
- Checkpointing – Checkpoint and restart for fault tolerance: what state is saved, when checkpoints are taken, how the solver resumes from a checkpoint across MPI ranks.
- SLURM Deployment – SLURM job submission and configuration: job scripts for single-node and multi-node runs, resource allocation, environment variable setup, and performance monitoring.
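The static contiguous block assignment and thread-trajectory affinity mentioned under Work Distribution can be sketched as follows. This is a minimal illustration under assumed semantics (contiguous slices differing by at most one, and a round-robin stride within each rank's block), not Cobre's implementation; see the Work Distribution spec for the actual contract.

```python
# Sketch (assumed, not Cobre's code) of static contiguous block assignment:
# rank r owns a contiguous slice of scenario indices, with the remainder
# spread over the first ranks so slice sizes differ by at most one.
def scenario_block(n_scenarios: int, n_ranks: int, rank: int) -> range:
    base, rem = divmod(n_scenarios, n_ranks)
    start = rank * base + min(rank, rem)
    size = base + (1 if rank < rem else 0)
    return range(start, start + size)

# Thread-trajectory affinity: within a rank, each trajectory stays pinned to
# one thread across all stages, here modeled as a stride over the local block.
def thread_trajectories(block: range, n_threads: int, thread: int) -> list:
    return list(block)[thread::n_threads]

# 10 scenarios over 3 ranks: [0..3], [4..6], [7..9]
blocks = [list(scenario_block(10, 3, r)) for r in range(3)]
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

The static assignment is computed identically on every rank from `(n_scenarios, n_ranks)` alone, so no coordination message is needed to establish ownership — one of the properties that motivates static blocks over dynamic dispatch.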
Spec Index
| Spec | Description | Architecture Reference |
|---|---|---|
| Work Distribution | Static block distribution, thread-trajectory affinity, load balancing | Training Loop, SDDP Algorithm |
| Hybrid Parallelism | ferrompi + OpenMP architecture, design rationale, configuration, initialization | Training Loop, Solver Workspaces |
| Communication Patterns | MPI collectives, shared memory windows, cut synchronization protocols | Cut Management Implementation, Convergence Monitoring |
| Memory Architecture | NUMA-aware allocation, memory budget, shared region sizing, false sharing avoidance | Solver Workspaces |
| Shared Memory Aggregation | Node-local cut aggregation, hierarchical reduction, shared memory scenarios | Cut Management Implementation |
| Synchronization | Per-stage barriers, forward pass completion, thread coordination | Training Loop |
| Checkpointing | Checkpoint/restart, fault tolerance, state serialization across ranks | Cut Management Implementation, CLI and Lifecycle |
| SLURM Deployment | Job scripts, resource allocation, environment setup, multi-node deployment | CLI and Lifecycle |
Conventions
All specs in this section describe parallel execution behavior and resource placement rather than sequential algorithmic logic. Where a spec references mathematical quantities (cut coefficients, dual variables, trial points), it uses the notation from Notation Conventions and links to the relevant math spec for the full derivation. Where a spec references architectural components (training loop, solver workspaces, cut management), it links to the relevant architecture spec for the behavioral contract. The HPC specs complement rather than duplicate the architecture specs – the architecture specs define what happens at each algorithmic step, while the HPC specs define how that step is distributed, synchronized, and placed in memory.