§atomr-accel-cuda
GPU acceleration via the actor model. Wraps NVIDIA CUDA libraries as
actors on top of atomr. See README.md and the
architecture document under docs/ for the full design.
§Foundation Phase F1 (current)
- Two-tier supervision: `device::DeviceActor` (stable address) ↔ `device::ContextActor` (owns `Arc<CudaContext>`, restartable).
- `gpu_ref::GpuRef` with generation-token validity checks.
- `dispatcher::GpuDispatcher` pinning actor execution to a single OS thread.
- `completion::HostFnCompletion` for sub-microsecond stream completion via `cuLaunchHostFunc`.
- `stream::PerActorAllocator` as the default §5.7 strategy.
- `kernel::BlasActor` performing cuBLAS SGEMM as the canonical demo.
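The generation-token idea behind `GpuRef` can be illustrated with a minimal, std-only sketch. It is not the crate's implementation: `ContextGeneration`, `Handle`, and `StaleHandle` are hypothetical names invented for this example, and the sketch assumes that a context restart simply bumps a shared counter so that handles minted under an older generation fail validation.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the restartable inner tier: a counter that is
/// bumped every time the underlying context is rebuilt.
#[derive(Default)]
pub struct ContextGeneration(AtomicU64);

impl ContextGeneration {
    /// Generation of the context that is live right now.
    pub fn current(&self) -> u64 {
        self.0.load(Ordering::Acquire)
    }

    /// Simulate a context restart; returns the new generation.
    pub fn restart(&self) -> u64 {
        self.0.fetch_add(1, Ordering::AcqRel) + 1
    }
}

/// Error returned when a handle outlives the context it was created under.
#[derive(Debug, PartialEq, Eq)]
pub struct StaleHandle {
    pub handle_generation: u64,
    pub current_generation: u64,
}

/// Hypothetical buffer handle: cheap to clone and send in messages, but only
/// usable while its recorded generation matches the live one.
#[derive(Clone)]
pub struct Handle {
    generation: u64,
    source: Arc<ContextGeneration>,
    pub len: usize,
}

impl Handle {
    /// Mint a handle under the generation that is current at creation time.
    pub fn new(source: &Arc<ContextGeneration>, len: usize) -> Self {
        Handle {
            generation: source.current(),
            source: Arc::clone(source),
            len,
        }
    }

    /// Check the token before touching the buffer the handle refers to.
    pub fn validate(&self) -> Result<(), StaleHandle> {
        let current = self.source.current();
        if current == self.generation {
            Ok(())
        } else {
            Err(StaleHandle {
                handle_generation: self.generation,
                current_generation: current,
            })
        }
    }
}
```

In this sketch a handle created before `restart()` reports `StaleHandle` afterwards, while a handle created after the restart validates normally. The outer tier can therefore keep a stable address even though everything owned by the inner tier has been replaced.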
Phases F2–F5 (cuDNN, cuFFT, NCCL, TensorRT, the PythonGpuBridge)
and the four blueprint sub-crates are deferred.
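Pinning actor execution to a single OS thread, as the F1 list describes for `GpuDispatcher`, can be sketched with the standard library alone. The `PinnedExecutor` type below is hypothetical and involves no CUDA. It only shows the general shape assumed here: one long-lived worker thread drains a job queue, so any thread-affine state set up on that thread stays valid for every job.

```rust
use std::sync::mpsc;
use std::thread::{self, JoinHandle, ThreadId};

/// A unit of work shipped to the pinned thread.
type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical executor that runs every job on the same OS thread.
pub struct PinnedExecutor {
    sender: Option<mpsc::Sender<Job>>,
    worker: Option<JoinHandle<()>>,
}

impl PinnedExecutor {
    pub fn new() -> Self {
        let (sender, receiver) = mpsc::channel::<Job>();
        let worker = thread::spawn(move || {
            // A GPU dispatcher would bind its thread-affine state here, once.
            // The loop ends when every sender has been dropped.
            for job in receiver {
                job();
            }
        });
        PinnedExecutor {
            sender: Some(sender),
            worker: Some(worker),
        }
    }

    /// Run `f` on the pinned thread and block until its result is available.
    pub fn run<T, F>(&self, f: F) -> T
    where
        F: FnOnce() -> T + Send + 'static,
        T: Send + 'static,
    {
        let (tx, rx) = mpsc::channel();
        let job: Job = Box::new(move || {
            // The caller may have gone away; ignoring the send error is fine.
            let _ = tx.send(f());
        });
        self.sender
            .as_ref()
            .expect("executor already shut down")
            .send(job)
            .expect("worker thread exited");
        rx.recv().expect("worker dropped the job")
    }

    /// Identity of the pinned thread, observed from inside a job.
    pub fn thread_id(&self) -> ThreadId {
        self.run(|| thread::current().id())
    }
}

impl Drop for PinnedExecutor {
    fn drop(&mut self) {
        // Closing the channel terminates the worker loop; then join it.
        drop(self.sender.take());
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}
```

Every call to `run` observes the same `ThreadId`, which differs from the caller's. A blocking `run` keeps the sketch short; an actor runtime would more plausibly deliver results as messages instead of blocking the sender.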
Modules§
- completion
- Completion strategies (§5.10).
- device
- `DeviceActor` (outer tier) + `ContextActor` (inner tier) (§5.11).
- dispatcher
- `GpuDispatcher` (§5.1): pinned single-thread runtime that ensures the actor’s CUDA context stays current on the same OS thread for the actor’s whole lifetime.
- dtype
- `CudaDtype`: CUDA-side dtype mappings and capability markers.
- error
- Error taxonomy and the supervisor decider for context-poisoning recovery (§5.3, §5.11 of the architecture document).
- event
- `EventActor`: typed actor surface around `CudaEvent`.
- gpu_ref
- `GpuRef<T>`: opaque, message-friendly handle to a GPU buffer (§5.8).
- graph
- `GraphActor`: record a CUDA stream-capture once, replay many.
- hopper
- Phase 5: Hopper / Blackwell host-side primitives. The module surface is always compiled (the `tma::TensorMapDescriptor` builder and `cluster::LaunchSpec` types are useful even on hosts that don’t link a Hopper driver). The `hopper` cargo feature gates the FFI implementations of `cuTensorMapEncodeTiled` / `cudaLaunchKernelExC`.
- host
- Host-side support: pinned (page-locked) memory pool + `PinnedBuf<T>`.
- kernel
- Kernel-actor wrappers around CUDA library handles (§3.2).
- memory
- Managed (unified) memory + Phase 3 driver-API helpers.
- module
- `ModuleActor`: load prebuilt cubin/PTX from disk (or memory) and launch its kernels.
- multi_device
- Top-level multi-device actors that span multiple `DeviceActor`s.
- nvrtc_cache
- Persistent disk cache for NVRTC-compiled CUDA kernels (Phase 0.6).
- observability
- Observability glue: install `atomr_telemetry::TelemetryExtension` on a host `ActorSystem` and expose a small set of GPU-specific probes that callers feed from kernel actors / placement actors / stream allocators.
- p2p
- P2P (peer-to-peer) topology + cross-device async memcpy.
- pipeline
- Multi-stream pipeline pattern.
- placement
- `PlacementActor`: picks the best-fit `DeviceActor` for each request based on a configurable `PlacementPolicy`.
- prelude
- Common imports for users of `atomr-accel-cuda`.
- replay
- Deterministic-replay harness.
- stream
- Stream allocation strategies (§5.7).
- streams_pipeline
- `atomr-streams`-based pipeline helpers: the F10 successor to the actor-based `crate::pipeline::PipelineExecutor`.
- sys
- `sys`: thin Rust wrappers over `cudarc`’s raw `*::sys` FFI for library entry points that aren’t yet exposed by the safe layer.
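The `placement` entry above describes picking a best-fit device under a configurable policy. The sketch below shows one way such a selection could look. It is only an illustration: `DeviceInfo`, `Policy`, and `pick` are hypothetical names, and the three policies are assumptions rather than the variants of the crate's `PlacementPolicy`.

```rust
/// Hypothetical snapshot of one device's state, as a placement layer might
/// track it. Field names are assumptions made for this example.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DeviceInfo {
    pub ordinal: usize,
    pub free_bytes: u64,
    pub queued_jobs: u32,
}

/// Hypothetical policies; a real policy set may differ.
#[derive(Debug, Clone, Copy)]
pub enum Policy {
    /// Prefer the device with the most free memory.
    MostFreeMemory,
    /// Prefer the smallest device that still fits the request.
    TightestFit,
    /// Prefer the device with the fewest queued jobs.
    LeastLoaded,
}

/// Return the ordinal of the chosen device, or `None` if no device has
/// enough free memory for the request.
pub fn pick(devices: &[DeviceInfo], needed_bytes: u64, policy: Policy) -> Option<usize> {
    // Devices that cannot hold the request are never candidates.
    let candidates = devices.iter().filter(|d| d.free_bytes >= needed_bytes);
    let chosen = match policy {
        Policy::MostFreeMemory => candidates.max_by_key(|d| d.free_bytes),
        Policy::TightestFit => candidates.min_by_key(|d| d.free_bytes),
        Policy::LeastLoaded => candidates.min_by_key(|d| d.queued_jobs),
    };
    chosen.map(|d| d.ordinal)
}
```

Keeping the policy as a plain enum makes the selection a pure function of the device snapshot, which is easy to test in isolation. In an actor setting the snapshot would presumably be refreshed from messages sent by each device actor.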