atomr_accel_cuda/lib.rs
//! # atomr-accel-cuda
//!
//! GPU acceleration via the actor model. Wraps NVIDIA CUDA libraries as
//! actors on top of [`atomr`](../atomr). See `README.md` and the
//! architecture document under `docs/` for the full design.
//!
//! ## Foundation Phase F1 (current)
//!
//! - Two-tier supervision: [`device::DeviceActor`] (stable address) ↔
//!   [`device::ContextActor`] (owns `Arc<CudaContext>`, restartable).
//! - [`gpu_ref::GpuRef`] with generation-token validity checks.
//! - [`dispatcher::GpuDispatcher`] pinning actor execution to a single
//!   OS thread.
//! - [`completion::HostFnCompletion`] for sub-microsecond stream
//!   completion via `cuLaunchHostFunc`.
//! - [`stream::PerActorAllocator`] as the default §5.7 strategy.
//! - [`kernel::BlasActor`] performing cuBLAS SGEMM as the canonical
//!   demo.
//!
//! Phases F2–F5 (cuDNN, cuFFT, NCCL, TensorRT, the `PythonGpuBridge`)
//! and the four blueprint sub-crates are deferred.
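//!
//! ## Sketch: generation-token validity
//!
//! The generation-token check listed above can be illustrated with a
//! minimal, self-contained sketch. `Context` and `Handle` below are
//! hypothetical stand-ins invented for this example; they are not the
//! crate's `ContextActor` or `GpuRef`, and the real types may work
//! differently. The assumption is that each restart bumps a counter and
//! that a handle is valid only while its recorded generation matches.
//!
//! ```rust
//! use std::sync::atomic::{AtomicU64, Ordering};
//! use std::sync::Arc;
//!
//! /// Hypothetical stand-in for a restartable context (assumption).
//! struct Context {
//!     generation: AtomicU64,
//! }
//!
//! impl Context {
//!     fn new() -> Arc<Self> {
//!         Arc::new(Self { generation: AtomicU64::new(0) })
//!     }
//!
//!     /// Simulates a supervised restart: bumps the generation so that
//!     /// every handle created earlier becomes stale.
//!     fn restart(&self) {
//!         self.generation.fetch_add(1, Ordering::SeqCst);
//!     }
//! }
//!
//! /// Hypothetical handle that records the generation it was created under.
//! struct Handle {
//!     ctx: Arc<Context>,
//!     generation: u64,
//! }
//!
//! impl Handle {
//!     fn new(ctx: &Arc<Context>) -> Self {
//!         Self {
//!             ctx: Arc::clone(ctx),
//!             generation: ctx.generation.load(Ordering::SeqCst),
//!         }
//!     }
//!
//!     /// Valid only while the context is still in the same generation.
//!     fn is_valid(&self) -> bool {
//!         self.generation == self.ctx.generation.load(Ordering::SeqCst)
//!     }
//! }
//!
//! fn main() {
//!     let ctx = Context::new();
//!     let old = Handle::new(&ctx);
//!     assert!(old.is_valid());
//!
//!     ctx.restart(); // the old handle is now stale
//!     assert!(!old.is_valid());
//!
//!     let fresh = Handle::new(&ctx);
//!     assert!(fresh.is_valid());
//! }
//! ```
//!
//! In a design along these lines, comparing one counter keeps the check
//! cheap and lets a stale handle be rejected before any device memory
//! from a torn-down context is touched.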

// Subjective clippy lints that fight the actor-message design:
//   * `type_complexity`: actor messages and kernel envelopes return
//     tuples of typed `Arc<CudaSlice<T>>` keep-alives; refactoring to
//     `type` aliases would worsen the public API.
//   * `too_many_arguments`: kernel-launcher fns mirror the underlying
//     CUDA library entry points (cuDNN conv, cuSPARSE SpMV) which take
//     8–10 args; collapsing to a config struct just moves the fields.
//   * `arc_with_non_send_sync`: CUDA driver handles (`CudaGraph`,
//     `cudnnHandle`) are `!Send` by design and only ever shared inside
//     the producing actor.
//   * `large_enum_variant`: kernel-message enums have one large
//     conv-descriptor variant; boxing it would fragment the hot path.
#![allow(
    clippy::type_complexity,
    clippy::too_many_arguments,
    clippy::arc_with_non_send_sync,
    clippy::large_enum_variant,
    // Many `unsafe` FFI shims below intentionally elide the `# Safety`
    // doc; invariants are documented at the module level alongside the
    // matching `cudarc::*::sys` types.
    clippy::missing_safety_doc,
    // Phase 0 introduced typed-dispatch `BlasMsg::Gemm`; `BlasMsg::Sgemm`
    // remains as a deprecated back-compat alias used by examples /
    // benches / migration tests. The crate intentionally calls its own
    // deprecated surface during the deprecation window.
    deprecated,
    // `Drop` on owned-by-Arc handles is safe; the explicit `drop()` in
    // a few places is documentation, not behaviour.
    clippy::drop_non_drop,
    // Internal-only `len`/`is_empty` symmetry isn't load-bearing for
    // dispatch traits.
    clippy::len_without_is_empty,
    clippy::vec_init_then_push,
    clippy::not_unsafe_ptr_arg_deref,
    dead_code,
    unused_macros
)]

pub mod completion;
pub mod device;
pub mod dispatcher;
pub mod dtype;
pub mod error;
pub mod event;
pub mod gpu_ref;
pub mod graph;
/// Phase 5: Hopper / Blackwell host-side primitives. The module
/// surface is always compiled (the `tma::TensorMapDescriptor` builder
/// and `cluster::LaunchSpec` types are useful even on hosts that don't
/// link a Hopper driver). The `hopper` cargo feature gates the FFI
/// implementations of `cuTensorMapEncodeTiled` / `cudaLaunchKernelExC`.
pub mod hopper;
pub mod host;
pub mod kernel;
pub mod memory;
#[cfg(feature = "nvrtc")]
pub mod module;
#[cfg(feature = "nccl")]
pub mod multi_device;
pub mod nvrtc_cache;
#[cfg(feature = "telemetry")]
pub mod observability;
pub mod p2p;
pub mod pipeline;
pub mod placement;
pub mod prelude;
pub mod replay;
pub mod stream;
#[cfg(feature = "streams")]
pub mod streams_pipeline;
pub mod sys;