//! # atomr-accel-cuda
//!
//! GPU acceleration via the actor model. Wraps NVIDIA CUDA libraries as
//! actors on top of [`atomr`](../atomr). See `README.md` and the
//! architecture document under `docs/` for the full design.
//!
//! ## Foundation Phase F1 (current)
//!
//! - Two-tier supervision: [`device::DeviceActor`] (stable address) ↔
//!   [`device::ContextActor`] (owns `Arc<CudaContext>`, restartable).
//! - [`gpu_ref::GpuRef`] with generation-token validity checks.
//! - [`dispatcher::GpuDispatcher`] pinning actor execution to a single
//!   OS thread.
//! - [`completion::HostFnCompletion`] for sub-microsecond stream
//!   completion via `cuLaunchHostFunc`.
//! - [`stream::PerActorAllocator`] as the default §5.7 strategy.
//! - [`kernel::BlasActor`] performing cuBLAS SGEMM as the canonical
//!   demo.
//!
//! Phases F2–F5 (cuDNN, cuFFT, NCCL, TensorRT, the `PythonGpuBridge`)
//! and the four blueprint sub-crates are deferred.
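//!
//! As an illustration of the generation-token idea behind
//! [`gpu_ref::GpuRef`], the sketch below shows one way such a validity
//! check could work. It is not this crate's implementation: `Context`,
//! `Handle`, `restart`, and `is_valid` are hypothetical names, and the
//! assumption is that a context restart bumps a counter which each
//! handle compares against the value it captured at creation.
//!
//! ```rust
//! use std::sync::atomic::{AtomicU64, Ordering};
//! use std::sync::Arc;
//!
//! /// Hypothetical stand-in for a restartable context.
//! struct Context {
//!     /// Incremented on every restart (assumed behaviour).
//!     generation: AtomicU64,
//! }
//!
//! impl Context {
//!     fn new() -> Arc<Self> {
//!         Arc::new(Self { generation: AtomicU64::new(0) })
//!     }
//!
//!     /// Simulates a restart by advancing the generation counter.
//!     fn restart(&self) {
//!         self.generation.fetch_add(1, Ordering::SeqCst);
//!     }
//! }
//!
//! /// Hypothetical handle that remembers the generation it was created in.
//! struct Handle {
//!     ctx: Arc<Context>,
//!     generation: u64,
//! }
//!
//! impl Handle {
//!     fn new(ctx: &Arc<Context>) -> Self {
//!         Self {
//!             ctx: Arc::clone(ctx),
//!             generation: ctx.generation.load(Ordering::SeqCst),
//!         }
//!     }
//!
//!     /// Valid only while the context is still in the captured generation.
//!     fn is_valid(&self) -> bool {
//!         self.generation == self.ctx.generation.load(Ordering::SeqCst)
//!     }
//! }
//!
//! fn main() {
//!     let ctx = Context::new();
//!     let handle = Handle::new(&ctx);
//!     assert!(handle.is_valid());
//!
//!     ctx.restart();
//!     // The handle predates the restart, so it now reports itself as stale.
//!     assert!(!handle.is_valid());
//! }
//! ```
//!
//! Under these assumptions, a handle created before a restart can be
//! rejected with a cheap integer comparison rather than being used
//! against a context that no longer exists.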

// Subjective clippy lints that fight the actor-message design:
// * `type_complexity`: actor messages and kernel envelopes return
//   tuples of typed `Arc<CudaSlice<T>>` keep-alives; refactoring to
//   `type` aliases would worsen the public API.
// * `too_many_arguments`: kernel-launcher fns mirror the underlying
//   CUDA library entry points (cuDNN conv, cuSPARSE SpMV) which take
//   8–10 args; collapsing to a config struct just moves the fields.
// * `arc_with_non_send_sync`: CUDA driver handles (CudaGraph,
//   cudnnHandle) are `!Send` by design and only ever shared inside
//   the producing actor.
// * `large_enum_variant`: kernel-message enums have one large
//   conv-descriptor variant; boxing it would fragment the hot path.
#![allow(
    clippy::type_complexity,
    clippy::too_many_arguments,
    clippy::arc_with_non_send_sync,
    clippy::large_enum_variant,
    // Many `unsafe` FFI shims below intentionally elide the `# Safety`
    // doc; invariants are documented at the module level alongside the
    // matching `cudarc::*::sys` types.
    clippy::missing_safety_doc,
    // Phase 0 introduced typed-dispatch `BlasMsg::Gemm`; `BlasMsg::Sgemm`
    // remains as a deprecated back-compat alias used by examples /
    // benches / migration tests. The crate intentionally calls its own
    // deprecated surface during the deprecation window.
    deprecated,
    // `Drop` on owned-by-Arc handles is safe; the explicit `drop()` in
    // a few places is documentation, not behaviour.
    clippy::drop_non_drop,
    // Internal-only `len`/`is_empty` symmetry isn't load-bearing for
    // dispatch traits.
    clippy::len_without_is_empty,
    clippy::vec_init_then_push,
    clippy::not_unsafe_ptr_arg_deref,
    dead_code,
    unused_macros
)]

pub mod completion;
pub mod device;
pub mod dispatcher;
pub mod dtype;
pub mod error;
pub mod event;
pub mod gpu_ref;
pub mod graph;
/// Phase 5: Hopper / Blackwell host-side primitives. The module
/// surface is always compiled (the `tma::TensorMapDescriptor` builder
/// and `cluster::LaunchSpec` types are useful even on hosts that don't
/// link a Hopper driver). The `hopper` cargo feature gates the FFI
/// implementations of `cuTensorMapEncodeTiled` / `cudaLaunchKernelExC`.
pub mod hopper;
pub mod host;
pub mod kernel;
pub mod memory;
#[cfg(feature = "nvrtc")]
pub mod module;
#[cfg(feature = "nccl")]
pub mod multi_device;
pub mod nvrtc_cache;
#[cfg(feature = "telemetry")]
pub mod observability;
pub mod p2p;
pub mod pipeline;
pub mod placement;
pub mod prelude;
pub mod replay;
pub mod stream;
#[cfg(feature = "streams")]
pub mod streams_pipeline;
pub mod sys;