Kernel-actor wrappers around CUDA library handles (§3.2).
Each library actor follows a uniform shape:
- a Real { handle, stream, completion, state, … } variant holding the cudarc handle plus the per-actor caches it needs;
- a Mock variant for GPU-free tests;
- a props(stream, allocator, completion, state) constructor that panics with "ContextPoisoned: <Lib>::new failed: …" if the handle can’t be created, so the supervisor restarts;
- a mock_props() constructor that replies Unrecoverable("…not supported in mock mode") to every variant.
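The uniform shape described above can be sketched in plain Rust. This is a minimal illustration, not the crate's code: LibActor, LibMsg, Reply and FakeHandle are hypothetical stand-ins, and props here takes a handle-creating closure instead of the real (stream, allocator, completion, state) arguments so that the sketch runs without a GPU. Only the Real/Mock split, the panic prefix and the Unrecoverable reply wording follow the description.

```rust
/// Hypothetical reply type; only the name `Unrecoverable` comes from the docs.
#[derive(Debug, PartialEq)]
pub enum Reply {
    Done(f32),
    Unrecoverable(String),
}

/// Hypothetical message enum standing in for a per-library message type.
#[derive(Debug)]
pub enum LibMsg {
    Scale { x: f32, alpha: f32 },
    Negate { x: f32 },
}

impl LibMsg {
    fn name(&self) -> &'static str {
        match self {
            LibMsg::Scale { .. } => "Scale",
            LibMsg::Negate { .. } => "Negate",
        }
    }
}

/// Stand-in for a library handle (an assumption; the real variant holds a cudarc handle).
#[derive(Debug)]
pub struct FakeHandle;

/// Two-variant actor: `Real` owns the handle plus a small per-actor cache,
/// `Mock` exists so tests can run without a GPU.
#[derive(Debug)]
pub enum LibActor {
    Real { handle: FakeHandle, calls: u64 },
    Mock,
}

impl LibActor {
    /// Panics if the handle cannot be created, leaving the restart to a supervisor.
    pub fn props(create: impl FnOnce() -> Result<FakeHandle, String>) -> Self {
        match create() {
            Ok(handle) => LibActor::Real { handle, calls: 0 },
            Err(e) => panic!("ContextPoisoned: Lib::new failed: {e}"),
        }
    }

    /// Mock constructor: every message gets an `Unrecoverable` reply.
    pub fn mock_props() -> Self {
        LibActor::Mock
    }

    pub fn handle(&mut self, msg: LibMsg) -> Reply {
        match self {
            LibActor::Mock => {
                Reply::Unrecoverable(format!("{} not supported in mock mode", msg.name()))
            }
            LibActor::Real { handle: _, calls } => {
                *calls += 1; // per-actor state lives next to the handle
                match msg {
                    LibMsg::Scale { x, alpha } => Reply::Done(x * alpha),
                    LibMsg::Negate { x } => Reply::Done(-x),
                }
            }
        }
    }
}
```

In this sketch, LibActor::mock_props().handle(LibMsg::Negate { x: 1.0 }) returns the Unrecoverable reply, while LibActor::props(|| Err("no device".to_string())) panics with the ContextPoisoned prefix.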
The shared kernel-enqueue body lives in
envelope::run_kernel — every library actor calls it instead of
reimplementing the validate / enqueue / spawn-completion-await /
reply / drop-keep-alive sequence.
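The five-step sequence named above (validate, enqueue, spawn completion await, reply, drop keep-alive) could look roughly like the following. This is a sketch under stated assumptions, not the signature of envelope::run_kernel: run_kernel_sketch and KernelReply are hypothetical names, a std thread stands in for the spawned completion task, an mpsc channel stands in for the completion signal, and an Arc<Vec<f32>> stands in for the device buffers that must outlive the kernel.

```rust
use std::sync::{mpsc, Arc};
use std::thread;

/// Hypothetical reply type for the sketch.
#[derive(Debug, PartialEq)]
pub enum KernelReply {
    Done,
    Invalid(String),
    Failed(String),
}

/// Runs the shared sequence once. Returns the waiter thread's handle,
/// or `None` if validation rejected the request before anything was enqueued.
pub fn run_kernel_sketch<V, E>(
    keep_alive: Arc<Vec<f32>>, // stand-in for buffers the kernel reads or writes
    validate: V,
    enqueue: E,
    reply_to: mpsc::Sender<KernelReply>,
) -> Option<thread::JoinHandle<()>>
where
    V: FnOnce(&[f32]) -> Result<(), String>,
    E: FnOnce(&[f32]) -> mpsc::Receiver<()>, // returns the completion signal
{
    // 1. validate: reject bad requests without touching the library
    if let Err(why) = validate(keep_alive.as_slice()) {
        let _ = reply_to.send(KernelReply::Invalid(why));
        return None;
    }

    // 2. enqueue: submit the work and obtain a completion signal
    let completion = enqueue(keep_alive.as_slice());

    // 3. spawn completion await: the caller is not blocked while the work runs
    Some(thread::spawn(move || {
        let reply = match completion.recv() {
            Ok(()) => KernelReply::Done,
            Err(_) => KernelReply::Failed("completion signal dropped".to_string()),
        };
        // 4. reply to the requester
        let _ = reply_to.send(reply);
        // 5. drop keep-alive: buffers are released only after completion
        drop(keep_alive);
    }))
}
```

Holding keep_alive inside the waiter is the point of the last step: the buffers cannot be freed while the enqueued work may still be using them.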
F2 ships: BlasActor, CudnnActor, FftActor, RngActor.
F3 adds: SolverActor, BlasLtActor, NvrtcActor.
F4 adds: CollectiveActor (NCCL).
Re-exports
pub use dispatch::BlasLtDispatch;
pub use dispatch::BlasLtDispatchCtx;
pub use dispatch::CollectiveDispatch;
pub use dispatch::CollectiveDispatchCtx;
pub use dispatch::CudnnDispatch;
pub use dispatch::CudnnDispatchCtx;
pub use dispatch::DevSliceArg;
pub use dispatch::GemmDispatchCtx;
pub use dispatch::NvrtcDispatchCtx;
pub use dispatch::NvrtcLaunchDispatch;
pub use dispatch::RngDispatch;
pub use dispatch::ScalarArg;
pub use dispatch::FftDispatch;
pub use dispatch::FftDispatchCtx;
pub use dispatch::SendSparseHandle;
pub use dispatch::SparseDispatch;
pub use dispatch::SparseDispatchCtx;
pub use dispatch::SparseOp;
pub use dispatch::TensorDispatch;
pub use dispatch::TensorDispatchCtx;
pub use dispatch::WorkspacePool;
pub use blas::AsumRequest;
pub use blas::AxpyRequest;
pub use blas::BlasActor;
pub use blas::BlasMsg;
pub use blas::CopyRequest;
pub use blas::DotRequest;
pub use blas::GeamRequest;
pub use blas::GemmRequest;
pub use blas::GemmStridedBatchedRequest;
pub use blas::GemvRequest;
pub use blas::GerRequest;
pub use blas::IamaxRequest;
pub use blas::IaminRequest;
pub use blas::Nrm2Request;
pub use blas::RotRequest;
pub use blas::ScalRequest;
pub use blas::SwapRequest;
pub use blas::SyrkRequest;
pub use blas::TrsmRequest;
pub use cudnn::ActivationFwdRequest;
pub use cudnn::ActivationKind;
pub use cudnn::ActivationRequest;
pub use cudnn::AttentionMask;
pub use cudnn::AttentionParams;
pub use cudnn::BatchNormRequest;
pub use cudnn::ConvBwdDataRequest;
pub use cudnn::ConvBwdFilterRequest;
pub use cudnn::ConvDescParams;
pub use cudnn::ConvForwardRequest;
pub use cudnn::ConvFwdRequest;
pub use cudnn::ConvParams;
pub use cudnn::CudnnActor;
pub use cudnn::CudnnMsg;
pub use cudnn::DropoutFwdRequest;
pub use cudnn::EpilogueKind;
pub use cudnn::GroupNormRequest;
pub use cudnn::InstanceNormRequest;
pub use cudnn::LayerNormRequest;
pub use cudnn::LrnFwdRequest;
pub use cudnn::LrnParams;
pub use cudnn::MultiHeadAttnBwdRequest;
pub use cudnn::MultiHeadAttnFwdRequest;
pub use cudnn::NormBwdRequest;
pub use cudnn::NormMode;
pub use cudnn::NormPhase;
pub use cudnn::PoolBwdRequest;
pub use cudnn::PoolFwdRequest;
pub use cudnn::PoolMode;
pub use cudnn::PoolParams;
pub use cudnn::RnnBwdRequest;
pub use cudnn::RnnDirection;
pub use cudnn::RnnFwdRequest;
pub use cudnn::RnnMode;
pub use cudnn::RnnParams;
pub use cudnn::SoftmaxFwdRequest;
pub use cudnn::SoftmaxMode;
pub use cudnn::SoftmaxRequest;
pub use cudnn::TensorLayout;
pub use fft::FftActor;
pub use fft::FftCallbackKind;
pub use fft::FftDirection;
pub use fft::FftKind;
pub use fft::FftMsg;
pub use fft::FftPlan;
pub use fft::FftPlanMany;
pub use fft::FftRequest;
pub use fft::PlanKey;
pub use rng::Distribution;
pub use rng::FillRequest;
pub use rng::RngActor;
pub use rng::RngGeneratorKind;
pub use rng::RngMsg;
pub use solver::CholeskyRequest;
pub use solver::GesvdjBatchedRequest;
pub use solver::GetrfBatchedRequest;
pub use solver::HegvdRequest;
pub use solver::LuRequest;
pub use solver::LuSolveRequest;
pub use solver::PotrfBatchedRequest;
pub use solver::QrRequest;
pub use solver::SolverActor;
pub use solver::SolverDispatch;
pub use solver::SolverMsg;
pub use solver::SvdRequest;
pub use solver::SyevdRequest;
pub use solver::SygvdRequest;
pub use solver::Uplo;
pub use blas_lt::BlasLtActor;
pub use blas_lt::BlasLtMsg;
pub use blas_lt::Epilogue;
pub use blas_lt::HeuristicCacheRef;
pub use blas_lt::MatmulRequest;
pub use blas_lt::ScaleSet;
pub use blas_lt::WorkspacePool as BlasLtWorkspacePool;
pub use nvrtc::KernelArg;
pub use nvrtc::KernelHandle;
pub use nvrtc::NvrtcActor;
pub use nvrtc::NvrtcMsg;
pub use nvrtc::NvrtcOpts;
pub use collective::AllGatherRequest;
pub use collective::AllReduceRequest;
pub use collective::AllToAllRequest;
pub use collective::AllToAllvRequest;
pub use collective::BroadcastRequest;
pub use collective::CollectiveActor;
pub use collective::CollectiveMsg;
pub use collective::GroupGuard;
pub use collective::NcclCapabilities;
pub use collective::NcclReduceSupported;
pub use collective::PreMulSumOp;
pub use collective::RecvRequest;
pub use collective::ReduceRequest;
pub use collective::ReduceScatterRequest;
pub use collective::SendRequest;
pub use tensor::ComputeDesc;
pub use tensor::ContractRequest;
pub use tensor::ElementwiseBinaryRequest;
pub use tensor::ElementwiseTrinaryRequest;
pub use tensor::OperandSpec;
pub use tensor::PermutationRequest;
pub use tensor::ReductionRequest;
pub use tensor::TensorActor;
pub use tensor::TensorMsg;
pub use tensor::TensorSpec;
Modules
- blas
  BlasActor: full cuBLAS surface (Phase 1 cuBLAS slice).
- blas_lt
  BlasLtActor: wraps [cudarc::cublaslt::CudaBlasLT] for transformer-shaped fused matmul (matmul + bias + activation + aux-store + bias-grad reduction) across the full dtype matrix cuBLASLt accepts.
- collective
  CollectiveActor: wraps a [cudarc::nccl::Comm] for one rank within an NcclWorldActor group.
- cudnn
  CudnnActor: Phase 2 cuDNN slice. Wraps a [cudarc::cudnn::Cudnn] handle and exposes the v9 frontend graph API plus legacy ConvForward / Activation / Softmax shims for back-compat.
- dispatch
  Per-actor *Dispatch traits + their *DispatchCtx bundles (Phase 0.3).
- envelope
  kernel::envelope: shared kernel-actor body factored out of BlasActor::enqueue_sgemm.
- fft
  FftActor: wraps [cudarc::cufft::CudaFft] with an LRU cache of plans keyed by shape + transform kind + dtype + batch.
- nvrtc
  NvrtcActor: JIT-compile and launch user-supplied CUDA C++ kernels at runtime.
- record
  Capture-mode contract.
- rng
  RngActor: wraps a cuRAND curandGenerator_t handle and fills device buffers with the full distribution matrix.
- solver
  SolverActor: wraps a [cudarc::cusolver::DnHandle] for dense linear algebra and a [cudarc::cusolver::SpHandle] for sparse solves (gated cusolver-sp).
- tensor
  TensorActor: wraps cuTENSOR for contractions, reductions, permutations, and binary/trinary elementwise ops.
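The fft entry in the list above mentions an LRU cache of plans keyed by shape, transform kind, dtype and batch. A minimal sketch of that idea follows. All names here (PlanKeySketch, TransformKind, ElemType, LruPlanCache) are hypothetical and do not reflect the fields of the crate's PlanKey or the layout of its cache; eviction scans for the least recently used entry, which is simple but O(n) and is an assumption made for brevity.

```rust
use std::collections::HashMap;

/// Hypothetical transform kinds and element types used only by this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum TransformKind { C2C, R2C, C2R }

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum ElemType { F32, F64 }

/// Hypothetical cache key: one plan per (shape, kind, dtype, batch) combination.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct PlanKeySketch {
    pub shape: Vec<usize>,
    pub kind: TransformKind,
    pub dtype: ElemType,
    pub batch: usize,
}

/// Bounded cache that evicts the least recently used plan when full.
pub struct LruPlanCache<P> {
    capacity: usize,
    tick: u64, // logical clock; larger means more recently used
    entries: HashMap<PlanKeySketch, (P, u64)>,
}

impl<P> LruPlanCache<P> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be at least 1");
        Self { capacity, tick: 0, entries: HashMap::new() }
    }

    /// Returns the cached plan for `key`, building it (and evicting if needed) on a miss.
    pub fn get_or_build(
        &mut self,
        key: PlanKeySketch,
        build: impl FnOnce(&PlanKeySketch) -> P,
    ) -> &P {
        self.tick += 1;
        let tick = self.tick;
        if !self.entries.contains_key(&key) {
            if self.entries.len() >= self.capacity {
                // Find the entry with the smallest last-used tick and evict it.
                let oldest = self
                    .entries
                    .iter()
                    .min_by_key(|(_, (_, used))| *used)
                    .map(|(k, _)| k.clone());
                if let Some(k) = oldest {
                    self.entries.remove(&k);
                }
            }
            let plan = build(&key);
            self.entries.insert(key.clone(), (plan, tick));
        }
        let entry = self.entries.get_mut(&key).expect("entry was just ensured");
        entry.1 = tick; // refresh recency on hit and on insert
        &entry.0
    }

    pub fn contains(&self, key: &PlanKeySketch) -> bool {
        self.entries.contains_key(key)
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

Plan creation is typically much more expensive than plan execution, which is the usual motivation for caching plans by their full configuration rather than rebuilding them per request.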
Structs
- CsrMatrix
  Legacy CSR sparse matrix in device memory, kept for back-compat with callers built against F-9. Prefer [SparseMatrix] for new code.
- SparseActor
Enums
- Activation
  Available activation for kernel fusing in matmul
- ReduceOp
- SparseMsg
  Public messages for SparseActor.