kernel::envelope — shared kernel-actor body factored out of
BlasActor::enqueue_sgemm.
Every library actor (BlasActor, CudnnActor, FftActor,
RngActor, …) follows the same pattern:
- Validate every input GpuRef via GpuRef::access and turn the strong Arc<CudaSlice<T>> into a temporary owner that keeps the buffer alive past kernel completion.
- Synchronously enqueue the kernel onto the actor’s stream. The enqueue body is library-specific and provided as a closure.
- Spawn an async task that awaits the configured CompletionStrategy, delivers the reply on a oneshot::Sender, and only then drops the temporary owners (so that the kernel can’t outlive its inputs).
The envelope handles step 3 uniformly. Pre-launch errors are
reported synchronously through the same oneshot. Post-launch
errors arrive through the completion future.
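The three steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the crate's implementation: Buffer, KernelError and run_sketch are hypothetical names, an Option models a possibly stale reference, an mpsc channel stands in for the oneshot, and a thread stands in for the async completion task.

```rust
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for a device buffer (not cudarc's `CudaSlice`).
struct Buffer(Vec<f32>);

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum KernelError {
    Stale,
}

/// Validate inputs, run the enqueue closure synchronously, then deliver
/// the reply from a background thread that still owns the inputs.
fn run_sketch<F>(
    inputs: Vec<Option<Arc<Buffer>>>,
    enqueue: F,
) -> mpsc::Receiver<Result<f32, KernelError>>
where
    F: FnOnce(&[Arc<Buffer>]) -> f32,
{
    let (tx, rx) = mpsc::channel();

    // Step 1: validate every input; `None` models a stale reference.
    let owners: Vec<Arc<Buffer>> = match inputs.into_iter().collect::<Option<Vec<_>>>() {
        Some(owners) => owners,
        None => {
            // Pre-launch error: reported through the same channel as the reply.
            let _ = tx.send(Err(KernelError::Stale));
            return rx;
        }
    };

    // Step 2: run the library-specific enqueue closure on the calling thread.
    let result = enqueue(&owners);

    // Step 3: a thread stands in for the async completion task.
    thread::spawn(move || {
        let _ = tx.send(Ok(result));
        // The temporary owners are dropped only after the reply is delivered.
        drop(owners);
    });

    rx
}
```

Because the pre-launch failure and the normal reply travel over the same channel, the caller has a single place to wait in this sketch regardless of where the error originated.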
§Single-writer enforcement
cudarc’s library APIs typically take &mut Dst for the write
target. cudarc 0.19 satisfies this for CudaSlice<T>. Since a
GpuRef<T> wraps Arc<CudaSlice<T>>, callers that want write
access to a buffer must hold the unique reference to that
GpuRef (so Arc::try_unwrap succeeds inside the actor). Each
library actor enforces this contract explicitly — the envelope
does not, because some libraries (cuBLAS gemm with non-zero beta)
read-modify-write the output while others (cuDNN forward conv)
write to a freshly allocated output.
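The uniqueness requirement rests on the documented behaviour of Arc::try_unwrap, which succeeds only when exactly one strong reference exists. A minimal sketch follows, in which Slice, OutRef, take_unique and scale_in_place are hypothetical stand-ins rather than the crate's or cudarc's types:

```rust
use std::sync::Arc;

/// Hypothetical stand-in for `CudaSlice<f32>`.
#[derive(Debug)]
struct Slice(Vec<f32>);

/// Hypothetical wrapper playing the role of `GpuRef<T>`.
#[derive(Debug)]
struct OutRef(Arc<Slice>);

/// Take exclusive ownership of the output, or hand the reference back
/// if another clone of the `Arc` is still alive.
fn take_unique(out: OutRef) -> Result<Slice, OutRef> {
    Arc::try_unwrap(out.0).map_err(OutRef)
}

/// Stand-in for a library call that needs `&mut` access to its destination.
fn scale_in_place(dst: &mut Slice, alpha: f32) {
    for x in dst.0.iter_mut() {
        *x *= alpha;
    }
}
```

Returning the original reference in the error case lets the caller decide whether to retry, copy the buffer, or report the failure, without the write target being lost.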
§Observability hooks (Phase 0.7)
KernelEnvelope is an opt-in builder that wraps the same
pipeline with two observability surfaces:
- A KernelTrace callback that fires four lifecycle events (before_enqueue, after_enqueue, before_complete, after_complete). The trait is always compiled; when no trace is set, the envelope skips the calls entirely.
- An optional NVTX range label. When the nvtx cargo feature is on, the synchronous-enqueue body is wrapped in a cudarc::nvtx::safe::scoped_range guard. When the feature is off, the field is unused and adds no runtime cost.
Existing callers that use the free run_kernel function continue
to behave byte-for-byte identically: that path constructs a default
(trace-less, nvtx-less) envelope.
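The trace mechanism can be pictured as a trait whose four hooks default to no-ops, plus an optional receiver that is consulted only when present. The sketch below uses hypothetical names (Info, Trace, EnqueueLog, run_traced) and assumed signatures; the real KernelTrace and KernelInfo definitions are not shown on this page and may differ.

```rust
/// Hypothetical per-launch metadata; the real `KernelInfo` fields may differ.
struct Info {
    name: &'static str,
}

/// Lifecycle hooks with empty default bodies, one per event.
trait Trace {
    fn before_enqueue(&mut self, _info: &Info) {}
    fn after_enqueue(&mut self, _info: &Info) {}
    fn before_complete(&mut self, _info: &Info) {}
    fn after_complete(&mut self, _info: &Info) {}
}

/// A trace that overrides only the two enqueue events.
#[derive(Default)]
struct EnqueueLog {
    events: Vec<String>,
}

impl Trace for EnqueueLog {
    fn before_enqueue(&mut self, info: &Info) {
        self.events.push(format!("before_enqueue:{}", info.name));
    }
    fn after_enqueue(&mut self, info: &Info) {
        self.events.push(format!("after_enqueue:{}", info.name));
    }
}

/// Fire the hooks around a unit of work; with `None`, no hook is called.
fn run_traced(mut trace: Option<&mut dyn Trace>, info: &Info, work: impl FnOnce()) {
    if let Some(t) = trace.as_mut() {
        t.before_enqueue(info);
    }
    work(); // stands in for the synchronous enqueue body
    if let Some(t) = trace.as_mut() {
        t.after_enqueue(info);
    }
    // Completion is immediate in this sketch; a real pipeline would await it here.
    if let Some(t) = trace.as_mut() {
        t.before_complete(info);
    }
    if let Some(t) = trace.as_mut() {
        t.after_complete(info);
    }
}
```

Default method bodies keep custom traces small, and the Option check means the untraced path pays only a branch per event in this sketch.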
Structs§
- KernelEnvelope: Builder/configuration for a single run_kernel invocation.
- KernelInfo: Per-launch metadata passed to every KernelTrace callback.
Traits§
- KernelTrace: Lifecycle hook receiver. All four methods have empty default bodies, so a custom trace can override only the events it cares about.
Functions§
- access_all_2: Validate two input GpuRefs and return owning Arcs of their underlying slices. Fails fast (synchronously) with GpuRefStale if either is invalid.
- access_all_3: Validate three input GpuRefs and return owning Arcs of their underlying slices.
- access_all_4: Validate four input GpuRefs and return owning Arcs. Used by cuDNN convolution which takes (input, filter, output, workspace).
- run_kernel: Run the synchronous-enqueue + async-completion-await pipeline.
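The fail-fast behaviour described for the access_all_* helpers can be illustrated with the ? operator. Everything in this sketch is an assumption made for illustration: Ref, Stale and access_both are hypothetical names, and a plain bool stands in for whatever validity check GpuRef::access actually performs, which this page does not describe.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the `GpuRefStale` error.
#[derive(Debug, PartialEq)]
struct Stale;

/// Hypothetical reference type; `valid` models the unspecified staleness check.
struct Ref<T> {
    slice: Arc<T>,
    valid: bool,
}

impl<T> Ref<T> {
    /// Return an owning `Arc` if the reference is still valid.
    fn access(&self) -> Result<Arc<T>, Stale> {
        if self.valid {
            Ok(Arc::clone(&self.slice))
        } else {
            Err(Stale)
        }
    }
}

/// Validate two references; the first failure returns immediately.
fn access_both<A, B>(a: &Ref<A>, b: &Ref<B>) -> Result<(Arc<A>, Arc<B>), Stale> {
    Ok((a.access()?, b.access()?))
}
```

If the second access fails, the Arc already obtained from the first is dropped on return, so a failed validation leaves no extra owners behind.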