Module envelope

kernel::envelope — shared kernel-actor body factored out of BlasActor::enqueue_sgemm.

Every library actor (BlasActor, CudnnActor, FftActor, RngActor, …) follows the same pattern:

  1. Validate every input GpuRef via GpuRef::access and turn the strong Arc<CudaSlice<T>> into a temporary owner that keeps the buffer alive past kernel completion.
  2. Synchronously enqueue the kernel onto the actor’s stream. The enqueue body is library-specific and provided as a closure.
  3. Spawn an async task that awaits the configured CompletionStrategy, delivers the reply on a oneshot::Sender, and only then drops the temporary owners (so that the kernel can’t outlive its inputs).

The envelope handles step 3 uniformly. Pre-launch errors are reported synchronously through the same oneshot. Post-launch errors arrive through the completion future.
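The three steps above can be sketched without any GPU at all. The following is a minimal, std-only illustration rather than the crate's implementation: `run_kernel_sketch`, `KernelError`, and `Buffer` are hypothetical names, a `std::sync::mpsc` channel stands in for the `oneshot`, a thread stands in for the spawned async task, and the `completion` closure stands in for awaiting the `CompletionStrategy`.

```rust
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for a device buffer (the real type is CudaSlice<T>).
type Buffer = Vec<f32>;

/// Hypothetical error type covering both failure paths.
#[derive(Debug, PartialEq)]
enum KernelError {
    PreLaunch(String),
    PostLaunch(String),
}

/// Sketch of the envelope. `inputs` are the temporary owners produced by
/// step 1, `enqueue` is the library-specific closure from step 2, and
/// `completion` models the wait performed in step 3.
fn run_kernel_sketch<E, C>(
    inputs: Vec<Arc<Buffer>>,
    enqueue: E,
    completion: C,
) -> mpsc::Receiver<Result<(), KernelError>>
where
    E: FnOnce(&[Arc<Buffer>]) -> Result<(), String>,
    C: FnOnce() -> Result<(), String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();

    // Step 2: enqueue synchronously. A failure here is a pre-launch error
    // and is reported through the same channel as a normal reply.
    if let Err(msg) = enqueue(&inputs) {
        let _ = tx.send(Err(KernelError::PreLaunch(msg)));
        return rx;
    }

    // Step 3: wait off the caller's thread, deliver the reply, and only
    // then release the owners that kept the inputs alive.
    thread::spawn(move || {
        let result = completion().map_err(KernelError::PostLaunch);
        let _ = tx.send(result);
        drop(inputs);
    });

    rx
}
```

The caller receives the reply on the returned receiver in both the pre-launch and post-launch cases, which is the property the envelope is described as providing.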

§Single-writer enforcement

cudarc’s library APIs typically take &mut Dst for the write target, and in cudarc 0.19 CudaSlice<T> satisfies that requirement. Because a GpuRef<T> wraps Arc<CudaSlice<T>>, a caller that wants write access to a buffer must hold the only reference to that GpuRef, so that Arc::try_unwrap succeeds inside the actor. Each library actor enforces this contract explicitly. The envelope does not, because the write pattern differs by library: some (cuBLAS gemm with non-zero beta) read-modify-write the output, while others (cuDNN forward conv) write to a freshly allocated output.
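The uniqueness requirement follows directly from how `Arc::try_unwrap` behaves. The sketch below uses a hypothetical `GpuRefSketch` wrapper and a `Vec<f32>` in place of `CudaSlice<T>`; only `Arc::try_unwrap` itself is a real API here.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for CudaSlice<T>.
type Buffer = Vec<f32>;

/// Hypothetical handle modelled on the description of GpuRef<T>.
struct GpuRefSketch {
    inner: Arc<Buffer>,
}

/// Take exclusive ownership of the buffer so it can be borrowed as `&mut`.
/// If any other strong reference exists, the handle is returned unchanged.
fn take_for_write(r: GpuRefSketch) -> Result<Buffer, GpuRefSketch> {
    Arc::try_unwrap(r.inner).map_err(|inner| GpuRefSketch { inner })
}
```

A handle whose `Arc` has been cloned elsewhere comes back as `Err`, which is the point at which an actor following this contract would reject the write request.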

§Observability hooks (Phase 0.7)

KernelEnvelope is an opt-in builder that wraps the same pipeline with two observability surfaces:

  • A KernelTrace callback that fires four lifecycle events (before_enqueue, after_enqueue, before_complete, after_complete). The trait is always compiled — when no trace is set, the envelope skips the calls entirely.
  • An optional NVTX range label. When the nvtx cargo feature is on, the synchronous-enqueue body is wrapped in a cudarc::nvtx::safe::scoped_range guard. When the feature is off, the field is unused and adds no runtime cost.

Existing callers that use the free run_kernel function continue to behave byte-for-byte identically: that path constructs a default (trace-less, nvtx-less) envelope.
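As a rough illustration of the opt-in design, the sketch below shows a builder with an optional trace and an optional label. All type and method names (`EnvelopeSketch`, `KernelTraceSketch`, `KernelInfoSketch`, `with_trace`, `with_nvtx_label`, `run`) are hypothetical, as is the `name` field; the real `KernelEnvelope` and `KernelInfo` signatures are not shown on this page. The NVTX range is represented only by a comment, since it depends on a cargo feature.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical per-launch metadata; the real KernelInfo fields are not listed here.
struct KernelInfoSketch {
    name: &'static str,
}

/// Four lifecycle hooks with empty default bodies.
trait KernelTraceSketch: Send + Sync {
    fn before_enqueue(&self, _info: &KernelInfoSketch) {}
    fn after_enqueue(&self, _info: &KernelInfoSketch) {}
    fn before_complete(&self, _info: &KernelInfoSketch) {}
    fn after_complete(&self, _info: &KernelInfoSketch) {}
}

/// Opt-in builder: both fields default to `None`.
#[derive(Default)]
struct EnvelopeSketch {
    trace: Option<Arc<dyn KernelTraceSketch>>,
    nvtx_label: Option<&'static str>,
}

impl EnvelopeSketch {
    fn with_trace(mut self, trace: Arc<dyn KernelTraceSketch>) -> Self {
        self.trace = Some(trace);
        self
    }

    fn with_nvtx_label(mut self, label: &'static str) -> Self {
        self.nvtx_label = Some(label);
        self
    }

    /// Runs `enqueue` then `wait`, firing hooks only when a trace is set.
    fn run(&self, info: &KernelInfoSketch, enqueue: impl FnOnce(), wait: impl FnOnce()) {
        // A feature-gated build would open a profiler range around `enqueue` here.
        let _label = self.nvtx_label;
        if let Some(t) = &self.trace { t.before_enqueue(info); }
        enqueue();
        if let Some(t) = &self.trace { t.after_enqueue(info); }
        if let Some(t) = &self.trace { t.before_complete(info); }
        wait();
        if let Some(t) = &self.trace { t.after_complete(info); }
    }
}

/// Example trace that overrides a single event.
struct Recorder {
    events: Mutex<Vec<String>>,
}

impl KernelTraceSketch for Recorder {
    fn after_complete(&self, info: &KernelInfoSketch) {
        self.events.lock().unwrap().push(format!("after_complete:{}", info.name));
    }
}
```

With `EnvelopeSketch::default()`, no hook is ever called, which mirrors the description of the trace-less, nvtx-less default path.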

Structs§

KernelEnvelope
Builder/configuration for a single run_kernel invocation.
KernelInfo
Per-launch metadata passed to every KernelTrace callback.

Traits§

KernelTrace
Lifecycle hook receiver. All four methods have empty default bodies, so a custom trace can override only the events it cares about.

Functions§

access_all_2
Validate two input GpuRefs and return owning Arcs of their underlying slices. Fails fast (synchronously) with GpuRefStale if either is invalid.
access_all_3
Validate three input GpuRefs and return owning Arcs of their underlying slices.
access_all_4
Validate four input GpuRefs and return owning Arcs. Used by cuDNN convolution which takes (input, filter, output, workspace).
run_kernel
Run the synchronous-enqueue + async-completion-await pipeline.
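The fail-fast behaviour described for the access_all_N helpers can be sketched as follows. This is an illustration under assumptions, not the crate's code: `GpuRefSketch`, its `live` flag, and `access_all_2_sketch` are hypothetical, and how the real `GpuRef::access` detects a stale reference is not described on this page.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for CudaSlice<T>.
type Buffer = Vec<f32>;

/// Error named after the one mentioned above; modelled here as a unit struct.
#[derive(Debug, PartialEq)]
struct GpuRefStale;

/// Hypothetical handle. The `live` flag is an assumed staleness mechanism.
struct GpuRefSketch {
    inner: Arc<Buffer>,
    live: Arc<AtomicBool>,
}

impl GpuRefSketch {
    /// Return an owning Arc if the handle is still valid.
    fn access(&self) -> Result<Arc<Buffer>, GpuRefStale> {
        if self.live.load(Ordering::Acquire) {
            Ok(Arc::clone(&self.inner))
        } else {
            Err(GpuRefStale)
        }
    }
}

/// Validate both inputs up front; the first stale handle aborts the call
/// before anything is enqueued.
fn access_all_2_sketch(
    a: &GpuRefSketch,
    b: &GpuRefSketch,
) -> Result<(Arc<Buffer>, Arc<Buffer>), GpuRefStale> {
    Ok((a.access()?, b.access()?))
}
```

The returned `Arc`s are the temporary owners from step 1 of the pipeline: holding them keeps the buffers alive until the completion task drops them.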