Expand description
Phase 5: Hopper / Blackwell host-side primitives. The module
surface is always compiled (the tma::TensorMapDescriptor builder
and cluster::LaunchSpec types are useful even on hosts that don’t
link a Hopper driver). The hopper cargo feature gates the FFI
implementations of cuTensorMapEncodeTiled / cudaLaunchKernelExC.
Hopper / Blackwell primitives (Phase 5).
Hopper (sm_90 / sm_90a) introduced four kernel-side primitives that materially change how high-throughput CUDA kernels are written:
- Tensor Memory Accelerator (TMA) — bulk asynchronous tensor
copies between global and shared memory described by an opaque
CUtensorMap(built host-side viacuTensorMapEncodeTiled). See [tma]. - WGMMA — warp-group matrix multiply accumulate, the
successor to MMA. Issued from a warpgroup of 128 threads via
wgmma.mma_async.sync. See [wgmma]. cp.async— already on Ampere, but Hopper adds the bulk-asynchronous TMA-drivencp.async.bulkthat fences with barrier objects. See [cp_async].- Thread-block clusters — a new launch dimension above grid /
block that exposes Distributed Shared Memory (DSM) and the
cluster.syncbarrier. See [cluster].
Blackwell (sm_100 / sm_120) adds the second-generation TMA, larger
cluster sizes, the new fp4 / fp6 / mxfp variants, and tensor memory
(TMEM) that backs tcgen05.mma. The blackwell cargo feature gates
the additional intrinsics; the host-side wrappers are shared with
Hopper through this module.
§Layout
- [
tma] —TensorMapDescriptorbuilder + the safe wrapper aroundcuTensorMapEncodeTiled. - [
wgmma] — public re-exports of the macro-definedwgmma_*intrinsics (definitions live ininclude/atomr_hopper.cuh). - [
cp_async] —cp.asyncpipeline macro shims. - [
cluster] — [LaunchSpec] and the safe wrapper aroundcudaLaunchKernelExCfor cluster-dim launches; DSM helpers.
Re-exports§
pub use cluster::ClusterDim;pub use cluster::LaunchSpec;pub use tma::TensorMapDataType;pub use tma::TensorMapDescriptor;pub use tma::TensorMapInterleave;pub use tma::TensorMapSwizzle;
Modules§
- cluster
- Thread-block cluster launches + Distributed Shared Memory (DSM) helpers.
- cp_
async cp.asyncpipeline macro shim.- tma
- Tensor Memory Accelerator (TMA) host-side descriptor builder.
- wgmma
- WGMMA (warp-group matrix multiply accumulate) intrinsic shim.
Constants§
- ATOMR_
HOPPER_ HEADER_ REL_ PATH - Path to the vendored hopper header (
atomr_hopper.cuh) shipped alongside the crate. NVRTC kernels can--include-paththis and#include "atomr_hopper.cuh"to pick up the wgmma / cp.async / cluster macro shims.
Functions§
- atomr_
hopper_ header_ path - Returns the absolute filesystem path to
atomr_hopper.cuhif it exists alongside the crate sources. ReturnsNonefor installations that strip theinclude/directory (e.g. crates.io binary publication of just the compiled lib).