Tensor Memory Accelerator (TMA) host-side descriptor builder.
On Hopper (sm_90+) and Blackwell, TMA decouples global → shared
memory tile copies from the threads that issue them: the kernel
issues cp.async.bulk.tensor.Nd.shared::cluster.global (N = 1 to 5,
destination state space first) against an opaque CUtensorMap, which
the host builds once via cuTensorMapEncodeTiled. The kernel then waits
on a barrier (mbarrier.try_wait) for the copy to land.
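As a rough illustration of what "the copy to land" means in terms of size, the sketch below computes how many bytes one tiled copy delivers to shared memory: the product of the tile (box) extents and the element size. The function name `tile_bytes` is hypothetical and not part of this module; it is only a worked arithmetic example.

```rust
/// Bytes delivered by one tiled copy: element size times the product of
/// the box (tile) extents. Hypothetical helper, not part of this module.
/// Returns `None` for an empty box, a zero extent, a zero element size,
/// or on overflow.
pub fn tile_bytes(box_dims: &[u32], elem_size_bytes: u32) -> Option<u64> {
    if box_dims.is_empty() || elem_size_bytes == 0 {
        return None;
    }
    box_dims
        .iter()
        .try_fold(u64::from(elem_size_bytes), |acc, &d| {
            if d == 0 {
                None
            } else {
                acc.checked_mul(u64::from(d))
            }
        })
}
```

For a 64 x 64 tile of 2-byte elements, `tile_bytes(&[64, 64], 2)` evaluates to `Some(8192)`.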
TensorMapDescriptor is a host-side builder for the tiled
flavour of the encode call. The free encode method returns the
128-byte tensor-map struct that cudarc surfaces as
cudarc::driver::sys::CUtensorMap. This module makes no attempt to
cover the im2col / im2col-wide flavours: those have shape-specific
descriptor sets that fit poorly into a uniform builder.
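The sketch below shows the kind of pre-flight checks a host-side builder for the tiled encode call might run before touching the driver. Every name here (`OpaqueTensorMap`, `TiledSpec`, `SpecError`, `validate`) is hypothetical and is not this module's API. The limits (rank 1 to 5, 16-byte-aligned base address, strides in multiples of 16 bytes, box extents of at most 256, inner box row a multiple of 16 bytes when no interleave is used) follow the cuTensorMapEncodeTiled documentation as understood at the time of writing, and should be checked against the CUDA version in use.

```rust
/// Stand-in for the driver's opaque tensor map: 16 x u64 = 128 bytes,
/// 64-byte aligned. Hypothetical type, not cudarc's `CUtensorMap`.
#[repr(C, align(64))]
#[derive(Clone, Copy, Debug)]
pub struct OpaqueTensorMap {
    pub opaque: [u64; 16],
}

// Compile-time check of the 128-byte size.
const _: () = assert!(std::mem::size_of::<OpaqueTensorMap>() == 128);

/// Hypothetical parameter bundle for a tiled encode call (no interleave).
#[derive(Debug, Clone)]
pub struct TiledSpec {
    pub elem_size_bytes: u32,
    /// Device address of the tensor base.
    pub global_addr: u64,
    /// Tensor extents in elements, innermost dimension first.
    pub global_dims: Vec<u64>,
    /// Byte strides for dimensions 1..rank (the innermost is dense).
    pub global_strides_bytes: Vec<u64>,
    /// Tile extents in elements, innermost dimension first.
    pub box_dims: Vec<u32>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SpecError {
    BadRank(usize),
    LengthMismatch,
    MisalignedAddress,
    BadStride(usize),
    BadBoxDim(usize),
    InnerRowNotMultipleOf16,
}

impl TiledSpec {
    /// Checks the documented limits (assumed, see lead-in) in a fixed order.
    pub fn validate(&self) -> Result<(), SpecError> {
        let rank = self.global_dims.len();
        if rank == 0 || rank > 5 {
            return Err(SpecError::BadRank(rank));
        }
        if self.box_dims.len() != rank || self.global_strides_bytes.len() != rank - 1 {
            return Err(SpecError::LengthMismatch);
        }
        if self.global_addr % 16 != 0 {
            return Err(SpecError::MisalignedAddress);
        }
        for (i, &stride) in self.global_strides_bytes.iter().enumerate() {
            if stride == 0 || stride % 16 != 0 {
                // Strides describe dimensions 1.., hence the offset.
                return Err(SpecError::BadStride(i + 1));
            }
        }
        for (i, &extent) in self.box_dims.iter().enumerate() {
            if extent == 0 || extent > 256 {
                return Err(SpecError::BadBoxDim(i));
            }
        }
        let inner_bytes = u64::from(self.box_dims[0]) * u64::from(self.elem_size_bytes);
        if inner_bytes % 16 != 0 {
            return Err(SpecError::InnerRowNotMultipleOf16);
        }
        Ok(())
    }
}
```

Validating on the host before the driver call lets a builder report which argument was wrong, rather than passing on a single generic invalid-value error.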
Structs
- TensorMapDescriptor - Host-side builder for the tiled flavour of cuTensorMapEncodeTiled.
Enums
- TensorMapDataType - Element dtype consumed by the TMA. Matches cudarc::driver::sys::CUtensorMapDataType_enum one-to-one. We duplicate the enum so callers don’t depend on cudarc’s sys module directly (which is gated on a CUDA-version feature in cudarc).
- TensorMapInterleave - Interleave layout for 1D / 2D / 3D bulk-tile copies.
- TensorMapL2Promotion - L2 promotion hint: the subset of L2 cache the TMA is allowed to promote into.
- TensorMapOobFill - Out-of-bounds fill mode for partial tiles.
- TensorMapSwizzle - Shared-memory swizzle pattern. A swizzled load/store interleaves rows so 4-thread bank conflicts can’t arise.
- TmaEncodeError
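A minimal sketch of the enum-duplication idea described above: a crate-local enum that mirrors the driver's values and converts to the raw integer on demand, so callers never name the sys type. The type and variant names (`DataType`, `Swizzle`, `inner_row_fits`) are hypothetical and are not this module's definitions. The discriminants are believed to match the CUtensorMapDataType and CUtensorMapSwizzle enums in cuda.h, but only a subset is shown and the values should be verified against the headers. The swizzle check assumes the driver's documented rule that, without interleave, the inner box row must not exceed the swizzle span.

```rust
/// Hypothetical mirror of a subset of the driver's element-type enum.
/// Discriminants are assumed to match cuda.h; verify before relying on them.
#[repr(u32)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataType {
    U8 = 0,
    U16 = 1,
    U32 = 2,
    I32 = 3,
    U64 = 4,
    I64 = 5,
    F16 = 6,
    F32 = 7,
    F64 = 8,
    Bf16 = 9,
}

impl DataType {
    /// Size of one element in bytes.
    pub fn size_bytes(self) -> u32 {
        match self {
            DataType::U8 => 1,
            DataType::U16 | DataType::F16 | DataType::Bf16 => 2,
            DataType::U32 | DataType::I32 | DataType::F32 => 4,
            DataType::U64 | DataType::I64 | DataType::F64 => 8,
        }
    }
}

impl From<DataType> for u32 {
    /// Raw value to hand to the FFI layer.
    fn from(dt: DataType) -> u32 {
        dt as u32
    }
}

/// Hypothetical mirror of the swizzle modes (none, 32 B, 64 B, 128 B).
#[repr(u32)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Swizzle {
    None = 0,
    B32 = 1,
    B64 = 2,
    B128 = 3,
}

impl Swizzle {
    /// Swizzle span in bytes, or `None` when swizzling is disabled.
    pub fn span_bytes(self) -> Option<u32> {
        match self {
            Swizzle::None => None,
            Swizzle::B32 => Some(32),
            Swizzle::B64 => Some(64),
            Swizzle::B128 => Some(128),
        }
    }
}

/// True when the inner box row fits within the swizzle span
/// (always true when swizzling is disabled).
pub fn inner_row_fits(swizzle: Swizzle, box_dim0: u32, dtype: DataType) -> bool {
    match swizzle.span_bytes() {
        None => true,
        Some(span) => u64::from(box_dim0) * u64::from(dtype.size_bytes()) <= u64::from(span),
    }
}
```

With these definitions, a 64-element row of `F16` (128 bytes) fits a 128-byte swizzle but not a 64-byte one.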