Module tma

Tensor Memory Accelerator (TMA) host-side descriptor builder.

On Hopper (sm_90+) and Blackwell, TMA decouples global → shared memory tile copies from the threads that issue them: the kernel issues a cp.async.bulk.tensor.Nd.shared::cluster.global instruction (N = 1 to 5, destination state space first, then source) against an opaque CUtensorMap that the host built once via cuTensorMapEncodeTiled. The kernel then waits on a barrier (mbarrier.try_wait) for the copy to land.

TensorMapDescriptor is a host-side builder for the tiled flavour of the encode call. The free encode method returns the 128-byte tensor-map struct that cudarc exposes as cudarc::driver::sys::CUtensorMap. This module makes no attempt to cover the im2col / im2col-wide flavours, which have shape-specific descriptor sets that fit poorly into a uniform builder.
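To make the inputs to a tiled encode call concrete, the following is a minimal sketch of the host-side bookkeeping such a builder has to do before calling cuTensorMapEncodeTiled. The names OpaqueTensorMap, TiledParams, packed_strides_bytes and check are hypothetical and are not part of this module. The limits checked (rank at most 5, box extents between 1 and 256, byte strides a multiple of 16) reflect our reading of the CUDA driver documentation and should be verified against the CUDA version in use.

```rust
/// Hypothetical stand-in for the opaque 128-byte tensor map
/// (16 x u64, 64-byte aligned, mirroring the layout of CUtensorMap).
#[repr(C, align(64))]
#[derive(Clone, Copy)]
pub struct OpaqueTensorMap {
    pub opaque: [u64; 16],
}

/// Hypothetical parameter bundle for a tiled encode call.
/// Dimensions are listed innermost (fastest-varying) first.
pub struct TiledParams {
    pub elem_bytes: u64,
    pub global_dim: Vec<u64>,
    pub box_dim: Vec<u32>,
}

impl TiledParams {
    /// Byte strides for dimensions 1..rank of a densely packed tensor.
    /// The innermost stride is implicit, so rank - 1 values are returned.
    pub fn packed_strides_bytes(&self) -> Vec<u64> {
        let outer = self.global_dim.len().saturating_sub(1);
        let mut strides = Vec::with_capacity(outer);
        let mut acc = self.elem_bytes;
        for &d in &self.global_dim[..outer] {
            acc *= d;
            strides.push(acc);
        }
        strides
    }

    /// Validate the assumed driver-side constraints before encoding.
    pub fn check(&self) -> Result<(), String> {
        let rank = self.global_dim.len();
        if rank == 0 || rank > 5 {
            return Err(format!("rank {rank} is outside 1..=5"));
        }
        if self.box_dim.len() != rank {
            return Err("box_dim and global_dim ranks differ".to_string());
        }
        if self.box_dim.iter().any(|&b| b == 0 || b > 256) {
            return Err("every box extent must be in 1..=256".to_string());
        }
        if self.packed_strides_bytes().iter().any(|s| s % 16 != 0) {
            return Err("byte strides must be multiples of 16".to_string());
        }
        Ok(())
    }
}
```

For a 1024 x 512 tensor of 2-byte elements with a 64 x 32 box, packed_strides_bytes yields a single stride of 2048 bytes and check succeeds; shrinking the inner dimension to 1023 gives a 2046-byte stride, which check rejects.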

Structs§

TensorMapDescriptor: Host-side builder for cuTensorMapEncodeTiled, the tiled flavour of the tensor-map encode call.

Enums§

TensorMapDataType: Element dtype consumed by the TMA. Matches cudarc::driver::sys::CUtensorMapDataType_enum one-to-one. We duplicate the enum so callers don’t depend on cudarc’s sys module directly (which is gated on a CUDA-version feature in cudarc).
TensorMapInterleave: Interleave layout for 1D / 2D / 3D bulk-tile copies.
TensorMapL2Promotion: L2 promotion hint, i.e. the granularity (none, 64 B, 128 B, or 256 B) at which TMA fetches are promoted in L2.
TensorMapOobFill: Out-of-bounds fill mode for partial tiles.
TensorMapSwizzle: Shared-memory swizzle pattern. A swizzled load/store permutes data within shared-memory rows so that accesses which would otherwise land on the same banks are spread across different ones.
TmaEncodeError
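
The effect that TensorMapSwizzle describes can be illustrated with a simplified model. The sketch below assumes 128-byte shared-memory rows split into eight 16-byte chunks, 32 four-byte banks, and an XOR of the chunk index with the row index modulo 8, which is one scheme commonly described for the 128-byte swizzle mode. The function names are hypothetical, and the model is not a bit-exact statement of what the hardware or this module does.

```rust
use std::collections::HashSet;

/// Hypothetical model of a 128-byte XOR swizzle: the 16-byte chunk index
/// within a row is XORed with the row index modulo 8.
pub fn swizzle_128b_chunk(row: u32, chunk: u32) -> u32 {
    debug_assert!(chunk < 8);
    chunk ^ (row % 8)
}

/// Bank group (0..8, four banks each) holding logical chunk `chunk` of `row`,
/// assuming 128-byte rows stored back to back across 32 four-byte banks.
pub fn bank_group(row: u32, chunk: u32, swizzled: bool) -> u32 {
    if swizzled {
        swizzle_128b_chunk(row, chunk)
    } else {
        chunk
    }
}

/// Number of distinct bank groups touched when eight consecutive rows
/// each read the same logical chunk (a column-wise access).
pub fn distinct_groups(chunk: u32, swizzled: bool) -> usize {
    (0..8u32)
        .map(|row| bank_group(row, chunk, swizzled))
        .collect::<HashSet<u32>>()
        .len()
}
```

In this model a column-wise read of one chunk across eight rows hits a single bank group without swizzling and eight different bank groups with it, because XOR with a fixed chunk index is a bijection on 0..8.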
