fp8 scaling-factor helpers.
Hopper+ fp8 cuBLAS calls (cublasGemmEx with CUDA_R_8F_E4M3 /
CUDA_R_8F_E5M2 operands) take a per-tensor or per-row scaling
factor that brings the input/output values into the representable
fp8 range. This module factors out the small bookkeeping helpers
used by both cuBLAS and cuBLASLt fp8 paths so they don’t have to
be duplicated.
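To make "brings the values into the representable fp8 range" concrete, here is a minimal host-side sketch of the common amax-based recipe. It assumes that convention, which this module may not follow exactly. The scale is s = FP8_MAX / amax, values are multiplied by s and saturated, and dequantization multiplies by 1/s. The constants are the largest finite values of the OCP E4M3 (448) and E5M2 (57344) formats. `quant_scale` and `scale_and_clamp` are hypothetical names, not this crate's API, and no actual fp8 rounding is modelled.

```rust
/// Largest finite magnitude in fp8 E4M3 (OCP "E4M3FN" variant).
const FP8_E4M3_MAX: f32 = 448.0;
/// Largest finite magnitude in fp8 E5M2.
const FP8_E5M2_MAX: f32 = 57344.0;

/// Hypothetical helper: amax-based quantization scale for a whole tensor.
/// Returns 1.0 for an empty or all-zero tensor so the scale stays finite.
fn quant_scale(values: &[f32], fp8_max: f32) -> f32 {
    let amax = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    if amax == 0.0 { 1.0 } else { fp8_max / amax }
}

/// Hypothetical helper: scale a value into fp8 range and saturate.
/// Host-side model only; the result is still an f32, not a rounded fp8 value.
fn scale_and_clamp(x: f32, scale: f32, fp8_max: f32) -> f32 {
    (x * scale).clamp(-fp8_max, fp8_max)
}

fn main() {
    let activations = [0.5f32, -3.5, 1.75, 2.0];

    // amax = 3.5, so the E4M3 scale is 448 / 3.5 = 128.
    let s = quant_scale(&activations, FP8_E4M3_MAX);
    // E5M2 trades precision for range: 57344 / 3.5 = 16384.
    let s_e5m2 = quant_scale(&activations, FP8_E5M2_MAX);

    // The largest-magnitude value lands exactly on the E4M3 limit (-448).
    let scaled: Vec<f32> = activations
        .iter()
        .map(|&x| scale_and_clamp(x, s, FP8_E4M3_MAX))
        .collect();

    // Dequantization multiplies by 1/s (often called the inverse scale).
    let restored: Vec<f32> = scaled.iter().map(|&q| q / s).collect();

    println!("e4m3 scale = {s}, e5m2 scale = {s_e5m2}");
    println!("scaled = {scaled:?}, restored = {restored:?}");
}
```

Deriving the scale from amax uses the full fp8 range for the largest value in the tensor. The clamp guards against values that exceed the amax the scale was computed from, for example when the scale is reused from an earlier step.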
The full fp8 path is enabled by the cublas-fp8 cargo feature. It is
currently scaffolded: the Phase 1 cuBLAS slice ships only the helper
types, and the wired call site lives in cuBLASLt’s own module.
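The sketch below shows one plausible way to arrange this kind of feature gating. The helper types compile unconditionally, and the wired GEMM entry point exists only when the crate is built with `--features cublas-fp8`. The module layout, the `fp8_gemm_available` function, and the `PerTensorScale` field are hypothetical stand-ins. Only the feature name comes from the text above, and the crate's real gating may differ.

```rust
/// Helper types ship regardless of the feature (hypothetical stand-in module).
pub mod fp8_scale {
    /// Stand-in for the per-tensor scale; the real field layout is not shown here.
    #[derive(Debug, Clone, Copy, PartialEq)]
    pub struct PerTensorScale(pub f32);
}

/// Compiled only when the crate is built with `--features cublas-fp8`.
#[cfg(feature = "cublas-fp8")]
pub fn fp8_gemm_available() -> bool {
    true
}

/// Fallback when the feature is off, so callers can branch at runtime
/// instead of hitting a missing symbol at compile time.
#[cfg(not(feature = "cublas-fp8"))]
pub fn fp8_gemm_available() -> bool {
    false
}

fn main() {
    let s = fp8_scale::PerTensorScale(128.0);
    println!("scale = {:?}, fp8 GEMM path enabled: {}", s, fp8_gemm_available());
}
```

Keeping the helpers outside the `cfg` lets code that only prepares scales compile in every build, while the GEMM call stays behind the feature.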
Structs
- PerRowScale - Per-row scaling factor: a vector of m scalars, one per row of the matrix. Stored device-side; the caller passes a GpuRef<f32> when the cuBLASLt descriptor accepts row-wise amax.
- PerTensorScale - Per-tensor scaling factor: a single multiplicative scalar.
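This sketch shows why a per-row scale can preserve more precision than a per-tensor one. It assumes the same amax-based convention (scale = FP8_MAX / amax) and uses host-side stand-ins. The real PerRowScale keeps its m scalars device-side and is passed as a GpuRef<f32>, but here a plain `Vec<f32>` is used instead. The field names and the `from_matrix` constructors are hypothetical.

```rust
/// Largest finite magnitude in fp8 E4M3 (OCP "E4M3FN" variant).
const FP8_E4M3_MAX: f32 = 448.0;

/// Host-side stand-in for a per-tensor scale (hypothetical field layout).
#[derive(Debug, Clone, Copy, PartialEq)]
struct PerTensorScale {
    scale: f32,
}

/// Host-side stand-in for a per-row scale: one scalar per row of an m x n matrix.
/// The real type stores these on the device; a Vec is used here for illustration.
#[derive(Debug, Clone, PartialEq)]
struct PerRowScale {
    scales: Vec<f32>,
}

/// amax-based scale; 1.0 for an all-zero input so the result stays finite.
fn scale_from_amax(amax: f32, fp8_max: f32) -> f32 {
    if amax == 0.0 { 1.0 } else { fp8_max / amax }
}

fn amax(values: &[f32]) -> f32 {
    values.iter().fold(0.0f32, |m, v| m.max(v.abs()))
}

impl PerTensorScale {
    /// One scale for the whole matrix, from the global amax.
    fn from_matrix(data: &[f32], fp8_max: f32) -> Self {
        Self { scale: scale_from_amax(amax(data), fp8_max) }
    }
}

impl PerRowScale {
    /// `data` is row-major with `n` columns; yields m = data.len() / n scales.
    fn from_matrix(data: &[f32], n: usize, fp8_max: f32) -> Self {
        assert!(n > 0 && data.len() % n == 0, "data must hold whole rows");
        let scales = data
            .chunks(n)
            .map(|row| scale_from_amax(amax(row), fp8_max))
            .collect();
        Self { scales }
    }
}

fn main() {
    // 2 x 3 row-major matrix whose rows differ in magnitude by 64x.
    let a = [1.0f32, -7.0, 3.5, 0.0625, -0.109375, 0.03125];

    // Global amax = 7, so the single scale is 448 / 7 = 64. Row 1 then
    // peaks at 0.109375 * 64 = 7, using only a small slice of fp8's range.
    let per_tensor = PerTensorScale::from_matrix(&a, FP8_E4M3_MAX);

    // Row amaxes are 7 and 0.109375, giving scales 64 and 4096, so each
    // row's largest value maps exactly onto the E4M3 limit of 448.
    let per_row = PerRowScale::from_matrix(&a, 3, FP8_E4M3_MAX);

    println!("{per_tensor:?}");
    println!("{per_row:?}");
}
```

The trade-off is bookkeeping: a per-row scale needs m values, which must live where the GEMM can read them, whereas a per-tensor scale is a single scalar.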