Heuristic cache for cuBLASLt matmul algorithms.
cuBLASLt’s cublasLtMatmulAlgoGetHeuristic is a synchronous
library call that takes single-digit milliseconds. For repeated
shapes (every iteration of a transformer step) we cache the
best-by-wall-time algorithm under
(m, n, k, dtype, layout, epilogue, sm_arch) and reuse it.
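As a rough illustration of the idea above, the sketch below keys a plain HashMap on the same seven fields and runs the slow query only on a miss. The enums, the Key and Entry structs, and the lookup helper are hypothetical stand-ins, not this crate's actual types, and the closure merely simulates the cuBLASLt call.

```rust
use std::collections::HashMap;

// Hypothetical enums standing in for the crate's real dtype/layout/epilogue types.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum DType { F16, Bf16, F32 }

#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum Layout { RowMajor, ColMajor }

#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum Epilogue { None, Bias, BiasGelu }

/// Illustrative key: every field that can change which algorithm is best.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Key {
    m: u64,
    n: u64,
    k: u64,
    dtype: DType,
    layout: Layout,
    epilogue: Epilogue,
    sm_arch: u32,
}

/// Stand-in for the cached value (an opaque algorithm id plus a workspace size).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Entry {
    algo_id: u64,
    workspace_size: usize,
}

/// Returns the cached entry, running `slow_query` only on a miss.
fn lookup(
    cache: &mut HashMap<Key, Entry>,
    key: Key,
    slow_query: impl FnOnce() -> Entry,
) -> Entry {
    *cache.entry(key).or_insert_with(slow_query)
}

fn main() {
    let mut cache = HashMap::new();
    let key = Key {
        m: 4096,
        n: 4096,
        k: 1024,
        dtype: DType::Bf16,
        layout: Layout::RowMajor,
        epilogue: Epilogue::Bias,
        sm_arch: 90,
    };
    let mut queries = 0;
    for _ in 0..3 {
        // The closure simulates the expensive heuristic call.
        lookup(&mut cache, key, || {
            queries += 1;
            Entry { algo_id: 7, workspace_size: 1 << 20 }
        });
    }
    println!("heuristic queries: {queries}");
}
```

Because the key derives Hash and Eq over all seven fields, changing any one of them (for example the epilogue) produces a separate entry rather than reusing an algorithm chosen for a different problem.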
Key design points:
- LRU eviction. Capacity defaults to 256 entries: large enough to cover a model’s full shape repertoire, small enough to stay within a few tens of KiB of host RAM (each cublasLtMatmulAlgo_t is 64 bytes, before the key and workspace hint).
- The cache lives in a parking_lot::Mutex<lru::LruCache> behind an Arc, so a cloneable HeuristicCacheRef can flow into per-message BlasLtDispatchCtx without Send headaches.
- We store the raw cublasLtMatmulAlgo_t plus a workspace_size hint; the actor’s WorkspacePool uses the workspace size to recycle the right slot.
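The design points above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the crate's implementation: std::sync::Mutex stands in for parking_lot::Mutex, the hand-rolled TinyLru stands in for lru::LruCache, and CacheRef is a hypothetical analogue of HeuristicCacheRef whose key is reduced to (m, n, k) and whose value is just a workspace size.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Tiny O(n) LRU used only for illustration; a stand-in for `lru::LruCache`.
struct TinyLru<K, V> {
    cap: usize,
    items: Vec<(K, V)>, // least recently used entry sits at the front
}

impl<K: PartialEq, V: Clone> TinyLru<K, V> {
    fn new(cap: usize) -> Self {
        assert!(cap > 0, "capacity must be non-zero");
        Self { cap, items: Vec::with_capacity(cap) }
    }

    /// Returns a copy of the value and marks the key as most recently used.
    fn get(&mut self, key: &K) -> Option<V> {
        let pos = self.items.iter().position(|(k, _)| k == key)?;
        let hit = self.items.remove(pos);
        let value = hit.1.clone();
        self.items.push(hit);
        Some(value)
    }

    /// Inserts or refreshes a key, evicting the least recently used entry when full.
    fn put(&mut self, key: K, value: V) {
        if let Some(pos) = self.items.iter().position(|(k, _)| *k == key) {
            self.items.remove(pos);
        } else if self.items.len() == self.cap {
            self.items.remove(0);
        }
        self.items.push((key, value));
    }
}

/// Hypothetical analogue of `HeuristicCacheRef`: cloning copies only the Arc.
#[derive(Clone)]
struct CacheRef(Arc<Mutex<TinyLru<(u64, u64, u64), usize>>>);

impl CacheRef {
    fn new(cap: usize) -> Self {
        Self(Arc::new(Mutex::new(TinyLru::new(cap))))
    }
    fn get(&self, key: (u64, u64, u64)) -> Option<usize> {
        self.0.lock().unwrap().get(&key)
    }
    fn put(&self, key: (u64, u64, u64), workspace_size: usize) {
        self.0.lock().unwrap().put(key, workspace_size)
    }
}

fn main() {
    let cache = CacheRef::new(2);
    cache.put((128, 128, 64), 1 << 20);

    // A clone moved into another thread sees the same entries.
    let worker = cache.clone();
    let seen = thread::spawn(move || worker.get((128, 128, 64))).join().unwrap();
    assert_eq!(seen, Some(1 << 20));

    // Capacity is 2, so a third distinct key evicts the least recently used one.
    cache.put((256, 256, 64), 2 << 20);
    let _ = cache.get((128, 128, 64)); // refresh the first key
    cache.put((512, 512, 64), 4 << 20);
    assert_eq!(cache.get((256, 256, 64)), None);
    assert_eq!(cache.get((128, 128, 64)), Some(1 << 20));
}
```

Note that even a read takes the lock mutably, because a hit reorders the LRU list. That is one reason to prefer a plain mutex over a reader-writer lock for this kind of cache.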
Structs
- HeuristicCacheRef - Shareable handle to the heuristic cache. Cheap to clone.
- HeuristicEntry - Cached value: best algorithm by wall-time plus the workspace size the heuristic reported.
- HeuristicKey - Cache key. Fully self-describing, so two requests with the same shape/layout/dtype/epilogue/arch tuple land in the same bucket.
Constants
- DEFAULT_HEURISTIC_CAPACITY - Default capacity of the heuristic cache.
- DEFAULT_TOP_K - Default number of candidate algorithms (top-k) to request from cuBLASLt on each cold lookup. We keep the best by waves_count and discard the rest.
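The top-k selection described for DEFAULT_TOP_K can be sketched as follows. Everything here is illustrative: Candidate is a hypothetical mirror of the fields of interest in one heuristic result, TOP_K is an arbitrary value rather than the crate's default, and the sketch assumes that a lower waves_count is better, which this page does not state. That assumption is confined to the single comparator line.

```rust
/// Hypothetical mirror of the fields of interest in one heuristic result.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Candidate {
    algo_id: u64, // stand-in for the opaque cublasLtMatmulAlgo_t
    workspace_size: usize,
    waves_count: f32,
}

/// Illustrative value only; not the crate's DEFAULT_TOP_K.
const TOP_K: usize = 4;

/// Keeps one candidate out of at most TOP_K and discards the rest.
/// ASSUMPTION: the candidate with the lowest waves_count wins.
fn pick_best(candidates: &[Candidate]) -> Option<Candidate> {
    candidates
        .iter()
        .take(TOP_K)
        .copied()
        .min_by(|a, b| a.waves_count.total_cmp(&b.waves_count))
}

fn main() {
    let candidates = [
        Candidate { algo_id: 1, workspace_size: 0, waves_count: 2.5 },
        Candidate { algo_id: 2, workspace_size: 1 << 20, waves_count: 1.0 },
        Candidate { algo_id: 3, workspace_size: 1 << 22, waves_count: 1.75 },
    ];
    let best = pick_best(&candidates).expect("at least one candidate");
    println!("kept algo {} with workspace {}", best.algo_id, best.workspace_size);
}
```

Using f32::total_cmp gives a total order over floats, so the selection does not panic or behave inconsistently if a candidate reports NaN.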