Module heuristic

Heuristic cache for cuBLASLt matmul algorithms.

cuBLASLt’s cublasLtMatmulAlgoGetHeuristic is a synchronous library call that takes single-digit milliseconds. Because the same shapes recur on every iteration of a transformer step, we cache the best-by-wall-time algorithm under the key (m, n, k, dtype, layout, epilogue, sm_arch) and reuse it instead of repeating the query.
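
The lookup pattern described above can be sketched as follows. This is a minimal illustration, not this module's actual code: the field types, the DType/Layout/Epilogue enums, the [u64; 8] stand-in for cublasLtMatmulAlgo_t, and the lookup_or_query helper are all assumptions, and a plain HashMap stands in for the real cache.

```rust
use std::collections::HashMap;

// Hypothetical enums; the real key may encode these differently.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum DType { F16, Bf16, F32 }

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum Layout { RowMajor, ColMajor }

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum Epilogue { None, Bias, BiasGelu }

/// Assumed shape of the cache key: every field that influences the
/// heuristic's answer participates in equality and hashing.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct HeuristicKey {
    pub m: u64,
    pub n: u64,
    pub k: u64,
    pub dtype: DType,
    pub layout: Layout,
    pub epilogue: Epilogue,
    pub sm_arch: u32,
}

/// Assumed shape of the cached value.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct HeuristicEntry {
    /// Stand-in for the opaque cublasLtMatmulAlgo_t.
    pub algo: [u64; 8],
    /// Workspace size in bytes, as reported by the heuristic.
    pub workspace_size: usize,
}

/// Returns the entry for `key` and whether it was a cache hit.
/// `query` represents the expensive heuristic call and runs only on a miss.
pub fn lookup_or_query(
    cache: &mut HashMap<HeuristicKey, HeuristicEntry>,
    key: HeuristicKey,
    query: impl FnOnce(&HeuristicKey) -> HeuristicEntry,
) -> (HeuristicEntry, bool) {
    if let Some(entry) = cache.get(&key) {
        return (*entry, true); // warm path: no library call
    }
    let entry = query(&key); // cold path: pay the query cost once
    cache.insert(key, entry);
    (entry, false)
}
```

With this shape, the first request for a given key pays for the heuristic query and every later request with an identical key is served from the map.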

Key design points:

LRU eviction (capacity defaults to 256 entries: large enough to cover a model’s full shape repertoire, small enough to stay within a few tens of KiB of host RAM).
The cache lives in a parking_lot::Mutex<lru::LruCache> behind an Arc, so a cloneable HeuristicCacheRef can flow into per-message BlasLtDispatchCtx without Send headaches.
We store the raw cublasLtMatmulAlgo_t plus a workspace_size hint; the actor’s WorkspacePool uses the workspace size to recycle the right slot.
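
The design points above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the module's implementation: TinyLru is a hypothetical stand-in for lru::LruCache, std::sync::Mutex stands in for parking_lot::Mutex, and the generic HeuristicCacheRef<K, V> with its with_capacity, get, put, and len methods is invented for the example.

```rust
use std::collections::{HashMap, VecDeque};
use std::hash::Hash;
use std::sync::{Arc, Mutex};

/// Tiny LRU for illustration only (O(n) recency updates).
struct TinyLru<K, V> {
    capacity: usize,
    map: HashMap<K, V>,
    order: VecDeque<K>, // front = least recently used
}

impl<K: Clone + Eq + Hash, V: Clone> TinyLru<K, V> {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { capacity, map: HashMap::new(), order: VecDeque::new() }
    }

    /// Marks `key` as the most recently used entry.
    fn touch(&mut self, key: &K) {
        if let Some(pos) = self.order.iter().position(|k| k == key) {
            self.order.remove(pos);
        }
        self.order.push_back(key.clone());
    }

    fn get(&mut self, key: &K) -> Option<V> {
        let value = self.map.get(key).cloned()?;
        self.touch(key);
        Some(value)
    }

    fn put(&mut self, key: K, value: V) {
        // Evict the least recently used entry when inserting a new key at capacity.
        if !self.map.contains_key(&key) && self.map.len() == self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.map.remove(&oldest);
            }
        }
        self.touch(&key);
        self.map.insert(key, value);
    }
}

/// Shareable handle: cloning copies the Arc, not the cache contents.
#[derive(Clone)]
pub struct HeuristicCacheRef<K, V> {
    inner: Arc<Mutex<TinyLru<K, V>>>,
}

impl<K: Clone + Eq + Hash, V: Clone> HeuristicCacheRef<K, V> {
    pub fn with_capacity(capacity: usize) -> Self {
        Self { inner: Arc::new(Mutex::new(TinyLru::new(capacity))) }
    }

    pub fn get(&self, key: &K) -> Option<V> {
        self.inner.lock().unwrap().get(key)
    }

    pub fn put(&self, key: K, value: V) {
        self.inner.lock().unwrap().put(key, value);
    }

    pub fn len(&self) -> usize {
        self.inner.lock().unwrap().map.len()
    }
}
```

Because the handle is an Arc around a mutex-guarded cache, a clone can be moved into another thread or message context while all clones observe the same entries and the same eviction order.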

Structs§

HeuristicCacheRef: Shareable handle to the heuristic cache. Cheap to clone.
HeuristicEntry: Cached value — best algorithm by wall-time plus the workspace size the heuristic reported.
HeuristicKey: Cache key, fully self-describing so that two requests with the same shape/layout/dtype/epilogue/arch combination land in the same bucket.

Constants§

DEFAULT_HEURISTIC_CAPACITY: Default capacity of the heuristic cache.
DEFAULT_TOP_K: Default number of candidate algorithms (top-k) to request from cuBLASLt on each cold lookup. We keep the best by waves_count and discard the rest.
