Module heuristic

Heuristic-cache for cuBLASLt matmul algorithms.

cuBLASLt’s cublasLtMatmulAlgoGetHeuristic is a synchronous library call that takes single-digit milliseconds. For repeated shapes (every iteration of a transformer step) we cache the best-by-wall-time algorithm under (m, n, k, dtype, layout, epilogue, sm_arch) and reuse it.

Key design points:

  • LRU eviction (capacity defaults to 256 entries: large enough to cover a model’s full shape repertoire, small enough to fit in a few tens of KiB of host RAM, since each entry carries a 64-byte cublasLtMatmulAlgo_t plus its key).
  • The cache lives in a parking_lot::Mutex<lru::LruCache> behind an Arc, so a cloneable HeuristicCacheRef can flow into per-message BlasLtDispatchCtx without Send headaches.
  • We store the raw cublasLtMatmulAlgo_t plus a workspace_size hint; the actor’s WorkspacePool uses the workspace size to recycle the right slot.
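
The design points above can be sketched roughly as follows. This is a minimal, std-only illustration and not the module's actual code: std::sync::Mutex and a small hand-rolled LRU stand in for parking_lot::Mutex and lru::LruCache, the field types of HeuristicKey and HeuristicEntry are guesses, the [u64; 8] array is only a placeholder for the opaque cublasLtMatmulAlgo_t, and the get/insert method names are hypothetical.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::{Arc, Mutex};

/// Cache key. Field names follow the tuple in the module docs;
/// the concrete types are assumptions made for this sketch.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct HeuristicKey {
    pub m: u64,
    pub n: u64,
    pub k: u64,
    pub dtype: u8,     // hypothetical dtype tag
    pub layout: u8,    // hypothetical layout tag
    pub epilogue: u32, // hypothetical epilogue tag
    pub sm_arch: u32,  // e.g. 90 for sm_90
}

/// Cached value: the chosen algorithm plus its workspace-size hint.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct HeuristicEntry {
    pub algo: [u64; 8], // placeholder for the raw cublasLtMatmulAlgo_t
    pub workspace_size: usize,
}

/// Tiny LRU used only to keep the sketch dependency-free.
struct Lru {
    capacity: usize,
    map: HashMap<HeuristicKey, HeuristicEntry>,
    order: VecDeque<HeuristicKey>, // front = least recently used
}

impl Lru {
    /// Mark `key` as most recently used.
    fn touch(&mut self, key: &HeuristicKey) {
        if let Some(pos) = self.order.iter().position(|k| k == key) {
            self.order.remove(pos);
        }
        self.order.push_back(key.clone());
    }
}

/// Shareable handle; cloning only bumps the Arc refcount.
#[derive(Clone)]
pub struct HeuristicCacheRef {
    inner: Arc<Mutex<Lru>>,
}

impl HeuristicCacheRef {
    pub fn new(capacity: usize) -> Self {
        let lru = Lru {
            capacity: capacity.max(1),
            map: HashMap::new(),
            order: VecDeque::new(),
        };
        Self { inner: Arc::new(Mutex::new(lru)) }
    }

    /// Warm lookup: returns a copy of the entry and refreshes its recency.
    pub fn get(&self, key: &HeuristicKey) -> Option<HeuristicEntry> {
        let mut guard = self.inner.lock().unwrap();
        let lru = &mut *guard;
        let hit = lru.map.get(key).cloned();
        if hit.is_some() {
            lru.touch(key);
        }
        hit
    }

    /// Store the result of a cold lookup, evicting the least recently
    /// used entry when the cache is full.
    pub fn insert(&self, key: HeuristicKey, entry: HeuristicEntry) {
        let mut guard = self.inner.lock().unwrap();
        let lru = &mut *guard;
        if !lru.map.contains_key(&key) && lru.map.len() >= lru.capacity {
            if let Some(oldest) = lru.order.pop_front() {
                lru.map.remove(&oldest);
            }
        }
        lru.touch(&key);
        lru.map.insert(key, entry);
    }
}
```

A caller would try get first and, on a miss, run the cuBLASLt heuristic query and insert the winner. Because the handle is an Arc around a Mutex whose contents are plain data, it is Send and Sync, so a clone can be moved into each per-message context.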

Structs§

HeuristicCacheRef
Shareable handle to the heuristic cache. Cheap to clone.
HeuristicEntry
Cached value — best algorithm by wall-time plus the workspace size the heuristic reported.
HeuristicKey
Cache key. Fully self-describing, so two requests with the same shape/layout/dtype/epilogue/arch combination land in the same bucket.

Constants§

DEFAULT_HEURISTIC_CAPACITY
Default capacity of the heuristic cache.
DEFAULT_TOP_K
Default number of candidate algorithms (top-k) to request from cuBLASLt on each cold lookup. We keep the best by waves_count and discard the rest.