Releases: WorldFlowAI/semblend

v0.3.0 — Fuzzy Chunk Matching + Benchmark Framework

22 Mar 20:18

What's New

Fuzzy Chunk Matching

  • Confidence-gated fuzzy alignment recovers 100% KV reuse on shifted-prefix scenarios (vs 0% with exact-only matching)
  • PQ segment store for memory-efficient chunk comparison (32x compression, ~137MB at 100K donors)
  • Three-tier decision: fast_reuse / verified_reuse / recompute
  • Configurable per-chunk confidence scoring (overlap + positional coherence + position delta decay)
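The confidence gate above can be sketched as follows. This is a hypothetical illustration, not SemBlend's actual scoring code; the threshold values, function names, and the exponential decay form are all assumptions.

```python
import math

def chunk_confidence(overlap, coherence, position_delta, decay=0.1):
    """Combine token overlap, positional coherence, and a decay
    penalty on how far the chunk shifted (all terms illustrative)."""
    return overlap * coherence * math.exp(-decay * abs(position_delta))

def decide(confidence, fast_threshold=0.9, verify_threshold=0.6):
    """Map a per-chunk confidence score to one of the three tiers.
    Thresholds here are placeholders, not the shipped defaults."""
    if confidence >= fast_threshold:
        return "fast_reuse"      # reuse donor KV directly
    if confidence >= verify_threshold:
        return "verified_reuse"  # reuse, but verify before committing
    return "recompute"           # fall back to a full prefill
```

A perfectly aligned chunk (full overlap, zero position delta) scores 1.0 and takes the fast path; degraded matches fall through to verification or recomputation.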

Benchmark Framework

  • Paper reproduction suite (benchmarks/suite/reproduce.py)
  • Tiered validation runner (exact replay, cross-instruction, reorder, multi-turn, RAG template)
  • Pre-flight verification (GPU type, patched LMCache, fuzzy matching)
  • Bootstrap 95% CIs on all results
  • Log parser for ground-truth SemBlend hit detection
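A percentile bootstrap for the 95% CIs mentioned above looks roughly like this. This is a generic sketch of the technique, not the runner's actual implementation; the function name and resample count are assumptions.

```python
import random
import statistics

def bootstrap_ci(samples, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI on the mean: resample with
    replacement, take the mean of each resample, and read off
    the alpha/2 and 1 - alpha/2 quantiles."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=len(samples)))
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Reporting the interval alongside each hit rate or speedup guards against over-reading small per-dataset differences.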

Results (A100, Qwen2.5-7B-AWQ)

  • TriviaQA: 26.0% hit (paper: 24.8%)
  • Cross-instruction: 87-100% hit, 2.15-2.42x speedup
  • Fuzzy shifted-prefix: 100% hit, 2.25x speedup
  • Quality PPL: ≤1.007 (paper bound: ≤1.065)

SemBlend v0.2.0

21 Mar 15:56
New Integrations

  • TRT-LLM: SemBlendBackend, KV cache layout adapter, model engine hooks, SemanticCacheLookupProvider + PostPrefixLoadHook upstream ABCs, turnkey launcher (semblend-trtllm)
  • Dynamo KVBM: SemBlendKvIndexerWrapper, SemBlendEventPublisher, SemanticKvIndexer Rust crate implementing Dynamo's KvIndexerInterface
  • dynamo-semblend Rust crate: 16 tests, SIMD cosine search, embedding sidecar client

Embedding Improvements

  • Full-document parallel segmented embedding — 100% document coverage via overlapping 512-token windows with mean pooling
  • MiniLM GPU auto-detection — uses last available GPU to avoid contending with inference model
  • Removed sentence sorting — segmented mean pooling is inherently order-invariant (0.996 cosine for reordered docs). Sorting was fragile (broke on code/markdown) and hurt cross-instruction similarity.
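The windowing scheme above can be sketched as follows. This is a minimal illustration assuming a generic `embed(tokens) -> vector` function; the function names and the 256-token stride are assumptions, not SemBlend's API.

```python
import numpy as np

def segmented_embedding(tokens, embed, window=512, stride=256):
    """Embed overlapping fixed-size windows and mean-pool the
    results. Covers the full document, and because the pool is a
    mean, reordering whole segments barely changes the output."""
    if len(tokens) <= window:
        return embed(tokens)
    vecs = [embed(tokens[i:i + window])
            for i in range(0, len(tokens) - stride, stride)]
    return np.mean(vecs, axis=0)
```

With `window=512` and `stride=256`, consecutive windows overlap by half, so no token falls between segment boundaries.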

Benchmark Results (SGLang, Qwen2.5-7B-Instruct, A10G)

Dataset       v0.1.1 (with sorting)   v0.2.0 (no sorting)   Delta
TriviaQA      3.5% hit                22.6% hit             +19.1pp
SCBench       4.0% hit                13.6% hit             +9.6pp
WikiText103   10.0% hit               15.7% hit             +5.7pp
LongEval      9.7% hit                15.2% hit             +5.5pp
NarrativeQA   16.7% hit               17.4% hit             +0.7pp

On-hit speedups: LongEval avg 10.36x (max 27.88x), NarrativeQA avg 2.03x, TriviaQA avg 1.61x

Install

pip install semblend                # core
pip install semblend[vllm]          # + vLLM/LMCache
pip install semblend[sglang]        # + SGLang
pip install semblend[trtllm]        # + TRT-LLM
pip install semblend[embedder]      # + sentence-transformers

CacheBlend Note

Selective layer recomputation (CacheBlend) on vLLM requires the patched build from PR #37339.

Test Coverage

  • 117 Python tests (TRT-LLM: 54, Dynamo: 14, core/SGLang/vLLM: 49)
  • 16 Rust tests (dynamo-semblend)
  • 15 Rust tests (semrouter)