Releases: WorldFlowAI/semblend
v0.3.0 — Fuzzy Chunk Matching + Benchmark Framework
What's New
Fuzzy Chunk Matching
- Confidence-gated fuzzy alignment recovers 100% KV reuse on shifted-prefix scenarios (vs 0% with exact-only matching)
- PQ segment store for memory-efficient chunk comparison (32x compression, ~137MB at 100K donors)
- Three-tier decision: fast_reuse / verified_reuse / recompute
- Configurable per-chunk confidence scoring (overlap + positional coherence + position delta decay)
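The decision logic above can be sketched as follows. This is a minimal illustration, not SemBlend's actual implementation: the scoring formula and the thresholds (`fast_thresh`, `verify_thresh`) are hypothetical placeholders for the configurable values the release describes.

```python
def chunk_confidence(overlap: float, coherence: float, position_delta: int,
                     decay: float = 0.05) -> float:
    """Illustrative confidence score combining token overlap, positional
    coherence, and a penalty that decays with how far the chunk shifted
    from its original position."""
    return overlap * coherence * (1.0 / (1.0 + decay * position_delta))

def reuse_decision(conf: float, fast_thresh: float = 0.9,
                   verify_thresh: float = 0.6) -> str:
    """Map a per-chunk confidence to one of the three tiers."""
    if conf >= fast_thresh:
        return "fast_reuse"       # reuse donor KV directly
    if conf >= verify_thresh:
        return "verified_reuse"   # reuse, but with verification
    return "recompute"            # confidence too low; recompute the chunk
```

An exact match (full overlap, no shift) lands in `fast_reuse`; a partially shifted chunk degrades smoothly through `verified_reuse` down to `recompute`.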
Benchmark Framework
- Paper reproduction suite (benchmarks/suite/reproduce.py)
- Tiered validation runner (exact replay, cross-instruction, reorder, multi-turn, RAG template)
- Pre-flight verification (GPU type, patched LMCache, fuzzy matching)
- Bootstrap 95% CIs on all results
- Log parser for ground-truth SemBlend hit detection
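The bootstrap CIs reported on all results can be computed with a standard percentile bootstrap. A minimal sketch, not the suite's actual code (function name and defaults are illustrative):

```python
import random

def bootstrap_ci(samples, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `samples`.

    Resamples with replacement, computes the mean of each resample, and
    takes the alpha/2 and 1 - alpha/2 quantiles of the resampled means.
    """
    rng = random.Random(seed)
    n = len(samples)
    means = sorted(
        sum(rng.choice(samples) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

With `alpha=0.05` this yields the 95% interval used in the results tables.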
Results (A100, Qwen2.5-7B-AWQ)
- TriviaQA: 26.0% hit (paper: 24.8%)
- Cross-instruction: 87-100% hit, 2.15-2.42x speedup
- Fuzzy shifted-prefix: 100% hit, 2.25x speedup
- Quality PPL: ≤1.007 (paper bound: ≤1.065)
SemBlend v0.2.0
New Integrations
- TRT-LLM: SemBlendBackend, KV cache layout adapter, model engine hooks, SemanticCacheLookupProvider + PostPrefixLoadHook upstream ABCs, turnkey launcher (semblend-trtllm)
- Dynamo KVBM: SemBlendKvIndexerWrapper, SemBlendEventPublisher, SemanticKvIndexer Rust crate implementing Dynamo's KvIndexerInterface
- dynamo-semblend Rust crate: 16 tests, SIMD cosine search, embedding sidecar client
Embedding Improvements
- Full-document parallel segmented embedding — 100% document coverage via overlapping 512-token windows with mean pooling
- MiniLM GPU auto-detection — uses last available GPU to avoid contending with inference model
- Removed sentence sorting — segmented mean pooling is inherently order-invariant (0.996 cosine for reordered docs). Sorting was fragile (broke on code/markdown) and hurt cross-instruction similarity.
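The segmented embedding scheme can be sketched as below. This is a simplified illustration under stated assumptions: `embed_fn` stands in for the MiniLM encoder, and the window/stride values mirror the 512-token overlapping windows described above; the real pipeline's API differs.

```python
import numpy as np

def embed_document(tokens, embed_fn, window=512, stride=256):
    """Embed a document by mean-pooling overlapping token windows.

    Each window is embedded independently (parallelizable), then the
    segment vectors are averaged and L2-normalized. Mean pooling over
    segments is order-invariant at the segment level, which is why
    sentence sorting is unnecessary.
    """
    if len(tokens) <= window:
        return embed_fn(tokens)
    # Overlapping windows; the trailing window may be shorter than `window`.
    segments = [tokens[i:i + window] for i in range(0, len(tokens), stride)]
    vecs = np.stack([embed_fn(seg) for seg in segments])
    pooled = vecs.mean(axis=0)
    return pooled / np.linalg.norm(pooled)
```

Because every token falls inside at least one window, coverage is 100% regardless of document length.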
Benchmark Results (SGLang, Qwen2.5-7B-Instruct, A10G)
| Dataset | v0.1.1 (with sorting) | v0.2.0 (no sorting) | Delta |
|---|---|---|---|
| TriviaQA | 3.5% hit | 22.6% hit | +19.1pp |
| SCBench | 4.0% | 13.6% | +9.6pp |
| WikiText103 | 10.0% | 15.7% | +5.7pp |
| LongEval | 9.7% | 15.2% | +5.5pp |
| NarrativeQA | 16.7% | 17.4% | +0.7pp |
On-hit speedups: LongEval avg 10.36x (max 27.88x), NarrativeQA avg 2.03x, TriviaQA avg 1.61x
Install
pip install semblend # core
pip install semblend[vllm] # + vLLM/LMCache
pip install semblend[sglang] # + SGLang
pip install semblend[trtllm] # + TRT-LLM
pip install semblend[embedder] # + sentence-transformers
CacheBlend Note
For selective layer recomputation (CacheBlend), vLLM requires PR #37339.
Test Coverage
- 117 Python tests (TRT-LLM: 54, Dynamo: 14, core/SGLang/vLLM: 49)
- 16 Rust tests (dynamo-semblend)
- 15 Rust tests (semrouter)