Pet project and research playground for LLM decoding acceleration.
This repository compares exact and approximate decoding strategies under a single benchmark harness, with reproducible configs, JSONL outputs, and report scripts.
- Side-by-side comparison of Baseline, Speculative Sampling, AutoJudge, Top-K, and SpecExec.
- Paper-aligned AutoJudge implementation (GSM8K label mining + LogisticRegression calibration).
- Long-run friendly workflow (resume keys, checkpoints, strict result schema validation).
- Real benchmark reports are versioned in `reports/`.
| Method | Exact vs target distribution | Main idea |
|---|---|---|
| `baseline` | exact | Target-only decoding |
| `speculative` | exact | Draft proposes, target verifies |
| `autojudge` | approximate | Judge can accept some mismatches |
| `topk` | approximate | Accept mismatch if target token in top-k |
| `specexec` | exact | Parallel speculative branches + cache reuse |
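For intuition on why `speculative` remains exact, here is a minimal sketch of the standard speculative-sampling accept/resample rule; the function names and array shapes are illustrative, not this repository's actual API:

```python
import numpy as np

def accept_draft_token(p_target: np.ndarray, p_draft: np.ndarray,
                       token: int, rng: np.random.Generator) -> bool:
    """Accept the draft's token with probability min(1, p_target/p_draft).

    This is the classic speculative-sampling acceptance test; combined
    with residual resampling below, the output distribution matches the
    target model exactly.
    """
    ratio = p_target[token] / max(p_draft[token], 1e-12)
    return rng.random() < min(1.0, ratio)

def resample_on_reject(p_target: np.ndarray, p_draft: np.ndarray) -> np.ndarray:
    """On rejection, sample from the residual max(0, p_target - p_draft)."""
    residual = np.clip(p_target - p_draft, 0.0, None)
    return residual / residual.sum()

rng = np.random.default_rng(0)
p_t = np.array([0.6, 0.3, 0.1])
p_d = np.array([0.2, 0.7, 0.1])
# Token 0: the target likes it at least as much as the draft, so it is
# always accepted (min(1, 0.6/0.2) = 1).
assert accept_draft_token(p_t, p_d, 0, rng)
```

The approximate methods in the table relax exactly this test: `autojudge` lets a learned judge pass some rejected tokens through, and `topk` accepts a mismatch whenever the draft token appears in the target's top-k.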
Latest full Llama run: 2026-03-28-llama-48h-cgrid8 on RTX 5090.
Source reports:

- `reports/yandex_llama3_8b_3b_2026-03-28-llama-48h-cgrid8-gsm8k.md`
- `reports/yandex_llama3_8b_3b_2026-03-28-llama-48h-cgrid8-livecodebench.md`
GSM8K highlights (k=4):
| Method | Accuracy (%) | Speed (tok/s) |
|---|---|---|
| Baseline | 70.89 | 72.68 |
| Speculative | 71.89 | 40.68 |
| AutoJudge (t=0.14) | 78.67 | 45.98 |
| Top-K (all) | 75.67 | 59.29 |
LiveCodeBench highlights (throughput only):
| Method | Speed (tok/s) |
|---|---|
| Baseline | 71.52 |
| Speculative | 34.80 |
| AutoJudge (t=1.0) | 29.27 |
| Top-K (all) | 36.53 |
More context and historical runs: `docs/RESULTS.md`.
```bash
make setup
make check
make test
make bench-toy OUT=/tmp/bench_toy.jsonl
```

Optional tiny HF smoke:

```bash
make smoke-hf OUT=/tmp/smoke_hf.jsonl
```

Paper-style Qwen sweep:

```bash
make paper-eval
```

Local Qwen 7B/1.5B sweep:

```bash
make local-eval
```

Local Llama 8B/3B sweep:

```bash
bash scripts/run_llama3_8b_3b_eval.sh
```

Validate any JSONL output:

```bash
.venv/bin/python scripts/validate_results_jsonl.py --path datasets/results.jsonl --strict
```

Project layout:

- `sp_samp/`: core implementations and HF adapters
- `benchmarks/`: benchmark entrypoint and result logging
- `configs/`: model, method, and experiment presets
- `scripts/`: orchestration, validation, and report generation
- `tests/`: unit tests
- `reports/`: tracked benchmark artifacts
- `datasets/`: local datasets and run outputs (gitignored)
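The shape of a strict JSONL check can be sketched in a few lines of Python. The required field names below are assumptions for illustration only; the real schema lives in `scripts/validate_results_jsonl.py`:

```python
import json

# Hypothetical required fields -- the repo's actual schema may differ.
REQUIRED = {"resume_key", "method", "tokens_per_s"}

def validate_line(line: str, strict: bool = True) -> bool:
    """Return True if the record has all required fields.

    In strict mode, missing fields raise instead of returning False,
    mirroring what a --strict flag typically does.
    """
    rec = json.loads(line)  # raises on malformed JSON
    missing = REQUIRED - rec.keys()
    if missing and strict:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return not missing

assert validate_line('{"resume_key": "a", "method": "baseline", "tokens_per_s": 72.7}')
```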
- Draft and target must use tokenizer-compatible vocab mapping.
- AutoJudge paper C-grid policy is `1e-7..1e0` (8 values).
- Reusing the same output file enables automatic resume by `resume_key`.
- Contribution guide: `CONTRIBUTING.md`
- Open issues and feature proposals: GitHub issue templates
- Current priorities: `docs/ROADMAP.md`
- Repository presentation checklist: `docs/GITHUB_SETUP.md`
MIT. See LICENSE.