Code for the paper "SharQ: Bridging Activation Sparsity and FP4 Quantization for LLM Inference"
SharQ is a Blackwell-oriented LLM quantization repo built around one idea: route the most important activations through a sparse FP4 main path, then use a dense FP4 residual path to recover the error introduced by sparsification and quantization.
The repo currently provides three practical modes:
- `NVFP4`: dense FP4 baseline
- `SHARQ`: fused sparse-residual FP4 kernel path
- `SHARQ_SIM`: pure-PyTorch fake-quantized, fake-sparse simulation for accuracy-only reference
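In the spirit of the `SHARQ_SIM` mode, the sparse-main / dense-residual idea can be sketched in plain PyTorch. This is a hypothetical illustration only: the per-tensor scaling, the E2M1 grid rounding, and the top-k magnitude selection below are assumptions for exposition, not the repo's actual quantizers or kernels.

```python
import torch

torch.manual_seed(0)

# The eight non-negative magnitudes representable in FP4 (E2M1).
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_fp4(x: torch.Tensor) -> torch.Tensor:
    """Fake-quantize: scale per tensor, snap each magnitude to the nearest FP4 grid point."""
    scale = x.abs().max().clamp(min=1e-8) / FP4_GRID[-1]
    s = x / scale
    idx = (s.abs().unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    return torch.sign(s) * FP4_GRID[idx] * scale

def sharq_sim(x: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """Sparse FP4 main path for the largest activations + dense FP4 residual."""
    k = max(1, int(keep_ratio * x.numel()))
    thresh = x.abs().flatten().topk(k).values.min()
    # Sparse main path: keep only the large-magnitude activations, quantize them.
    main = fake_fp4(torch.where(x.abs() >= thresh, x, torch.zeros_like(x)))
    # Dense residual path: quantize what the sparse path missed, at its own (smaller) scale.
    residual = fake_fp4(x - main)
    return main + residual

x = torch.randn(64, 64)
dense_err = (fake_fp4(x) - x).abs().mean().item()
sharq_err = (sharq_sim(x) - x).abs().mean().item()
```

Because the residual is quantized with its own, much smaller scale, the two-path reconstruction error is typically well below the dense FP4 baseline's, which is the loss-recovery effect the residual path is for.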
Clone the repo together with CUTLASS:
```bash
git clone --recurse-submodules https://github.com/actypedef/SharQ.git
cd SharQ
```

If you already cloned the repo without submodules:

```bash
git submodule update --init --recursive
```

Create and activate a conda environment named `sharq` with Python 3.10:

```bash
conda create -n sharq python=3.10 -y
conda activate sharq
```

Repository layout:

```
model/       model wrappers and evaluation entry point
kernels/     CUDA extension, CUTLASS sparse GEMM integration, low-level benchmarks
benchmarks/  correctness, perf, and ablation scripts
demo/        simple local chat demo
```
For kernel structure, build details, and low-level CUDA benchmarks, see kernels/README.md.
Requirements:

- Python 3.10
- PyTorch with CUDA support
- CUDA 12.8 recommended
- Blackwell `sm_120a` GPU for the real `SHARQ` kernel path
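A quick way to check whether the current GPU can run the fused kernel path is to read its compute capability. This is an illustrative snippet, not a repo utility; `SHARQ_SIM` works without a Blackwell GPU.

```python
import torch

def has_blackwell_sm120() -> bool:
    """True when the current CUDA device reports compute capability 12.0 (sm_120a)."""
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() == (12, 0)

print("fused SHARQ path available:", has_blackwell_sm120())
```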
Install dependencies:

```bash
pip install -r requirements.txt
conda install pybind11 -y
```

Build the CUDA extension:

```bash
cmake -S kernels -B kernels/build_cmake_sm120a \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
    -DPython3_EXECUTABLE=$(which python)
cmake --build kernels/build_cmake_sm120a --target sharq_ops -j
```

If you only want a numerical reference path, `SHARQ_SIM` does not require the CUDA extension.
Perplexity:
```bash
python model/main.py /path/to/model \
    --dataset wikitext2 \
    --eval_ppl \
    --quant_type SHARQ

# disable RMSNorm fusion for accuracy ablations
python model/main.py /path/to/model \
    --dataset wikitext2 \
    --eval_ppl \
    --quant_type SHARQ \
    --disable_rmsnorm_fusion
```

Simulation-only reference:

```bash
python model/main.py /path/to/model \
    --dataset wikitext2 \
    --eval_ppl \
    --quant_type SHARQ_SIM
```

Zero-shot lm-eval:

```bash
python model/main.py /path/to/model \
    --tasks piqa,arc_challenge,boolq,hellaswag,winogrande,lambada_openai,arc_easy \
    --lm_eval_num_fewshot 0 \
    --quant_type SHARQ
```

Or use:

```bash
bash evaluate.sh /path/to/model SHARQ
```

Single-turn example:
```bash
python demo/chat_qwen.py \
    --model ../Qwen2.5-7B-Instruct \
    --quant-type SHARQ \
    --prompt "Briefly explain SharQ." \
    --max-new-tokens 128
```

Interactive multi-turn example:

```bash
python demo/chat_qwen.py \
    --model ../Qwen2.5-7B-Instruct \
    --quant-type SHARQ
```

The benchmark scripts are grouped under `benchmarks/`.
See benchmarks/README.md for a short guide.
End-to-end prefill benchmark example:
```bash
python benchmarks/e2e/benchmark_prefill_e2e.py \
    --model /path/to/model \
    --quant-type SHARQ \
    --batch-size 1 \
    --seqlen 2048

# compare against the unfused RMSNorm path
python benchmarks/e2e/benchmark_prefill_e2e.py \
    --model /path/to/model \
    --quant-type SHARQ \
    --batch-size 1 \
    --seqlen 2048 \
    --disable-rmsnorm-fusion
```

The same script also supports BF16, NVFP4, and SHARQ_SIM for side-by-side prefill comparison.
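For ad-hoc measurements outside the provided scripts, a minimal latency loop follows the usual GPU-benchmarking pattern: warm up, synchronize, then time several iterations. This sketch is illustrative only; `bench`, the iteration counts, and the matmul workload are assumptions, not the benchmark script's internals.

```python
import time
import torch

def bench(fn, warmup: int = 3, iters: int = 10) -> float:
    """Mean latency of fn() in milliseconds, with warmup and CUDA synchronization."""
    for _ in range(warmup):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # don't start the clock with work still queued
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # make sure all timed work has finished
    return (time.perf_counter() - t0) / iters * 1e3

x = torch.randn(256, 256)
ms = bench(lambda: x @ x)
print(f"mean latency: {ms:.3f} ms")
```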
Model evaluation keeps extra fusion disabled for accuracy measurements, while the benchmark and demo paths enable both RMSNorm fusion and epilogue fusion automatically.
Supported model families:

- Llama
- Qwen2
- Mixtral