Support ByteDance-Seed/BAGEL-7B-MoT quantization in w4a16 format #1633

lvliang-intel wants to merge 9 commits into main from lvl/support_bagel_mot
Conversation
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
…upport_bagel_mot
for more information, see https://pre-commit.ci
Pull request overview
Adds BAGEL-7B-MoT (ByteDance-Seed/BAGEL-7B-MoT) support to AutoRound’s quantization flow, including custom model loading and metadata/ignore-layer handling needed for downstream runtimes (e.g., vLLM-Omni).
Changes:
- Introduces a custom BAGEL loader and routes BAGEL through the LLM compressor flow.
- Adds BAGEL-specific block selection/ignore-layer policies and extends “extra files” copying for BAGEL sub-configs.
- Improves robustness by handling `AutoConfig.from_pretrained(...)` failures for unsupported model types.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| auto_round/utils/model.py | Routes BAGEL loading, adjusts MLLM detection, and adds extra model files + quant-block hinting. |
| auto_round/utils/bagel_loader.py | New BAGEL custom loader/wrapper and save logic for vLLM-Omni compatibility. |
| auto_round/special_model_handler.py | Registers BAGEL multimodal blocks + BAGEL ignore-layer policy. |
| auto_round/modeling/unfused_moe/__init__.py | Makes config pre-check resilient to unsupported/unknown model types. |
| auto_round/compressors/base.py | Makes config loading resilient and adds support for model-provided quant-block hints. |
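For orientation, the BAGEL ignore-layer policy registered in `special_model_handler.py` amounts to excluding certain module names from quantization. A rough sketch follows; the regex strings and the `should_ignore` helper are illustrative assumptions, not the PR's exact entries:

```python
import re

# Hypothetical sketch of the BAGEL ignore policy: modules whose names match
# any pattern stay unquantized (FP16). Actual pattern strings may differ.
BAGEL_IGNORE_PATTERNS = [
    r".*moe_gen.*",                          # all generation-expert (moe_gen) modules
    r".*\.(q_proj|k_proj|v_proj|o_proj)$",   # shared attention projections
]

def should_ignore(module_name: str) -> bool:
    """Return True if the module should be excluded from quantization."""
    return any(re.fullmatch(p, module_name) for p in BAGEL_IGNORE_PATTERNS)
```

Downstream, these entries would typically be emitted into the exported config's ignore list so runtimes such as vLLM-Omni skip the same modules.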
auto_round/utils/bagel_loader.py (outdated)

```python
def load_bagel_model(model_path, torch_dtype="auto", device_map=None):
    """Load a BAGEL model for quantization.

    Args:
        model_path: Path to the BAGEL model directory.
        torch_dtype: Data type for model weights.
        device_map: Device map for model placement.

    Returns:
        Tuple of (model, tokenizer).
    """
    # Load configs
    config_path = os.path.join(model_path, "config.json")
    with open(config_path, "r", encoding="utf-8") as f:
        bagel_config_dict = json.load(f)
```
load_bagel_model() assumes model_path is a local directory and immediately opens os.path.join(model_path, "config.json"). However callers (e.g., mllm_load_model) may pass a HF repo id. Add a guard at the start to resolve repo ids to a local snapshot (e.g., if not os.path.isdir(model_path): model_path = download_or_get_path(model_path, ...)) so BAGEL loading works for both local and remote models.
@copilot apply changes based on this feedback
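A minimal sketch of the suggested guard (assuming `huggingface_hub` is available; in-repo, AutoRound's existing `download_or_get_path` helper would be the natural choice instead):

```python
import os

def resolve_model_path(model_path: str) -> str:
    """Return a local directory for model_path, downloading a snapshot for repo ids."""
    if os.path.isdir(model_path):
        return model_path  # already a local checkout
    # Remote HF repo id: fetch (or reuse a cached) snapshot.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=model_path)
```

`load_bagel_model()` could call this at the top so the subsequent `open(os.path.join(model_path, "config.json"))` always sees a local directory.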
```python
# BAGEL requires a custom loader (Qwen2 + custom extensions, not in transformers)
_config_path = (
    os.path.join(pretrained_model_name_or_path, "config.json")
    if os.path.isdir(pretrained_model_name_or_path)
    else None
)
if _config_path and os.path.exists(_config_path):
    with open(_config_path) as _f:
        _mt = json.load(_f).get("model_type")
    if _mt == "bagel":
        from auto_round.utils.bagel_loader import load_bagel_model

        model, tokenizer = load_bagel_model(
            pretrained_model_name_or_path,
            torch_dtype=torch_dtype,
        )
        model = _to_model_dtype(model, model_dtype)
```
This PR adds new BAGEL-specific branching in llm_load_model / mllm_load_model and new multimodal-detection behavior, but there are no corresponding unit tests under test/ to cover (a) BAGEL being treated as LLM-only by is_mllm_model for both local and remote paths, and (b) BAGEL routing to the custom loader. The test suite already covers similar branching for other model types (e.g., GLM image), so adding focused tests here would prevent regressions.
@copilot apply changes based on this feedback
```python
# BAGEL requires a custom loader (Qwen2 + custom extensions, not in transformers)
_config_path = (
    os.path.join(pretrained_model_name_or_path, "config.json")
    if os.path.isdir(pretrained_model_name_or_path)
    else None
)
if _config_path and os.path.exists(_config_path):
    with open(_config_path) as _f:
        _mt = json.load(_f).get("model_type")
    if _mt == "bagel":
        from auto_round.utils.bagel_loader import load_bagel_model

        model, tokenizer = load_bagel_model(
            pretrained_model_name_or_path,
            torch_dtype=torch_dtype,
        )
        model = _to_model_dtype(model, model_dtype)
```
BAGEL routing here only triggers when pretrained_model_name_or_path is a local directory (checks os.path.isdir + reads local config.json). If the user passes a HF repo id (the common AutoRound flow), this branch is skipped and AutoModelForCausalLM.from_pretrained() will still be attempted, which is expected to fail for model_type=bagel. Consider detecting BAGEL for remote repos too (e.g., hf_hub_download config.json or download_or_get_path + read config.json) and then call load_bagel_model with the resolved local snapshot path.
@copilot apply changes based on this feedback
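One way to sketch that detection so it covers both local directories and remote repo ids (using `huggingface_hub` directly here; the repo's own `download_or_get_path` would be the in-tree equivalent):

```python
import json
import os

def read_model_type(name_or_path: str):
    """Fetch model_type from config.json for a local dir or a HF repo id."""
    if os.path.isdir(name_or_path):
        cfg = os.path.join(name_or_path, "config.json")
        if not os.path.exists(cfg):
            return None
    else:
        # Remote repo id: download only the config file, not the full snapshot.
        from huggingface_hub import hf_hub_download
        cfg = hf_hub_download(name_or_path, "config.json")
    with open(cfg) as f:
        return json.load(f).get("model_type")
```

With this, the BAGEL branch can trigger on `read_model_type(...) == "bagel"` before `AutoModelForCausalLM.from_pretrained()` is attempted.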
```python
model_path = model_or_path if isinstance(model_or_path, str) else model_or_path.name_or_path

# Check model_type exclusion: some models have multimodal components
# but should be quantized as LLM-only (e.g., BAGEL).
_model_type = None
if isinstance(model_or_path, torch.nn.Module) and hasattr(model_or_path, "config"):
    _model_type = getattr(model_or_path.config, "model_type", None)
elif isinstance(model_path, str) and os.path.isdir(model_path):
    _cfg_path = os.path.join(model_path, "config.json")
    if os.path.exists(_cfg_path):
        with open(_cfg_path) as _f:
            _model_type = json.load(_f).get("model_type")
if _model_type in _LLM_ONLY_MODEL_TYPES:
    return False
# For dummy model, model_path could be "".
if model_path and not os.path.isdir(model_path):
```
is_mllm_model() checks _LLM_ONLY_MODEL_TYPES (e.g., bagel) only before download_or_get_path() runs. For a remote HF repo id, _model_type stays None at that point, the model is downloaded, and the function then proceeds to detect multimodal artifacts (e.g., preprocessor_config.json) and will incorrectly return True for BAGEL. Move the model_type check to after the potential download (or re-check once model_path is resolved) so BAGEL is consistently treated as LLM-only for both local and remote inputs.
@copilot apply changes based on this feedback
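The fix can be sketched as resolving the path first and only then applying the LLM-only exclusion; this simplified stand-in assumes the path is already local (in the real code a remote repo id would be downloaded before this point):

```python
import json
import os

_LLM_ONLY_MODEL_TYPES = {"bagel"}

def _local_model_type(local_dir: str):
    """Read model_type from a resolved local checkout's config.json."""
    cfg = os.path.join(local_dir, "config.json")
    if not os.path.exists(cfg):
        return None
    with open(cfg) as f:
        return json.load(f).get("model_type")

def is_mllm_model_sketch(local_dir: str) -> bool:
    """Simplified is_mllm_model: the exclusion check runs on the resolved path."""
    if _local_model_type(local_dir) in _LLM_ONLY_MODEL_TYPES:
        return False
    # Only then fall through to multimodal-artifact detection.
    return os.path.exists(os.path.join(local_dir, "preprocessor_config.json"))
```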
Quantize Script

Run inference with vLLM-Omni (with patch for the BAGEL MoT model):

```shell
CUDA_VISIBLE_DEVICES=0 python run_bagel.py --model /mnt/disk4/lvl/BAGEL-7B-MoT/ --prompt "A cute cat sitting on a windowsill" --output orginal_bagel_model_output.png
CUDA_VISIBLE_DEVICES=0 python run_bagel.py --model /mnt/disk4/lvl/BAGEL-7B-MoT-W4A16/ --prompt "A cute cat sitting on a windowsill" --output quantized_bagel_model_output.png
```
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
…-round into lvl/support_bagel_mot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Agent-Logs-Url: https://github.com/intel/auto-round/sessions/57e2e340-88e0-42e8-9528-a24ad1bc7d61
Co-authored-by: lvliang-intel <104267837+lvliang-intel@users.noreply.github.com>
How about upstreaming the model once it’s supported (assuming the license allows it)? There’s no need to wait for the PR to be merged.


Description
This PR adds proper BAGEL model quantization support to the standard AutoRound LLM quantization flow and fixes the exported quantization metadata required by downstream vLLM-Omni loading.
Main changes:
(1) Route BAGEL through the LLM compressor.
(2) Load BAGEL with a dedicated custom loader because transformers does not natively recognize the bagel architecture.
(3) Gracefully handle AutoConfig.from_pretrained failures for unsupported model types such as bagel.
(4) Export the correct block_name_to_quantize metadata so downstream runtimes only quantize BAGEL LLM blocks instead of non-LLM modules like connector or vision components.
(5) Add a BAGEL-specific ignore policy to preserve image-generation-sensitive modules in FP16:
a. all moe_gen modules
b. shared attention projections (q_proj, k_proj, v_proj, o_proj)
(6) Fix `save_pretrained()` in the BAGEL custom loader to use `state_dict()` instead of `named_parameters()`, ensuring registered buffers (e.g., rotary embedding caches) are included in the saved `model.safetensors` for correct reload and inference.

Type of Change
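The rationale for point (6): registered buffers appear in `state_dict()` but not in `named_parameters()`, so saving from the latter silently drops them. A minimal torch demonstration with a toy module (not BAGEL's actual classes):

```python
import torch
from torch import nn

class TinyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)
        # Registered buffer, analogous to rotary embedding caches in BAGEL.
        self.register_buffer("cos_cached", torch.ones(4))

m = TinyBlock()
param_keys = {name for name, _ in m.named_parameters()}
state_keys = set(m.state_dict().keys())
assert "cos_cached" in state_keys       # buffers are captured by state_dict()
assert "cos_cached" not in param_keys   # ...but missed by named_parameters()
```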
Related Issues
#1608
Checklist Before Submitting