new architecture for auto_round#1542

Open
n1ck-guo wants to merge 39 commits into main from hengguo/new_ar_arch

Conversation

@n1ck-guo
Contributor

@n1ck-guo n1ck-guo commented Mar 13, 2026

Description

  • Compressor:
    Main entry point responsible for orchestrating the workflow, invoking different algorithms, and handling model persistence. Supports block-wise or layer-wise quantization strategies. Primary subclasses include TuneCompressor and ZeroShotCompressor.
  • Calibration: Handles the calibration process (Work in Progress)
  • Context: Manages shared configurations and model states throughout the quantization pipeline, providing centralized control to prevent cross-module dependencies
    • ModelContext: Handles model loading and tracks model states and relevant configurations
    • CompressContext: Stores shared compression settings such as low_cpu_mem_usage, enable_torch_compile, etc.
  • Algorithms: Concrete quantization and weight transformation implementations
    • Quantization: Various quantization algorithms, including AutoRound, RTN, OptRTN, etc.
    • Transform: Weight transformation algorithms such as Hadamard transform

Usage of the new API:

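from auto_round.algorithms.rotation import HadamardConfig
# AutoRoundConfig and Compressor are added by this PR; their import paths are not shown in this description.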

quant_cfg  = AutoRoundConfig(bits=4, group_size=128, iters=200)
had_cfg_1  = HadamardConfig(hadamard_type="hadamard",        block_size=32)
had_cfg_2  = HadamardConfig(hadamard_type="random_hadamard", block_size=64, random_seed=True)

compressor = Compressor(
    config=[quant_cfg, had_cfg_1, had_cfg_2], 
    model="facebook/opt-125m",
    scheme="MXFP4",
    format="auto_round",
)

model, layer_config = compressor.quantize_and_save(
    output_dir="./output",
)

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Fixes or relates to #

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

Signed-off-by: n1ck-guo <heng.guo@intel.com>
Contributor

Copilot AI left a comment

Pull request overview

Refactors AutoRound toward a new “context + compressor + algorithm” architecture, introducing new compressors_new/ and context/ modules and updating scheme parsing/export helpers to support the new flow.

Changes:

  • Added new context singletons (ModelContext, CompressContext) and a new compressors_new implementation path.
  • Expanded scheme parsing to reconcile bits/data_type and support user overrides + AutoScheme integration.
  • Added new calibration utilities and algorithm scaffolding for quantization backends (AutoRound/RTN).
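
The "context singleton" idea in the first bullet can be illustrated with a minimal sketch. This is not the PR's implementation: `BaseContextSketch`, `CompressContextSketch`, `create()`, and `get()` are hypothetical names chosen for illustration, and only the two settings named in the description are modeled.

```python
from dataclasses import dataclass
from typing import ClassVar


class BaseContextSketch:
    """Hypothetical base class: stores one shared instance per subclass."""

    _instances: ClassVar[dict] = {}

    @classmethod
    def create(cls, **settings):
        # Build the shared instance once, typically at the API entry point.
        instance = cls(**settings)
        BaseContextSketch._instances[cls] = instance
        return instance

    @classmethod
    def get(cls):
        # Any module can fetch the shared instance without passing it around.
        if cls not in BaseContextSketch._instances:
            raise RuntimeError(f"{cls.__name__}.create() has not been called")
        return BaseContextSketch._instances[cls]


@dataclass
class CompressContextSketch(BaseContextSketch):
    # Settings named in the PR description; defaults here are assumptions.
    low_cpu_mem_usage: bool = False
    enable_torch_compile: bool = False


ctx = CompressContextSketch.create(low_cpu_mem_usage=True)
print(CompressContextSketch.get() is ctx)
```

Keeping creation explicit (`create()`) and lookup separate (`get()`) avoids re-running `__init__` on every access, which is a common pitfall of `__new__`-based singletons.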

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 18 comments.

| File | Description |
| --- | --- |
| auto_round/utils/model.py | Avoids runtime import cycles via TYPE_CHECKING for QuantizationScheme. |
| auto_round/schemes.py | Adds scheme override + parsing helpers and bits/dtype reconciliation. |
| auto_round/formats.py | Switches divisibility checks to global supported-layer constants. |
| auto_round/context/model_context.py | Introduces model lifecycle/loading + AMP setup and forward-hook management. |
| auto_round/context/compress_context.py | Introduces device/device_map and memory-usage knobs as shared context. |
| auto_round/context/base.py | Adds simple singleton context base. |
| auto_round/context/__init__.py | Package init for new context module. |
| auto_round/compressors_new/utils.py | New utility module (layer config, gguf mapping, caching helpers, forward helpers). |
| auto_round/compressors_new/shard_writer.py | New shard-based saver with optional safetensors support. |
| auto_round/compressors_new/config.py | Introduces extra/legacy config dataclasses for the new compressor path. |
| auto_round/compressors_new/base.py | New “BaseCompressor” implementation wiring contexts, formats, caching, quant loop. |
| auto_round/compressors_new/__init__.py | Package init for compressors_new. |
| auto_round/compressors/utils.py | Extends legacy layer-config resolution to include safetensors-only tensors and skip missing modules. |
| auto_round/calibration/utils.py | Adds helpers for “early stop” caching and input reshaping for block tuning. |
| auto_round/calibration/__init__.py | Package init for calibration. |
| auto_round/algorithms/quantization/rtn/rtn.py | Adds placeholder RTN quantization module file. |
| auto_round/algorithms/quantization/rtn/config.py | Adds RTN algorithm config stub. |
| auto_round/algorithms/quantization/rtn/__init__.py | Package init for RTN quantization. |
| auto_round/algorithms/quantization/base.py | Adds base quantization class stub. |
| auto_round/algorithms/quantization/auto_round/quantize.py | Adds new AutoRound quantizer implementation (algorithm object). |
| auto_round/algorithms/quantization/auto_round/config.py | Adds new AutoRound algorithm config. |
| auto_round/algorithms/quantization/auto_round/__init__.py | Package init for AutoRound quantization algorithm. |
| auto_round/algorithms/quantization/__init__.py | Package init for quantization algorithms. |
| auto_round/algorithms/base.py | Adds base algorithm stub. |
| auto_round/algorithms/alg_config.py | Adds base algorithm config stub. |
| auto_round/algorithms/__init__.py | Package init for algorithms. |
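
The TYPE_CHECKING change listed for auto_round/utils/model.py relies on a standard Python pattern, sketched below. The import path `auto_round.schemes` is inferred from the file list and is an assumption, as is the hypothetical helper `describe_scheme`; the snippet runs without auto_round installed because the guarded import is never executed at runtime.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by static type checkers. At runtime TYPE_CHECKING is False,
    # so this import never executes and cannot trigger an import cycle.
    from auto_round.schemes import QuantizationScheme  # assumed module path


def describe_scheme(scheme: "QuantizationScheme") -> str:
    """Hypothetical helper: the type appears only as a string annotation."""
    return f"scheme with {scheme.bits} bits"
```

The annotation is written as a string so that the name does not need to exist when the function is defined.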

@wenhuach21
Contributor

If there is already an algorithm folder, what is the purpose of the compressor folder?

@n1ck-guo n1ck-guo requested review from WeiweiZhang1 and yiliu30 and removed request for xin3he March 13, 2026 05:31
import torch


class ExtraConfig:
Contributor

ExtraConfig is a monolithic catch-all config class.
ExtraConfig bundles tuning, scheme, MLLM, and diffusion settings into a single class — the opposite of llm-compressor's approach where each modifier owns its own typed config. This "one object owns everything" pattern makes it harder to add new algorithms independently and is a carryover from the old monolithic design rather than a step toward the intended modular architecture.
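
As a rough illustration of the alternative this comment points to, each algorithm could own a small typed config. The class names below are hypothetical and the fields are borrowed from the usage example in the PR description; this is a sketch, not llm-compressor's or this PR's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoRoundConfigSketch:
    # Only the settings the AutoRound algorithm itself needs.
    bits: int = 4
    group_size: int = 128
    iters: int = 200


@dataclass(frozen=True)
class HadamardConfigSketch:
    # Only the settings the rotation needs; no tuning fields leak in.
    hadamard_type: str = "hadamard"
    block_size: int = 32


quant_cfg = AutoRoundConfigSketch(iters=200)
had_cfg = HadamardConfigSketch(block_size=64)

# A setting that belongs to another algorithm is rejected at construction.
try:
    HadamardConfigSketch(iters=200)
except TypeError as err:
    print(f"rejected: {err}")
```

With this layout, adding a new algorithm means adding a new config class rather than widening a shared one.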

Contributor

Despite this PR's goal of separating concerns into Context/Algorithm/Compressor, BaseCompressor still owns everything: config parsing, calibration data collection, forward hook management, quantization loop control, and model saving. By contrast, llm-compressor distributes these responsibilities across dedicated Pipeline (calibration), Modifier (algorithm logic), Session (lifecycle orchestration), and entrypoint (API) layers. The refactor restructures the file layout without achieving real decoupling.

@chensuyue chensuyue added this to the 0.12.0 milestone Mar 16, 2026
n1ck-guo and others added 3 commits March 17, 2026 17:02
Signed-off-by: n1ck-guo <heng.guo@intel.com>
Signed-off-by: n1ck-guo <heng.guo@intel.com>
set_module(self.model, name, m)
tuning_device = m.tuning_device if hasattr(m, "tuning_device") else self.compress_context.device
# Step 1: let gguf merge layers or rename module first and we will handle the RTN is gguf specific logic
if self.compress_context.is_immediate_packing and self.compress_context.formats[0].is_gguf():
Contributor

Better to decouple formats from algorithms.

Signed-off-by: n1ck-guo <heng.guo@intel.com>
Signed-off-by: n1ck-guo <heng.guo@intel.com>
Signed-off-by: n1ck-guo <heng.guo@intel.com>
…nsor helper, safetensor_only_matched, dispatch None guard, extend ignore_layers

Signed-off-by: n1ck-guo <heng.guo@intel.com>
Signed-off-by: n1ck-guo <heng.guo@intel.com>
@n1ck-guo n1ck-guo requested a review from yiliu30 March 31, 2026 05:16
Signed-off-by: n1ck-guo <heng.guo@intel.com>
Signed-off-by: n1ck-guo <heng.guo@intel.com>
Signed-off-by: n1ck-guo <heng.guo@intel.com>
The output tensor of the block.
"""

output = defaultdict(list)
Contributor

Better to wrap this in a global function, like get_block_outputs(index, save_output), so the developer does not need to care about the model type or the cache device.

)
from auto_round.logger import logger
from auto_round.modeling.fused_moe.replace_modules import materialize_model_
from auto_round.sign_sgd import SignSGD
Contributor

Move sign_sgd to this folder.



class ARQuantizer(BaseQuantizers):
is_adam: bool = False
Contributor

It would be better to move is_adam; an algorithm should only be responsible for its own logic.

)
# Call this before quantization and after applying the block wrapper.
if self.config.is_nv_fp: # enable qkv and moe structure global_scale fuse.
from auto_round.data_type.utils import update_fused_layer_global_scales
Contributor

Is there a way to move this somewhere else?

)

if self.compress_context.low_gpu_mem_usage:
clear_memory(device_list=self.compress_context.device_list) # clear cached memory during training
Contributor

How about writing a context to handle the wrapping, unwrapping, and memory clearing?
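
One possible reading of this suggestion is a Python context manager that wraps a block on entry and always unwraps it and clears memory on exit. The sketch below assumes that reading; `wrapped_block` and its callback parameters are hypothetical names, and the callbacks stand in for the PR's actual wrapper, unwrapper, and clear_memory helpers.

```python
from contextlib import contextmanager


@contextmanager
def wrapped_block(block, wrap, unwrap, clear_memory=None):
    """Hypothetical helper: wrap on entry, unwrap and clean up on exit."""
    wrapped = wrap(block)
    try:
        yield wrapped
    finally:
        # Runs even if tuning raises, so the block is never left wrapped.
        unwrap(wrapped)
        if clear_memory is not None:
            clear_memory()


events = []
with wrapped_block(
    "block_0",
    wrap=lambda b: events.append("wrap") or b,
    unwrap=lambda b: events.append("unwrap"),
    clear_memory=lambda: events.append("clear"),
):
    events.append("tune")

print(events)
```

The algorithm code inside the `with` body then only contains tuning logic, while setup and teardown live in one place.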

Returns:
dict: Empty dict (zero-shot RTN has no tunable parameters to return).
"""
shard_writer = ShardWriter.get_shard_writer()
Contributor

Better to decouple ShardWriter.

if lm_head_name is not None:
tied_weights_layers.append(lm_head_name)

materialize_model_(block)
Contributor

All of the above should be moved elsewhere and do not require algorithm development to implement.

m.to("meta")

# Move remaining GPU tensors to CPU; offload to disk if low_cpu_mem_usage.
if not self.compress_context.is_immediate_saving:
Contributor

same issue

Signed-off-by: n1ck-guo <heng.guo@intel.com>
**kwargs,
):
self.quantize_config = None
self.rotation_configs: list[BaseRotationConfig] = []
Contributor

Applying two or more rotation configs sequentially to the same model is not supported. We can support layer-wise rotation configs (this is not related to this PR). So I think we should support only one rotation config on this line.
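
A minimal sketch of how the constructor could enforce this restriction is shown below. The marker classes and `split_algorithm_configs` are hypothetical stand-ins; the PR's own base classes (for example BaseRotationConfig) may look different.

```python
class QuantConfigSketch:
    """Hypothetical marker base for quantization configs."""


class RotationConfigSketch:
    """Hypothetical marker base for rotation configs."""


def split_algorithm_configs(configs):
    """Return (quant_config, rotation_config), allowing at most one rotation."""
    quant = [c for c in configs if isinstance(c, QuantConfigSketch)]
    rotations = [c for c in configs if isinstance(c, RotationConfigSketch)]
    if len(rotations) > 1:
        raise ValueError(
            f"expected at most one rotation config, got {len(rotations)}"
        )
    quant_config = quant[0] if quant else None
    rotation_config = rotations[0] if rotations else None
    return quant_config, rotation_config


q, r = split_algorithm_configs([QuantConfigSketch(), RotationConfigSketch()])
print(type(q).__name__, type(r).__name__)
```

Failing early with a clear message keeps the unsupported combination from reaching the quantization loop.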


def __init__(
self,
layer_config: dict[str, Union[str, dict]] = None,
Contributor

Please move layer_config and data_config(nsamples,seqlen) to the API (i.e., the compressor in this PR). As discussed offline, compressor should be renamed to AutoRound. Algorithm-specific kwargs may override data_config, but should not override layer_config.

Contributor

AutoRound(alg_config=[xxx], nsamples=512, layer_config=xxx)
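
The override rule described in this thread (algorithm kwargs may override data_config, but not layer_config) could be expressed roughly as follows. `resolve_settings` is a hypothetical helper, and treating nsamples and seqlen as the only data settings is an assumption taken from the comment above.

```python
DATA_KEYS = ("nsamples", "seqlen")  # assumed set of data_config fields


def resolve_settings(api_data_config, api_layer_config, alg_kwargs):
    """Merge API-level settings with algorithm-specific kwargs."""
    if "layer_config" in alg_kwargs:
        # layer_config belongs to the top-level API only.
        raise ValueError("layer_config cannot be overridden per algorithm")
    data_config = dict(api_data_config)
    for key in DATA_KEYS:
        if key in alg_kwargs:
            # Algorithm-specific values win for data settings.
            data_config[key] = alg_kwargs[key]
    return data_config, api_layer_config


data_cfg, layer_cfg = resolve_settings(
    {"nsamples": 512, "seqlen": 2048},
    {"lm_head": {"bits": 8}},
    {"nsamples": 128},
)
print(data_cfg, layer_cfg)
```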

@lkk12014402
Contributor

lkk12014402 commented Mar 31, 2026

@n1ck-guo For the new API usage, would it be better to apply the configs in the order they appear in the config list?
If so, the rotation config probably shouldn’t be applied inside Compressor __init__; instead, all configs should be applied in a loop, as in https://github.com/vllm-project/llm-compressor/blob/main/examples/transform/spinquant_example.py#L19 and http://github.com/vllm-project/llm-compressor/blob/main/src/llmcompressor/pipelines/independent/pipeline.py#L38

@wenhuach21
Contributor

wenhuach21 commented Mar 31, 2026

> @n1ck-guo For the new API usage, would it be better to determine the order of applying configs based on the order in the config list? If so, the rotation config probably shouldn’t be applied inside __init__; instead, all configs should be applied through a loop, like https://github.com/vllm-project/llm-compressor/blob/main/examples/transform/spinquant_example.py#L19 and http://github.com/vllm-project/llm-compressor/blob/main/src/llmcompressor/pipelines/independent/pipeline.py#L38

Yes, you’re right. We should preserve the algorithm order as provided by the user unless there are technical limitations. However, for unsupported or suboptimal orders, such as applying AR before Hadamard, we should log a warning and provide a recommended order.
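
The behavior agreed on here (apply configs in the user's order, but warn when quantization comes before a rotation) might look like the sketch below. `StepSketch`, `check_order`, and `apply_in_order` are hypothetical names, and the two "kind" labels are an assumed way of classifying configs.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("auto_round_sketch")


@dataclass
class StepSketch:
    name: str
    kind: str  # "rotation" or "quantization" (assumed categories)


def check_order(steps):
    """Return warning messages for rotations placed after quantization."""
    messages = []
    seen_quantization = False
    for step in steps:
        if step.kind == "quantization":
            seen_quantization = True
        elif step.kind == "rotation" and seen_quantization:
            messages.append(
                f"{step.name} runs after quantization; "
                "recommended order is rotation first"
            )
    return messages


def apply_in_order(model, steps, apply_fn):
    """Warn about suboptimal orders, then apply steps exactly as listed."""
    for message in check_order(steps):
        logger.warning(message)
    for step in steps:
        model = apply_fn(model, step)
    return model


steps = [StepSketch("autoround", "quantization"), StepSketch("hadamard", "rotation")]
applied = apply_in_order([], steps, lambda model, step: model + [step.name])
print(applied)
```

The user's order is preserved in all cases; the warning only recommends an alternative.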

8 participants