Pull request overview
This PR appears to change the experimental Hadamard-transform workflow (generation + application/patching) and introduces an “inplace” LLaMA2 rotation utility, while also attempting to enable a default Hadamard config from the CLI.
Changes:
- Modified random Hadamard matrix construction and related docs in the transform utilities.
- Refactored Hadamard application/patching logic (apply + WrapperLinear monkey-patches).
- Added an experimental LLaMA2 “inplace” rotation module and changed CLI/BaseCompressor Hadamard configuration behavior.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| auto_round/experimental/transform/utils/hadamard.py | Alters random Hadamard generation and leaves additional commented code. |
| auto_round/experimental/transform/patch_modules.py | Changes WrapperLinear monkey-patching to apply transforms before quantization. |
| auto_round/experimental/transform/apply.py | Updates Hadamard transform application (notably config serialization calls). |
| auto_round/experimental/hadamard_inplace/llama2.py | Adds LLaMA2-specific rotation utilities (currently includes import-time script logic). |
| auto_round/compressors/base.py | Changes how Hadamard config is handled during compressor init. |
| auto_round/main.py | Forces a default Hadamard config when invoking tune(). |
Comments suppressed due to low confidence (2)
auto_round/experimental/transform/apply.py:100

In the input-transform branch, `precision=module.dtype` will fail for `torch.nn.Linear` (no `.dtype` attribute). Use `module.weight.dtype` (and similarly handle `MXQuantLinearBase` as needed) to avoid an `AttributeError`.

```python
if location == "input":
    # activation needs transpose
    input_hadamard_transform = build_hadamard_transform(
        **config.model_dump(),
        location="input",
        inverse=True,
        device="cpu",
        precision=module.dtype,
    )
```
auto_round/experimental/transform/patch_modules.py:64

`_qdq_act_patched` assigns `self.origin_qdq_act = self._qdq_act` after the monkey-patch, so `self._qdq_act` already points to `_qdq_act_patched`. Calling `self.origin_qdq_act(...)` then recurses indefinitely. Capture the original method in a closure (e.g., `orig_qdq_act = WrapperLinear._qdq_act`) before patching and call that instead.

```python
self.origin_qdq_act = self._qdq_act
x = inp_transform(x)
return self.origin_qdq_act(x, act_max_scale, act_max)
WrapperLinear._qdq_weight = _qdq_weight_patched
WrapperLinear._qdq_act = _qdq_act_patched
WrapperLinear._hadamard_patched = True

def patch_wrapperwalayer_forward_to_apply_transform(inp_transform):
```
```python
Q = torch.randint(low=0, high=2, size=(size,), generator=gen, dtype=torch.float64, device=device)  # cpu
# Q = Q.to(device=device)
Q = Q * 2 - 1
```

`torch.randint` only supports integer dtypes; passing `dtype=torch.float64` will raise at runtime. Generate integer values (the default int64, or int8) and cast after `Q = Q * 2 - 1` (or accept a `dtype` argument again) so the function works.

Suggested change:

```python
Q = torch.randint(low=0, high=2, size=(size,), generator=gen, device=device)  # cpu, integer dtype
# Q = Q.to(device=device)
Q = Q * 2 - 1
Q = Q.to(dtype=torch.float64)
```
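The integer-then-cast pattern is independent of torch; a minimal stdlib sketch of the same Rademacher (±1) draw (function name and seed handling are illustrative): sample integers in {0, 1}, map to {-1, +1}, and convert to floats only at the end.

```python
import random

def rademacher(size: int, seed: int = 0) -> list:
    """Draw `size` values uniformly from {-1.0, +1.0}.

    Mirrors the reviewed fix: integer draw first, sign mapping
    second, floating-point cast last.
    """
    gen = random.Random(seed)
    bits = [gen.randint(0, 1) for _ in range(size)]  # integer draw
    signs = [b * 2 - 1 for b in bits]                # {0,1} -> {-1,+1}
    return [float(s) for s in signs]                 # cast last

vals = rademacher(8)
assert set(vals) <= {-1.0, 1.0}
```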
```diff
@@ -70,8 +69,8 @@ def random_hadamard_matrix(
     :param gen: Optional generator random values
     :return: randomly generated hadamard matrix
```

The `random_hadamard_matrix` docstring still documents a `dtype` parameter, but the function signature no longer accepts it. Either reintroduce the `dtype` argument or update the docstring to avoid misleading callers.
```python
# def matmul_hadU(X, transpose=False):
#     n = X.shape[-1]
#     hadK, K = get_hadK(n, transpose)
#     input = X.clone().view(-1, n, 1)
#     output = input.clone()
#     while input.shape[1] > K:
#         input = input.view(input.shape[0], input.shape[1] // 2, 2, input.shape[2])
#         output = output.view(input.shape)
#         output[:, :, 0, :] = input[:, :, 0, :] + input[:, :, 1, :]
#         output[:, :, 1, :] = input[:, :, 0, :] - input[:, :, 1, :]
#         output = output.view(input.shape[0], input.shape[1], -1)
#         (input, output) = (output, input)
#     del output
#
#     if K > 1:
#         # Do not explicitly repeat - OOM
#         # input = torch.bmm(
#         #     hadK.repeat(len(input), 1, 1).to(input.device).to(input.dtype), input)
#         # Use bcast instead
#         input = hadK.view(1, K, K).to(input) @ input
#
#     return input.view(X.shape) / torch.tensor(n).sqrt()
```

Large commented-out implementation at the end of the file adds dead code and makes maintenance harder. Please remove it (or move it to a separate reference doc/benchmark) to keep the utility module focused. The suggestion deletes the block above in full.
```diff
-            if "lm_head" in name:
+            if "lm_head" in name:  # TODO unrobust
                 continue
             _apply_to_module(model, module, config, need_calibration, location)
```

`apply_hadamard_transform` still calls `_apply_to_module(model, module, config, need_calibration, location)`, but `_apply_to_module` now only accepts `(module, config, location)`. This will raise a `TypeError` at runtime; update the call and/or restore the removed parameters.

Suggested change:

```diff
-            _apply_to_module(model, module, config, need_calibration, location)
+            _apply_to_module(module, config, location)
```
```python
# apply hadamard transform
if hadamard_config:
    from auto_round.experimental.transform.apply import apply_hadamard_transform
    from auto_round.experimental.utils import check_supported_schemes, normalize_hadamard_config

    check_supported_schemes(self.scheme)

    self.model = apply_hadamard_transform(
        self.model, hadamard_config, need_calibration=True if self.iters > 0 else False
    )

    self.hadamard_config = normalize_hadamard_config(hadamard_config)
    self.enable_hadamard = True
    # from auto_round.experimental.transform.apply import apply_hadamard_transform
    # from auto_round.experimental.utils import check_supported_schemes, normalize_hadamard_config
    #
    # check_supported_schemes(self.scheme)
    #
    # self.model = apply_hadamard_transform(
    #     self.model, hadamard_config, need_calibration=True if self.iters > 0 else False
    # )
    #
    # self.hadamard_config = normalize_hadamard_config(hadamard_config)
```

The hadamard application block no longer sets `self.hadamard_config`. Since `hadamard_config` is included in `SERIALIZATION_KEYS`, `save_quantized()` will crash on `getattr(self, "hadamard_config")`. Ensure `self.hadamard_config` is always defined (e.g., `{}` when disabled) and keep transform application consistent with that state.
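One defensive pattern is to initialize the attribute unconditionally before the `if hadamard_config:` branch, so serialization can always read it. A sketch with a stand-in class (`CompressorSketch` and its `serialize` method are illustrative, not the project's actual `BaseCompressor`):

```python
class CompressorSketch:
    """Illustrative stand-in for BaseCompressor's init logic."""

    SERIALIZATION_KEYS = ("hadamard_config",)

    def __init__(self, hadamard_config=None):
        # Always defined, so getattr(self, key) never raises during save.
        self.hadamard_config = {}
        self.enable_hadamard = False
        if hadamard_config:
            # ... apply_hadamard_transform(...) would run here ...
            self.hadamard_config = dict(hadamard_config)
            self.enable_hadamard = True

    def serialize(self):
        return {k: getattr(self, k) for k in self.SERIALIZATION_KEYS}

assert CompressorSketch().serialize() == {"hadamard_config": {}}
assert CompressorSketch({"scheme": "default"}).serialize() == {"hadamard_config": {"scheme": "default"}}
```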
```python
model_dtype=args.model_dtype,
momentum=args.momentum,
trust_remote_code=not args.disable_trust_remote_code,
hadamard_config="default",
```

This forces `hadamard_config="default"` for every CLI tune run, changing default behavior and (with the current BaseCompressor changes) triggering the hadamard path without actually applying it / initializing `self.hadamard_config`, which can lead to downstream failures during save/export. Consider making this opt-in via a CLI flag, or keep the previous default of `None`.

Suggested change:

```diff
-        hadamard_config="default",
+        hadamard_config=None,
```
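An opt-in CLI switch could look like the following `argparse` sketch (the flag name and wiring are hypothetical, not the project's actual `main.py`): omitting the flag preserves the previous `None` default, while passing it enables the hadamard path.

```python
import argparse

parser = argparse.ArgumentParser(prog="auto-round-sketch")
# Hypothetical opt-in switch: hadamard stays off unless requested.
parser.add_argument(
    "--hadamard-config",
    nargs="?",
    const="default",  # bare flag -> "default" config
    default=None,     # flag omitted -> previous behavior (disabled)
    help="Enable hadamard transforms (optionally naming a config).",
)

assert parser.parse_args([]).hadamard_config is None
assert parser.parse_args(["--hadamard-config"]).hadamard_config == "default"
assert parser.parse_args(["--hadamard-config", "custom"]).hadamard_config == "custom"
```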
```python
if getattr(self, "applied_weight_hadamard", None):
    with torch.no_grad():
        weight = self.orig_layer.weight
        if weight.device.type == "meta":
            weight = self.orig_layer.get_weight().to(self.device)
```

`_qdq_weight_patched` uses `if getattr(self, "applied_weight_hadamard", None):`, which means the transform block is skipped on the first call (when the attribute is unset/False), and the function then falls through without returning anything. This breaks the expected `(weight_q, scale, zp)` return contract; the condition should likely be inverted and the original quantization/return logic preserved.
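The intended shape can be sketched in plain Python (stand-in class and a toy "transform"; the real method quantizes a torch weight): invert the guard so the transform runs exactly once, then always delegate to the captured original so the `(weight_q, scale, zp)` contract holds on every call.

```python
class LayerSketch:
    """Illustrative stand-in for WrapperLinear."""

    def __init__(self, weight):
        self.weight = weight
        self.applied_weight_hadamard = False

    def _qdq_weight(self):
        # Original logic (stand-in): returns (weight_q, scale, zp).
        return (self.weight, 1.0, 0)

# Capture the original before patching so the patch can delegate to it.
orig_qdq_weight = LayerSketch._qdq_weight

def _qdq_weight_patched(self):
    if not self.applied_weight_hadamard:         # inverted guard: run once
        self.weight = [-w for w in self.weight]  # stand-in for the transform
        self.applied_weight_hadamard = True
    return orig_qdq_weight(self)                 # contract preserved every call

LayerSketch._qdq_weight = _qdq_weight_patched

layer = LayerSketch([1, 2])
assert layer._qdq_weight() == ([-1, -2], 1.0, 0)  # first call transforms
assert layer._qdq_weight() == ([-1, -2], 1.0, 0)  # later calls don't re-apply
```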
```python
weight = weight.t().continuous()
new_weight = w_transform(weight.to(torch.float64))

if is_conv1d:
    new_weight = weight.t().continuous()
```

Typo: `.continuous()` is not a Tensor method (should be `.contiguous()`). Also, in the Conv1D post-transform branch, `new_weight = weight.t().continuous()` uses `weight` instead of `new_weight`, so it discards the transformed result.

Suggested change:

```python
weight = weight.t().contiguous()
new_weight = w_transform(weight.to(torch.float64))

if is_conv1d:
    new_weight = new_weight.t().contiguous()
```
```python
# Rotate the weights
if args.rotate:
    rotation_utils.fuse_layer_norms(model)
    rotation_utils.rotate_model(model, args)
    utils.cleanup_memory(verbos=True)

    quant_utils.add_actquant(model)  # Add Activation Wrapper to the model
    qlayers = quant_utils.find_qlayers(model)
    for name in qlayers:
        if "down_proj" in name:
            had_K, K = hadamard_utils.get_hadK(model.config.intermediate_size)
            qlayers[name].online_full_had = True
            qlayers[name].had_K = had_K
            qlayers[name].K = K
            qlayers[name].fp32_had = args.fp32_had
        if "o_proj" in name:
            had_K, K = hadamard_utils.get_hadK(model.config.num_attention_heads)
            qlayers[name].online_partial_had = True
            qlayers[name].had_K = had_K
            qlayers[name].K = K
            qlayers[name].had_dim = model.config.hidden_size // model.config.num_attention_heads
            qlayers[name].fp32_had = args.fp32_had
else:
    quant_utils.add_actquant(model)  # Add Activation Wrapper to the model as the rest of the code assumes it is present
```

This module contains top-level script logic (`if args.rotate: ... else: ...`) that references undefined names (`args`, `rotation_utils`, `utils`, `quant_utils`, `hadamard_utils`) and will raise on import. If this is intended as a runnable script, move it under a `main()` guarded by `if __name__ == "__main__":` and add the required imports/args parsing; otherwise remove it from the library module. The suggestion deletes the block above in full.
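The import-safety point can be sketched with stdlib-only stand-ins (names like `main` and the `--rotate` flag are illustrative): module-level script code runs on import, while the guarded pattern keeps the module importable and lets callers invoke the logic explicitly with their own argv.

```python
import argparse

def main(argv=None):
    """All script logic lives here; importing the module runs nothing."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--rotate", action="store_true")
    args = parser.parse_args(argv)
    # ... rotation / activation-wrapping would happen here ...
    return "rotated" if args.rotate else "plain"

if __name__ == "__main__":
    main()

# Importers can still drive the logic directly:
assert main([]) == "plain"
assert main(["--rotate"]) == "rotated"
```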