Conversation
📝 Walkthrough

The PR introduces MoE expert token counting and visualization capabilities by adding a utility to generate HTML reports of per-expert token distribution, integrating token tracking into the MoE quantization pipeline with calibration support, implementing MoE block detection, and providing comprehensive test coverage for the new functionality.
Sequence Diagram

sequenceDiagram
participant Model as Model
participant MoE as _QuantSparseMoe
participant Gate as Gate/Router
participant Hook as Forward Hook
participant Export as HF Export
Model->>MoE: forward(is_calib=True)
MoE->>MoE: Expand gate.top_k to num_experts
MoE->>Gate: forward(all tokens to all experts)
Gate->>Hook: trigger forward hook
Hook->>Hook: accumulate per-expert token counts
Hook-->>MoE: return
MoE->>MoE: Restore original gate.top_k
MoE->>Gate: forward(normal routing with is_calib=False)
Gate-->>MoE: return routed output
MoE-->>Model: return output
Model->>Export: export_hf_checkpoint(model)
Export->>MoE: read expert_token_count
Export->>Export: generate HTML token distribution table
Export->>Export: write .moe.html file
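The counting step in the diagram can be pictured with a small, self-contained toy. This is only a sketch: `ToyMoE`, its linear gate, and the hook body are hypothetical illustrations of the pattern (a forward hook on the gate accumulating per-expert token counts), not the actual `_QuantSparseMoe` implementation.

```python
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    """Hypothetical stand-in for an MoE block whose gate emits router logits."""

    def __init__(self, hidden_size=16, num_experts=4, top_k=2):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts)
        self.top_k = top_k
        self.num_experts = num_experts
        self.expert_token_count = torch.zeros(num_experts, dtype=torch.long)
        self._count_expert_tokens = True
        self.gate.register_forward_hook(self._gate_forward_hook)

    def _gate_forward_hook(self, module, inputs, output):
        if not self._count_expert_tokens:
            return
        # output: router logits of shape (num_tokens, num_experts)
        _, selected = torch.topk(output, self.top_k, dim=-1)
        self.expert_token_count += torch.bincount(
            selected.flatten(), minlength=self.num_experts
        )

moe = ToyMoE()
with torch.no_grad():
    moe.gate(torch.randn(8, 16))
# Each of the 8 tokens is assigned to top_k experts, so counts sum to 8 * top_k.
assert moe.expert_token_count.sum().item() == 8 * moe.top_k
```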
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## main #900 +/- ##
==========================================
- Coverage 73.74% 73.54% -0.21%
==========================================
Files 199 205 +6
Lines 21183 22000 +817
==========================================
+ Hits 15621 16179 +558
- Misses 5562 5821 +259

☔ View full report in Codecov by Sentry.
Force-pushed 1dd5bb1 to 578ede0, then 578ede0 to 919be0f.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/quantization/plugins/huggingface.py (1)
Lines 490-506: ⚠️ Potential issue | 🟠 Major

`assert hasattr(self, "gate")` will fire for fallback-detected blocks on transformers ≥ 5.0.

`_is_sparse_moe_block` has two detection paths:

- Primary: `gate.top_k` + `gate.num_experts` — gate is present.
- Fallback: `block.top_k` + `block.num_experts` (no `gate` required, e.g. `gate` is a plain `nn.Linear`).

When a module is registered via the fallback path and `TRANSFORMERS_VERSION_GE_5_0 = True`, the forward enters the >= 5.0 branch at Line 490 which immediately asserts `hasattr(self, "gate")`. A plain-linear gate module satisfies the assertion, but its `gate.top_k` and `gate.num_experts` attributes won't exist, causing `AttributeError` on Line 494. The >= 5.0 branch should only be taken for modules where `gate` genuinely exposes `top_k`/`num_experts` (the primary detection path).

🛡️ Proposed fix: guard v5.0 branch on gate attribute structure

-if TRANSFORMERS_VERSION_GE_5_0:
-    assert hasattr(self, "gate")
+if TRANSFORMERS_VERSION_GE_5_0 and hasattr(self, "gate") and hasattr(self.gate, "top_k"):
     # Path for transformers >= 5.0

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 490 - 506, The current forward branch for TRANSFORMERS_VERSION_GE_5_0 wrongly assumes self.gate always exposes top_k and num_experts; replace the assert with a conditional that checks hasattr(self, "gate") and hasattr(self.gate, "top_k") and hasattr(self.gate, "num_experts") and only then run the "gate" path (save/modify gate.top_k, call super().forward, restore); otherwise fall back to the legacy path that uses self.top_k/self.num_experts/self.experts (same logic as the else branch). Update the forward method around the TRANSFORMERS_VERSION_GE_5_0 check to guard on the gate attribute structure instead of unconditionally taking the v5 branch.
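The guarded calibration path the reviewer describes can be exercised with a standalone toy: take the "gate owns routing" branch only when the gate actually exposes `top_k`/`num_experts`, and always restore the saved `top_k`. Everything below (`Router`, `CalibratedMoE`, the `is_calib` flag) is a hypothetical sketch, not the actual `_QuantSparseMoe.forward`.

```python
import torch
import torch.nn as nn

class Router(nn.Linear):
    """Toy gate that, like newer HF routers, carries top_k/num_experts itself."""

    def __init__(self, hidden_size, num_experts, top_k):
        super().__init__(hidden_size, num_experts)
        self.top_k = top_k
        self.num_experts = num_experts

class CalibratedMoE(nn.Module):
    def __init__(self, gate):
        super().__init__()
        self.gate = gate

    def forward(self, x, is_calib=False):
        gate_owns_routing = hasattr(self.gate, "top_k") and hasattr(self.gate, "num_experts")
        if is_calib and gate_owns_routing:
            saved_top_k = self.gate.top_k
            self.gate.top_k = self.gate.num_experts  # route every token to every expert
            try:
                self.gate(x)  # calibration pass; counting hooks would fire here
            finally:
                self.gate.top_k = saved_top_k  # always restore normal routing
        return self.gate(x)  # normal forward (routing to experts elided in this toy)

moe = CalibratedMoE(Router(16, 4, 2))        # gate exposes routing attrs -> calibration path
moe_plain = CalibratedMoE(nn.Linear(16, 4))  # plain nn.Linear gate -> guard skips that path
_ = moe(torch.randn(8, 16), is_calib=True)
_ = moe_plain(torch.randn(8, 16), is_calib=True)
```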
🧹 Nitpick comments (1)
modelopt/torch/export/moe_utils.py (1)
Line 73: Specify `encoding` in `write_text` for cross-platform safety.

`Path.write_text` defaults to the locale encoding (which can be non-UTF-8 on Windows). Since the HTML content may contain model-layer names with non-ASCII characters in principle, explicitly specifying `encoding="utf-8"` is the safer choice and aligns with the `<meta charset>` expectation of any browser opening the file.

♻️ Proposed fix

-output_path.write_text(html_content)
+output_path.write_text(html_content, encoding="utf-8")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/moe_utils.py` at line 73, The call to Path.write_text when saving the generated HTML uses the system locale by default; update the write operation that calls output_path.write_text(html_content) to explicitly pass encoding="utf-8" so the saved file matches the HTML meta charset and avoids non-ASCII issues—locate the write in moe_utils.py where html_content is written to output_path and add the encoding argument to the write_text call.
num_experts = rows[0][1].shape[0]
html_parts = [
    "<html><head><style>",
    "table { border-collapse: collapse; font-family: monospace; }",
    "th, td { border: 1px solid #ccc; padding: 4px 8px; text-align: right; }",
    "th { background: #f0f0f0; }",
    "</style></head><body>",
    "<h2>Expert Token Counts (per MoE layer)</h2>",
    "<table><tr><th>Layer/Expert</th>",
]
html_parts.extend(f"<th>{i}</th>" for i in range(num_experts))
html_parts.append("</tr>")
Assumes all MoE layers have the same number of experts.
num_experts = rows[0][1].shape[0] is used to build the table header once, but subsequent rows each emit one <td> per their own expert count. If any two MoE layers in a model have different expert counts (e.g., fine-grained vs. shared experts in DeepSeek-style models, or future heterogeneous architectures), the HTML table will have misaligned columns — rows with fewer experts than the header will be missing cells, and rows with more will overflow.
🛡️ Proposed fix: build header dynamically from max width
-num_experts = rows[0][1].shape[0]
+num_experts = max(counts.shape[0] for _, counts in rows)
html_parts = [
...
"<table><tr><th>Layer/Expert</th>",
]
html_parts.extend(f"<th>{i}</th>" for i in range(num_experts))With this fix, each row loop may also need padding:
for c in counts.tolist():
...
html_parts.append(f"<td{style}>{c}</td>")
+# Pad missing expert columns for layers with fewer experts
+for _ in range(num_experts - len(counts)):
+    html_parts.append("<td>-</td>")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/export/moe_utils.py` around lines 41 - 52, The code assumes a
constant expert count by using num_experts = rows[0][1].shape[0] to build the
table header, which breaks when MoE layers have different expert counts; change
the header generation to compute the maximum expert count across rows (e.g.,
max(r[1].shape[0] for r in rows)) and use that to create html_parts header
cells, and when emitting each row (the loop that writes per-layer <td> cells)
pad shorter rows with empty <td></td> cells (or colspan equivalently) up to that
max so every row has the same number of columns and the table remains aligned.
Ensure you update references to num_experts accordingly and keep
html_parts.extend and row emission logic consistent.
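A self-contained sketch of the padded-table idea from the comment above. `build_expert_count_table` and its argument shape are hypothetical names for illustration; the real code lives in modelopt/torch/export/moe_utils.py and may format things differently.

```python
import torch

def build_expert_count_table(rows: list[tuple[str, torch.Tensor]]) -> str:
    """Render (layer_name, counts) pairs as an HTML table, padding to the widest layer."""
    num_experts = max(counts.shape[0] for _, counts in rows)
    parts = ["<table><tr><th>Layer/Expert</th>"]
    parts.extend(f"<th>{i}</th>" for i in range(num_experts))
    parts.append("</tr>")
    for name, counts in rows:
        parts.append(f"<tr><td>{name}</td>")
        parts.extend(f"<td>{c}</td>" for c in counts.tolist())
        # Layers with fewer experts get placeholder cells so columns stay aligned.
        parts.extend("<td>-</td>" for _ in range(num_experts - counts.shape[0]))
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)

rows = [("layers.0.mlp", torch.tensor([5, 3, 0, 8])), ("layers.1.mlp", torch.tensor([4, 4]))]
html = build_expert_count_table(rows)
```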
try:
    post_state_dict, hf_quant_config = _export_transformers_checkpoint(model, dtype)

    save_expert_token_count_table(model, export_dir)
save_expert_token_count_table failure aborts model export.
The call sits inside the try block alongside _export_transformers_checkpoint. If writing .moe.html fails for any reason (disk full, permission error, unexpected exception for an edge-case module), the exception propagates to the outer except, which warns and re-raises — preventing model.save_pretrained from running. A diagnostic HTML report should not gate the primary export operation.
♻️ Proposed fix: isolate the diagnostic step
+ try:
save_expert_token_count_table(model, export_dir)
+ except Exception as report_err:
+ warnings.warn(
+ f"Failed to save expert token count table: {report_err}. "
+ "Model export will continue."
+    )

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/export/unified_export_hf.py` around lines 1004 - 1007, The
call to save_expert_token_count_table is inside the same try as
_export_transformers_checkpoint so any failure aborts the whole export and
prevents model.save_pretrained from running; move or guard this diagnostic write
so it cannot raise out of the main export flow — either call
save_expert_token_count_table after the try/except that handles
_export_transformers_checkpoint, or wrap
save_expert_token_count_table(model, export_dir) in its own try/except that
catches Exception, logs a non-fatal warning (including the exception), and
continues; ensure post_state_dict and hf_quant_config from
_export_transformers_checkpoint remain unaffected and model.save_pretrained is
always called even if the .moe.html write fails.
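One way to apply the suggestion, sketched with made-up placeholder names (`write_diagnostic_report`, `save_fn`) rather than the actual modelopt API: the diagnostic write gets its own try/except, and the primary save always runs.

```python
import warnings
from pathlib import Path

def write_diagnostic_report(export_dir: str, html: str) -> None:
    # Explicit UTF-8 also addresses the write_text nitpick above.
    (Path(export_dir) / "report.moe.html").write_text(html, encoding="utf-8")

def export_with_optional_report(export_dir: str, html: str, save_fn) -> None:
    try:
        write_diagnostic_report(export_dir, html)
    except Exception as err:
        # Never let a diagnostic artifact block the checkpoint write.
        warnings.warn(f"Failed to write MoE report: {err}. Export continues.")
    save_fn(export_dir)  # the primary export always runs
```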
 def _setup(self):
-    pass
+    num_experts = 0
+    if hasattr(self, "gate") and hasattr(self.gate, "num_experts"):
+        num_experts = self.gate.num_experts
+    elif hasattr(self, "num_experts"):
+        num_experts = self.num_experts
+    elif hasattr(self, "experts") and hasattr(self.experts, "num_experts"):
+        num_experts = self.experts.num_experts
+
+    self.expert_token_count = torch.zeros(num_experts, dtype=torch.long, device="cpu")
+    self._count_expert_tokens = False
+
+    if hasattr(self, "gate"):
+        self.gate.register_forward_hook(self._gate_forward_hook)
Silent expert_token_count of shape (0,) when num_experts cannot be resolved.
If none of the three attribute lookups (gate.num_experts, self.num_experts, experts.num_experts) succeeds, num_experts stays 0 and expert_token_count is silently initialized to an empty tensor. The save_expert_token_count_table function will silently skip this layer (due to the numel() > 0 guard), giving no indication that the layer's routing was not tracked. A warning here would help debugging.
🛡️ Proposed fix
+import warnings
...
self.expert_token_count = torch.zeros(num_experts, dtype=torch.long, device="cpu")
+if num_experts == 0:
+ warnings.warn(
+ f"Could not determine num_experts for {type(self).__name__}; "
+ "expert_token_count will not be tracked.",
+ stacklevel=2,
+    )

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 452 - 465,
The _setup method can silently create an empty expert_token_count when
num_experts remains 0; update _setup (method name: _setup, symbols: self.gate,
gate.num_experts, self.num_experts, self.experts, experts.num_experts,
expert_token_count, save_expert_token_count_table) to detect when num_experts ==
0 after the three lookups and emit a clear warning (e.g., warnings.warn or
self.logger.warning) indicating the layer's routing won't be tracked and naming
the module (use self.__class__.__name__ or similar), then proceed (or skip
registering the hook) so callers know to fix the model instead of silently
losing tracking.
converted = QuantModuleRegistry.convert(moe_block)
assert hasattr(converted, "expert_token_count")
expected_num_experts = moe_block.num_experts if hasattr(moe_block, "num_experts") else 0
assert converted.expert_token_count.shape == (expected_num_experts,)
assert converted.expert_token_count.dtype == torch.long
assert (converted.expert_token_count == 0).all()
expected_num_experts doesn't mirror _setup's fallback chain, creating a weak assertion.
_setup resolves num_experts in this order: gate.num_experts → self.num_experts → experts.num_experts. The test at Line 196 only checks moe_block.num_experts (block-level). For Qwen3 MoE on transformers ≥ 5.0, if num_experts lives on gate.num_experts and not at block level, expected_num_experts becomes 0 and the assertion converted.expert_token_count.shape == (0,) vacuously passes even when _setup correctly initialized a non-zero count.
🛡️ Proposed fix: mirror the _setup resolution order
-expected_num_experts = moe_block.num_experts if hasattr(moe_block, "num_experts") else 0
+if hasattr(moe_block, "gate") and hasattr(moe_block.gate, "num_experts"):
+ expected_num_experts = moe_block.gate.num_experts
+elif hasattr(moe_block, "num_experts"):
+ expected_num_experts = moe_block.num_experts
+elif hasattr(moe_block, "experts") and hasattr(moe_block.experts, "num_experts"):
+ expected_num_experts = moe_block.experts.num_experts
+else:
+    expected_num_experts = 0

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/torch/quantization/plugins/test_sparse_moe.py` around lines 193 -
199, The test's expected_num_experts currently only checks moe_block.num_experts
which doesn't follow the _setup resolution; update the test around
QuantModuleRegistry.convert and expected_num_experts to mirror _setup's fallback
order by resolving num_experts as gate.num_experts → moe_block.num_experts
(self.num_experts) → moe_block.experts.num_experts, then assert
converted.expert_token_count.shape equals (resolved_num_experts,) and
dtype/zero-content as before so the assertion reflects the same resolution logic
used in _setup.
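The fallback chain that both `_setup` and the test should agree on can be factored into a tiny helper. `resolve_num_experts` is a hypothetical name used only for this sketch; it is not part of the PR.

```python
def resolve_num_experts(block) -> int:
    """Mirror the resolution order: gate.num_experts -> block.num_experts -> experts.num_experts."""
    gate = getattr(block, "gate", None)
    if gate is not None and hasattr(gate, "num_experts"):
        return gate.num_experts
    if hasattr(block, "num_experts"):
        return block.num_experts
    experts = getattr(block, "experts", None)
    if experts is not None and hasattr(experts, "num_experts"):
        return experts.num_experts
    return 0
```

In the test, `expected_num_experts = resolve_num_experts(moe_block)` would then track whatever `_setup` actually resolved.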
converted = QuantModuleRegistry.convert(moe_block)

# Reset counts and enable counting
converted.expert_token_count.zero_()
converted._count_expert_tokens = True

hidden_size = converted.gate.in_features
x = torch.randn(8, hidden_size)
with torch.no_grad():
    converted.gate(x)

# After one gate call with counting enabled, total assigned tokens should equal
# num_tokens * top_k
top_k = converted.top_k if hasattr(converted, "top_k") else converted.gate.top_k
total_assigned = converted.expert_token_count.sum().item()
assert total_assigned == 8 * top_k
Add TRANSFORMERS_VERSION_GE_5_0 check before accessing gate attributes at line 301.
Line 301 accesses converted.gate.in_features without version checking. For transformers ≥ 5.0, converted.gate is a TopKRouter (not nn.Linear), so in_features is not a direct attribute. The test already uses this version check pattern at lines 264–284 for handling the same gate type difference; apply the same pattern here to get the input size correctly for both versions.
The exact attribute on TopKRouter that holds the input size should be verified from your transformers installation, but the structure should match the pattern already in use for top_k access:
-hidden_size = converted.gate.in_features
+if TRANSFORMERS_VERSION_GE_5_0:
+ hidden_size = converted.gate.linear.in_features # or appropriate TopKRouter attr
+else:
+    hidden_size = converted.gate.in_features

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/torch/quantization/plugins/test_sparse_moe.py` around lines 295 -
310, The test accesses converted.gate.in_features which breaks for transformers
≥5.0 where converted.gate is a TopKRouter; update the hidden_size/top_k
retrieval to branch on TRANSFORMERS_VERSION_GE_5_0 (same pattern used earlier
lines 264–284): when TRANSFORMERS_VERSION_GE_5_0 use the TopKRouter's input-size
attribute and top_k attribute as used previously, otherwise use
converted.gate.in_features and converted.gate.top_k; ensure you reference
converted.gate and converted.top_k (or converted.gate.top_k) consistently so the
subsequent tensor x creation and top_k calculation work for both versions.
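If the exact TopKRouter attribute turns out to vary across transformers releases, a defensive helper can keep the test readable. The fallbacks below are assumptions (a router may wrap an inner nn.Linear, or hold a raw weight of shape (num_experts, hidden_size)); verify them against the installed transformers version before relying on this sketch.

```python
import torch.nn as nn

def gate_hidden_size(gate: nn.Module) -> int:
    if isinstance(gate, nn.Linear):
        return gate.in_features  # transformers < 5.0: plain linear gate
    inner = getattr(gate, "linear", None)
    if isinstance(inner, nn.Linear):
        return inner.in_features  # router wrapping an nn.Linear (assumed)
    weight = getattr(gate, "weight", None)
    if weight is not None:
        return weight.shape[-1]  # raw (num_experts, hidden_size) weight (assumed)
    raise AttributeError(f"Cannot infer hidden size from {type(gate).__name__}")
```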
What does this PR do?
Type of change: New feature, new tests
Overview: Replace hardcoded per-model MoE class registrations (Mixtral, Qwen2Moe, Qwen3Moe, Qwen3Next, Llama4TextMoe, Qwen3VLMoe, MiniMaxM2, etc.) with a single generic auto-detection mechanism (register_sparse_moe_on_the_fly) that walks the model tree and identifies MoE blocks by their structural attributes (gate + experts with top_k/num_experts). This makes MoE quantization forward-compatible with new HuggingFace MoE architectures without requiring explicit registration for each model family.

Additionally, this PR:

- Saves an HTML report of per-expert token counts at export time (save_expert_token_count_table), highlighting under-utilized experts.
- Handles the topk -> top_k attribute name change for transformers >= 5.0 compatibility.

Usage
Auto-detection is transparent -- no user-facing API changes are needed. Any HuggingFace MoE model with the standard gate/experts pattern is automatically detected and quantized:

import modelopt.torch.quantization as mtq

# Any HuggingFace MoE model (Mixtral, Qwen3Moe, DeepSeek, etc.)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-30B-A3B")
mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

# During export, an .moe.html report with per-expert token counts is saved automatically
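For reference, the structural criterion described above amounts to a check like the sketch below. This is a simplified illustration only; the real register_sparse_moe_on_the_fly lives in the quantization plugins and covers more cases.

```python
import torch.nn as nn

def looks_like_sparse_moe(module: nn.Module) -> bool:
    if not hasattr(module, "experts"):
        return False
    gate = getattr(module, "gate", None)
    if gate is None:
        return False
    # Primary path: the gate itself exposes the routing attributes.
    if hasattr(gate, "top_k") and hasattr(gate, "num_experts"):
        return True
    # Fallback path: routing attributes live on the block (gate may be a plain nn.Linear).
    return hasattr(module, "top_k") and hasattr(module, "num_experts")

def find_moe_blocks(model: nn.Module) -> list[str]:
    return [name for name, m in model.named_modules() if looks_like_sparse_moe(m)]
```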
Testing
Unit tests; also tested exporting a Qwen MoE model.
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
New Features
Tests