
Auto detect MOE layers #900

Open
cjluo-nv wants to merge 5 commits into main from chenjiel/moe_detection

Conversation


@cjluo-nv cjluo-nv commented Feb 17, 2026

What does this PR do?

Type of change: New feature, new tests

Overview: Replace hardcoded per-model MoE class registrations (Mixtral, Qwen2Moe, Qwen3Moe, Qwen3Next, Llama4TextMoe, Qwen3VLMoe, MiniMaxM2, etc.) with a single generic auto-detection mechanism (register_sparse_moe_on_the_fly) that walks the model tree and identifies MoE blocks by their structural attributes (gate + experts with top_k/num_experts). This makes MoE quantization forward-compatible with new HuggingFace MoE architectures without requiring explicit registration for each model family.
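To make the detection criterion concrete, here is a minimal sketch of such a structural check. This is an illustration only, not the PR's register_sparse_moe_on_the_fly / _is_sparse_moe_block code; the helper names below are hypothetical.

import torch.nn as nn

def looks_like_sparse_moe_block(module: nn.Module) -> bool:
    # Hypothetical structural test: a block qualifies if it has a gate and an
    # experts container, and the routing parameters (top_k / num_experts) are
    # exposed either on the gate or on the block itself.
    if not (hasattr(module, "gate") and hasattr(module, "experts")):
        return False
    gate_routed = hasattr(module.gate, "top_k") and hasattr(module.gate, "num_experts")
    block_routed = hasattr(module, "top_k") and hasattr(module, "num_experts")
    return gate_routed or block_routed

def find_sparse_moe_blocks(model: nn.Module) -> dict[str, nn.Module]:
    # Walk the module tree once and collect every block matching the pattern,
    # so new MoE architectures need no per-model registration.
    return {name: m for name, m in model.named_modules() if looks_like_sparse_moe_block(m)}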

Additionally, this PR:

  • Tracks per-expert token routing counts during calibration via a gate forward hook, enabling visibility into expert utilization (see the hook sketch after this list).
  • Saves an HTML report of expert token counts during export (save_expert_token_count_table), highlighting under-utilized experts.
  • Fixes the topk -> top_k attribute name for transformers >= 5.0 compatibility.
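A rough sketch of what such a count-accumulating gate hook can look like is below. This is illustrative only; the PR's actual hook is _QuantSparseMoe._gate_forward_hook, and the output format of real routers varies by architecture.

import torch

def make_gate_count_hook(expert_token_count: torch.Tensor, top_k: int):
    # Hypothetical hook factory: after the gate/router runs, pick the top-k
    # experts per token from its logits and bump the per-expert counters.
    num_experts = expert_token_count.numel()

    def hook(module, inputs, output):
        logits = output[0] if isinstance(output, tuple) else output
        logits = logits.reshape(-1, num_experts)                  # (num_tokens, num_experts)
        selected = logits.topk(top_k, dim=-1).indices.reshape(-1)  # flat expert ids
        counts = torch.bincount(selected, minlength=num_experts)
        expert_token_count += counts.to(expert_token_count.device)

    return hook

# Usage (assumed shapes): counts = torch.zeros(8, dtype=torch.long)
# gate.register_forward_hook(make_gate_count_hook(counts, top_k=2))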

Usage

Auto-detection is transparent -- no user-facing API changes are needed. Any HuggingFace MoE model with the standard gate/experts pattern is automatically detected and quantized:

import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM

# Any HuggingFace MoE model (Mixtral, Qwen3Moe, DeepSeek, etc.)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-30B-A3B")

# forward_loop is the user-supplied calibration loop, unchanged from existing workflows
mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

# During export, an .moe.html report with per-expert token counts is saved automatically.
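For context, the report is produced as a side effect of the usual HF checkpoint export; assuming export_hf_checkpoint from modelopt.torch.export is the entry point (as in the sequence diagram further down), a call looks roughly like this. The directory name is an example.

from modelopt.torch.export import export_hf_checkpoint

# The exact report filename is chosen by the export code; the table covers every
# detected MoE layer that accumulated counts during calibration.
export_hf_checkpoint(model, export_dir="qwen3-30b-a3b-int8")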

Testing

Unit tests; also tested exporting a Qwen MoE model.

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

  • New Features

    • Added expert token count visualization for Mixture of Experts models, exported as HTML reports during model export.
    • Enhanced sparse MoE quantization with improved calibration-aware routing and automatic model block detection.
  • Tests

    • Added comprehensive test suite for sparse MoE quantization validation.


copy-pr-bot bot commented Feb 17, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.


coderabbitai bot commented Feb 17, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

The PR introduces MoE expert token counting and visualization capabilities by adding a utility to generate HTML reports of per-expert token distribution, integrating token tracking into the MoE quantization pipeline with calibration support, implementing MoE block detection, and providing comprehensive test coverage for the new functionality.

Changes

MoE Token Counting Utilities (modelopt/torch/export/moe_utils.py, modelopt/torch/export/unified_export_hf.py):
New utility function generates HTML tables visualizing expert token counts per MoE layer, with conditional cell styling for imbalanced routing. Integrated into the HF export pipeline to save tables during checkpoint export.

MoE Quantization Plugin Enhancements (modelopt/torch/quantization/plugins/huggingface.py):
Enhanced _QuantSparseMoe with calibration-mode support, expert token tracking via forward hooks, and temporary full-expert routing. Added _is_sparse_moe_block detector for structural MoE identification. Replaced the MiniMax-specific registration with register_sparse_moe_on_the_fly for generic sparse MoE detection.

Sparse MoE Test Coverage (tests/unit/torch/quantization/plugins/test_sparse_moe.py):
Comprehensive test suite validating MoE block detection logic, expert token count initialization, calibration routing behavior, token accumulation via forward hooks, and integration with QuantModuleRegistry.

Sequence Diagram

sequenceDiagram
    participant Model as Model
    participant MoE as _QuantSparseMoe
    participant Gate as Gate/Router
    participant Hook as Forward Hook
    participant Export as HF Export

    Model->>MoE: forward(is_calib=True)
    MoE->>MoE: Expand gate.top_k to num_experts
    MoE->>Gate: forward(all tokens to all experts)
    Gate->>Hook: trigger forward hook
    Hook->>Hook: accumulate per-expert token counts
    Hook-->>MoE: return
    MoE->>MoE: Restore original gate.top_k
    MoE->>Gate: forward(normal routing with is_calib=False)
    Gate-->>MoE: return routed output
    MoE-->>Model: return output

    Model->>Export: export_hf_checkpoint(model)
    Export->>MoE: read expert_token_count
    Export->>Export: generate HTML token distribution table
    Export->>Export: write .moe.html file
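The calibration leg of the diagram (temporarily widening the router's top_k so every expert receives calibration tokens, then restoring normal routing) can be approximated as follows. This is a simplified stand-in for _QuantSparseMoe.forward with illustrative names, not the actual implementation.

import torch

def run_calibration_forward(moe_block, hidden_states):
    # Simplified stand-in: widen routing so all experts see the calibration
    # batch, run the block once, then restore the original sparse top-k.
    gate = moe_block.gate
    original_top_k = gate.top_k
    try:
        gate.top_k = gate.num_experts      # dense routing: all tokens to all experts
        with torch.no_grad():
            moe_block(hidden_states)       # experts gather calibration statistics
    finally:
        gate.top_k = original_top_k        # normal sparse routing is restored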

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage ⚠️ Warning: docstring coverage is 47.06%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
Title check ✅ Passed: the title 'Auto detect MOE layers' directly summarizes the main change in the changeset: implementing generic auto-detection of MoE blocks to replace hardcoded per-model registrations.




codecov bot commented Feb 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.54%. Comparing base (3801923) to head (4b4ef63).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #900      +/-   ##
==========================================
- Coverage   73.74%   73.54%   -0.21%     
==========================================
  Files         199      205       +6     
  Lines       21183    22000     +817     
==========================================
+ Hits        15621    16179     +558     
- Misses       5562     5821     +259     

☔ View full report in Codecov by Sentry.
@cjluo-nv cjluo-nv force-pushed the chenjiel/moe_detection branch from 1dd5bb1 to 578ede0 on February 17, 2026 23:07
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv force-pushed the chenjiel/moe_detection branch from 578ede0 to 919be0f on February 17, 2026 23:15
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv marked this pull request as ready for review February 18, 2026 08:08
@cjluo-nv cjluo-nv requested review from a team as code owners February 18, 2026 08:08
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv requested a review from a team as a code owner February 18, 2026 08:15
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/quantization/plugins/huggingface.py (1)

490-506: ⚠️ Potential issue | 🟠 Major

assert hasattr(self, "gate") will fire for fallback-detected blocks on transformers ≥ 5.0.

_is_sparse_moe_block has two detection paths:

  1. Primary: gate.top_k + gate.num_experts — gate is present.
  2. Fallback: block.top_k + block.num_experts (no gate required, e.g. gate is a plain nn.Linear).

When a module is registered via the fallback path and TRANSFORMERS_VERSION_GE_5_0 = True, the forward enters the >= 5.0 branch at Line 490 which immediately asserts hasattr(self, "gate"). A plain-linear gate module satisfies the assertion, but its gate.top_k and gate.num_experts attributes won't exist, causing AttributeError on Line 494. The >= 5.0 branch should only be taken for modules where gate genuinely exposes top_k/num_experts (the primary detection path).

🛡️ Proposed fix: guard v5.0 branch on gate attribute structure
-if TRANSFORMERS_VERSION_GE_5_0:
-    assert hasattr(self, "gate")
+if TRANSFORMERS_VERSION_GE_5_0 and hasattr(self, "gate") and hasattr(self.gate, "top_k"):
     # Path for transformers >= 5.0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 490 - 506,
The current forward branch for TRANSFORMERS_VERSION_GE_5_0 wrongly assumes
self.gate always exposes top_k and num_experts; replace the assert with a
conditional that checks hasattr(self, "gate") and hasattr(self.gate, "top_k")
and hasattr(self.gate, "num_experts") and only then run the "gate" path
(save/modify gate.top_k, call super().forward, restore); otherwise fall back to
the legacy path that uses self.top_k/self.num_experts/self.experts (same logic
as the else branch). Update the forward method around the
TRANSFORMERS_VERSION_GE_5_0 check to guard on the gate attribute structure
instead of unconditionally taking the v5 branch.
🧹 Nitpick comments (1)
modelopt/torch/export/moe_utils.py (1)

73-73: Specify encoding in write_text for cross-platform safety.

Path.write_text defaults to the locale encoding (which can be non-UTF-8 on Windows). Since the HTML content may contain model-layer names with non-ASCII characters in principle, explicitly specifying encoding="utf-8" is the safer choice and aligns the <meta charset> expectation of any browser opening the file.

♻️ Proposed fix
-output_path.write_text(html_content)
+output_path.write_text(html_content, encoding="utf-8")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/moe_utils.py` at line 73, The call to Path.write_text
when saving the generated HTML uses the system locale by default; update the
write operation that calls output_path.write_text(html_content) to explicitly
pass encoding="utf-8" so the saved file matches the HTML meta charset and avoids
non-ASCII issues—locate the write in moe_utils.py where html_content is written
to output_path and add the encoding argument to the write_text call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/export/moe_utils.py`:
- Around line 41-52: The code assumes a constant expert count by using
num_experts = rows[0][1].shape[0] to build the table header, which breaks when
MoE layers have different expert counts; change the header generation to compute
the maximum expert count across rows (e.g., max(r[1].shape[0] for r in rows))
and use that to create html_parts header cells, and when emitting each row (the
loop that writes per-layer <td> cells) pad shorter rows with empty <td></td>
cells (or colspan equivalently) up to that max so every row has the same number
of columns and the table remains aligned. Ensure you update references to
num_experts accordingly and keep html_parts.extend and row emission logic
consistent.

In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 1004-1007: The call to save_expert_token_count_table is inside the
same try as _export_transformers_checkpoint so any failure aborts the whole
export and prevents model.save_pretrained from running; move or guard this
diagnostic write so it cannot raise out of the main export flow — either call
save_expert_token_count_table after the try/except that handles
_export_transformers_checkpoint, or wrap
save_expert_token_count_table(export_dir, model) in its own try/except that
catches Exception, logs a non-fatal warning (including the exception), and
continues; ensure post_state_dict and hf_quant_config from
_export_transformers_checkpoint remain unaffected and model.save_pretrained is
always called even if the .moe.html write fails.

In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 452-465: The _setup method can silently create an empty
expert_token_count when num_experts remains 0; update _setup (method name:
_setup, symbols: self.gate, gate.num_experts, self.num_experts, self.experts,
experts.num_experts, expert_token_count, save_expert_token_count_table) to
detect when num_experts == 0 after the three lookups and emit a clear warning
(e.g., warnings.warn or self.logger.warning) indicating the layer's routing
won't be tracked and naming the module (use self.__class__.__name__ or similar),
then proceed (or skip registering the hook) so callers know to fix the model
instead of silently losing tracking.

In `@tests/unit/torch/quantization/plugins/test_sparse_moe.py`:
- Around line 193-199: The test's expected_num_experts currently only checks
moe_block.num_experts which doesn't follow the _setup resolution; update the
test around QuantModuleRegistry.convert and expected_num_experts to mirror
_setup's fallback order by resolving num_experts as gate.num_experts →
moe_block.num_experts (self.num_experts) → moe_block.experts.num_experts, then
assert converted.expert_token_count.shape equals (resolved_num_experts,) and
dtype/zero-content as before so the assertion reflects the same resolution logic
used in _setup.
- Around line 295-310: The test accesses converted.gate.in_features which breaks
for transformers ≥5.0 where converted.gate is a TopKRouter; update the
hidden_size/top_k retrieval to branch on TRANSFORMERS_VERSION_GE_5_0 (same
pattern used earlier lines 264–284): when TRANSFORMERS_VERSION_GE_5_0 use the
TopKRouter's input-size attribute and top_k attribute as used previously,
otherwise use converted.gate.in_features and converted.gate.top_k; ensure you
reference converted.gate and converted.top_k (or converted.gate.top_k)
consistently so the subsequent tensor x creation and top_k calculation work for
both versions.

---

Outside diff comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 490-506: The current forward branch for
TRANSFORMERS_VERSION_GE_5_0 wrongly assumes self.gate always exposes top_k and
num_experts; replace the assert with a conditional that checks hasattr(self,
"gate") and hasattr(self.gate, "top_k") and hasattr(self.gate, "num_experts")
and only then run the "gate" path (save/modify gate.top_k, call super().forward,
restore); otherwise fall back to the legacy path that uses
self.top_k/self.num_experts/self.experts (same logic as the else branch). Update
the forward method around the TRANSFORMERS_VERSION_GE_5_0 check to guard on the
gate attribute structure instead of unconditionally taking the v5 branch.

---

Nitpick comments:
In `@modelopt/torch/export/moe_utils.py`:
- Line 73: The call to Path.write_text when saving the generated HTML uses the
system locale by default; update the write operation that calls
output_path.write_text(html_content) to explicitly pass encoding="utf-8" so the
saved file matches the HTML meta charset and avoids non-ASCII issues—locate the
write in moe_utils.py where html_content is written to output_path and add the
encoding argument to the write_text call.

Comment on lines +41 to +52
num_experts = rows[0][1].shape[0]
html_parts = [
"<html><head><style>",
"table { border-collapse: collapse; font-family: monospace; }",
"th, td { border: 1px solid #ccc; padding: 4px 8px; text-align: right; }",
"th { background: #f0f0f0; }",
"</style></head><body>",
"<h2>Expert Token Counts (per MoE layer)</h2>",
"<table><tr><th>Layer/Expert</th>",
]
html_parts.extend(f"<th>{i}</th>" for i in range(num_experts))
html_parts.append("</tr>")

⚠️ Potential issue | 🟡 Minor

Assumes all MoE layers have the same number of experts.

num_experts = rows[0][1].shape[0] is used to build the table header once, but subsequent rows each emit one <td> per their own expert count. If any two MoE layers in a model have different expert counts (e.g., fine-grained vs. shared experts in DeepSeek-style models, or future heterogeneous architectures), the HTML table will have misaligned columns — rows with fewer experts than the header will be missing cells, and rows with more will overflow.

🛡️ Proposed fix: build header dynamically from max width
-num_experts = rows[0][1].shape[0]
+num_experts = max(counts.shape[0] for _, counts in rows)
 html_parts = [
     ...
     "<table><tr><th>Layer/Expert</th>",
 ]
 html_parts.extend(f"<th>{i}</th>" for i in range(num_experts))

With this fix, each row loop may also need padding:

 for c in counts.tolist():
     ...
     html_parts.append(f"<td{style}>{c}</td>")
+# Pad missing expert columns for layers with fewer experts
+for _ in range(num_experts - len(counts)):
+    html_parts.append("<td>-</td>")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/moe_utils.py` around lines 41 - 52, The code assumes a
constant expert count by using num_experts = rows[0][1].shape[0] to build the
table header, which breaks when MoE layers have different expert counts; change
the header generation to compute the maximum expert count across rows (e.g.,
max(r[1].shape[0] for r in rows)) and use that to create html_parts header
cells, and when emitting each row (the loop that writes per-layer <td> cells)
pad shorter rows with empty <td></td> cells (or colspan equivalently) up to that
max so every row has the same number of columns and the table remains aligned.
Ensure you update references to num_experts accordingly and keep
html_parts.extend and row emission logic consistent.

Comment on lines 1004 to 1007
try:
post_state_dict, hf_quant_config = _export_transformers_checkpoint(model, dtype)

save_expert_token_count_table(model, export_dir)

⚠️ Potential issue | 🟠 Major

save_expert_token_count_table failure aborts model export.

The call sits inside the try block alongside _export_transformers_checkpoint. If writing .moe.html fails for any reason (disk full, permission error, unexpected exception for an edge-case module), the exception propagates to the outer except, which warns and re-raises — preventing model.save_pretrained from running. A diagnostic HTML report should not gate the primary export operation.

♻️ Proposed fix: isolate the diagnostic step
+        try:
             save_expert_token_count_table(model, export_dir)
+        except Exception as report_err:
+            warnings.warn(
+                f"Failed to save expert token count table: {report_err}. "
+                "Model export will continue."
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/unified_export_hf.py` around lines 1004 - 1007, The
call to save_expert_token_count_table is inside the same try as
_export_transformers_checkpoint so any failure aborts the whole export and
prevents model.save_pretrained from running; move or guard this diagnostic write
so it cannot raise out of the main export flow — either call
save_expert_token_count_table after the try/except that handles
_export_transformers_checkpoint, or wrap
save_expert_token_count_table(export_dir, model) in its own try/except that
catches Exception, logs a non-fatal warning (including the exception), and
continues; ensure post_state_dict and hf_quant_config from
_export_transformers_checkpoint remain unaffected and model.save_pretrained is
always called even if the .moe.html write fails.

Comment on lines 452 to +465
def _setup(self):
pass
num_experts = 0
if hasattr(self, "gate") and hasattr(self.gate, "num_experts"):
num_experts = self.gate.num_experts
elif hasattr(self, "num_experts"):
num_experts = self.num_experts
elif hasattr(self, "experts") and hasattr(self.experts, "num_experts"):
num_experts = self.experts.num_experts

self.expert_token_count = torch.zeros(num_experts, dtype=torch.long, device="cpu")
self._count_expert_tokens = False

if hasattr(self, "gate"):
self.gate.register_forward_hook(self._gate_forward_hook)

⚠️ Potential issue | 🟡 Minor

Silent expert_token_count of shape (0,) when num_experts cannot be resolved.

If none of the three attribute lookups (gate.num_experts, self.num_experts, experts.num_experts) succeeds, num_experts stays 0 and expert_token_count is silently initialized to an empty tensor. The save_expert_token_count_table function will silently skip this layer (due to the numel() > 0 guard), giving no indication that the layer's routing was not tracked. A warning here would help debugging.

🛡️ Proposed fix
+import warnings
 ...
 self.expert_token_count = torch.zeros(num_experts, dtype=torch.long, device="cpu")
+if num_experts == 0:
+    warnings.warn(
+        f"Could not determine num_experts for {type(self).__name__}; "
+        "expert_token_count will not be tracked.",
+        stacklevel=2,
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 452 - 465,
The _setup method can silently create an empty expert_token_count when
num_experts remains 0; update _setup (method name: _setup, symbols: self.gate,
gate.num_experts, self.num_experts, self.experts, experts.num_experts,
expert_token_count, save_expert_token_count_table) to detect when num_experts ==
0 after the three lookups and emit a clear warning (e.g., warnings.warn or
self.logger.warning) indicating the layer's routing won't be tracked and naming
the module (use self.__class__.__name__ or similar), then proceed (or skip
registering the hook) so callers know to fix the model instead of silently
losing tracking.

Comment on lines +193 to +199

converted = QuantModuleRegistry.convert(moe_block)
assert hasattr(converted, "expert_token_count")
expected_num_experts = moe_block.num_experts if hasattr(moe_block, "num_experts") else 0
assert converted.expert_token_count.shape == (expected_num_experts,)
assert converted.expert_token_count.dtype == torch.long
assert (converted.expert_token_count == 0).all()

⚠️ Potential issue | 🟡 Minor

expected_num_experts doesn't mirror _setup's fallback chain, creating a weak assertion.

_setup resolves num_experts in this order: gate.num_experts → self.num_experts → experts.num_experts. The test at Line 196 only checks moe_block.num_experts (block-level). For Qwen3 MoE on transformers ≥ 5.0, if num_experts lives on gate.num_experts and not at block level, expected_num_experts becomes 0 and the assertion converted.expert_token_count.shape == (0,) vacuously passes even when _setup correctly initialized a non-zero count.

🛡️ Proposed fix: mirror the _setup resolution order
-expected_num_experts = moe_block.num_experts if hasattr(moe_block, "num_experts") else 0
+if hasattr(moe_block, "gate") and hasattr(moe_block.gate, "num_experts"):
+    expected_num_experts = moe_block.gate.num_experts
+elif hasattr(moe_block, "num_experts"):
+    expected_num_experts = moe_block.num_experts
+elif hasattr(moe_block, "experts") and hasattr(moe_block.experts, "num_experts"):
+    expected_num_experts = moe_block.experts.num_experts
+else:
+    expected_num_experts = 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/torch/quantization/plugins/test_sparse_moe.py` around lines 193 -
199, The test's expected_num_experts currently only checks moe_block.num_experts
which doesn't follow the _setup resolution; update the test around
QuantModuleRegistry.convert and expected_num_experts to mirror _setup's fallback
order by resolving num_experts as gate.num_experts → moe_block.num_experts
(self.num_experts) → moe_block.experts.num_experts, then assert
converted.expert_token_count.shape equals (resolved_num_experts,) and
dtype/zero-content as before so the assertion reflects the same resolution logic
used in _setup.

Comment on lines +295 to +310
converted = QuantModuleRegistry.convert(moe_block)

# Reset counts and enable counting
converted.expert_token_count.zero_()
converted._count_expert_tokens = True

hidden_size = converted.gate.in_features
x = torch.randn(8, hidden_size)
with torch.no_grad():
converted.gate(x)

# After one gate call with counting enabled, total assigned tokens should equal
# num_tokens * top_k
top_k = converted.top_k if hasattr(converted, "top_k") else converted.gate.top_k
total_assigned = converted.expert_token_count.sum().item()
assert total_assigned == 8 * top_k

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, check if the test file exists and read the relevant section
fd -t f "test_sparse_moe.py" --exec cat -n {} \; | head -350 | tail -100

Repository: NVIDIA/Model-Optimizer

Length of output: 4523


🏁 Script executed:

# Search for how gate is set up and what type it might be
rg "converted\.gate|TopKRouter" -A 3 -B 3 tests/

Repository: NVIDIA/Model-Optimizer

Length of output: 6136


🏁 Script executed:

# Check if TRANSFORMERS_VERSION_GE_5_0 is defined in the codebase
rg "TRANSFORMERS_VERSION_GE_5_0" --type py

Repository: NVIDIA/Model-Optimizer

Length of output: 591


🏁 Script executed:

# Check imports in the test file to understand dependencies
fd -t f "test_sparse_moe.py" -exec head -50 {} \;

Repository: NVIDIA/Model-Optimizer

Length of output: 296


🏁 Script executed:

# Find _FakeGateWithRouter class definition to understand TopKRouter interface
rg "_FakeGateWithRouter" -A 20 tests/unit/torch/quantization/plugins/test_sparse_moe.py

Repository: NVIDIA/Model-Optimizer

Length of output: 1497


🏁 Script executed:

# Check if there's a way to get hidden_size from TopKRouter elsewhere in codebase
rg "gate\." tests/unit/torch/quantization/plugins/test_sparse_moe.py | head -20

Repository: NVIDIA/Model-Optimizer

Length of output: 815


🏁 Script executed:

# Search for how gate's hidden size is obtained in other parts of the codebase
rg "hidden.*size|in_features" tests/unit/torch/quantization/plugins/test_sparse_moe.py

Repository: NVIDIA/Model-Optimizer

Length of output: 154


🏁 Script executed:

# Check the sparse_moe.py implementation file to see how gate is handled
fd -t f "sparse_moe.py" -path "*/quantization/plugins/*" --exec head -100 {} \;

Repository: NVIDIA/Model-Optimizer

Length of output: 234


🏁 Script executed:

# Find sparse_moe.py files
find . -name "sparse_moe.py" -type f

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🏁 Script executed:

# Check the actual implementation to see how gate is created and used
rg "class.*MoE|def convert" modelopt/torch/quantization/plugins/ -A 5 | grep -A 20 "sparse"

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🏁 Script executed:

# Look at the implementation to understand gate structure in v5.0+
rg "in_features|hidden_size" modelopt/torch/quantization/plugins/ --type py | grep -i gate

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🏁 Script executed:

# Search for sparse_moe implementation files
find . -type f -name "*.py" | xargs grep -l "sparse.*moe\|SparseMoE" | head -10

Repository: NVIDIA/Model-Optimizer

Length of output: 258


🏁 Script executed:

# Look for the implementation that wraps/converts the MoE block
rg "class.*MoE.*Block|convert.*moe" modelopt/torch/quantization/plugins/ -B 2 -A 10 --type py

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🏁 Script executed:

# Check what attributes TopKRouter should have by looking at how it's used throughout
rg "gate\.in_features|gate\.hidden" . --type py

Repository: NVIDIA/Model-Optimizer

Length of output: 173


🏁 Script executed:

# Look at what the converted MoE block looks like - search for expert_token_count
rg "expert_token_count" . --type py -B 5 -A 5

Repository: NVIDIA/Model-Optimizer

Length of output: 13178


🏁 Script executed:

# Look at huggingface.py for how they handle gate attributes
head -150 modelopt/torch/quantization/plugins/huggingface.py

Repository: NVIDIA/Model-Optimizer

Length of output: 5604


🏁 Script executed:

# Check if there's any documentation or comments about TopKRouter's attributes
rg "TopKRouter|linear\." modelopt/torch/quantization/plugins/huggingface.py -B 2 -A 2

Repository: NVIDIA/Model-Optimizer

Length of output: 1084


🏁 Script executed:

# Check what the actual converted gate looks like - look for wrap/conversion logic
rg "def convert|class.*QuantMoE" modelopt/torch/quantization/plugins/huggingface.py -A 10

Repository: NVIDIA/Model-Optimizer

Length of output: 1216


🏁 Script executed:

# Check the _FakeGateWithRouter more carefully to understand what attributes TopKRouter should have
rg "_FakeGateWithRouter" -A 30 tests/unit/torch/quantization/plugins/test_sparse_moe.py | head -50

Repository: NVIDIA/Model-Optimizer

Length of output: 1746


🏁 Script executed:

# Look for how hidden_size or input size is accessed elsewhere in the test file
rg "hidden_size|hidden.*state|input.*size" tests/unit/torch/quantization/plugins/test_sparse_moe.py -B 2 -A 2

Repository: NVIDIA/Model-Optimizer

Length of output: 1183


🏁 Script executed:

# Check how the gate is used in the actual implementation - what shape does it expect
rg "gate.*forward\|gate\(" modelopt/torch/quantization/plugins/huggingface.py -B 3 -A 3

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🏁 Script executed:

# Look at the test imports to understand the version
head -50 tests/unit/torch/quantization/plugins/test_sparse_moe.py

Repository: NVIDIA/Model-Optimizer

Length of output: 1799


🏁 Script executed:

# Look at what the actual Qwen3 model looks like when it's loaded
rg "get_tiny_qwen3_moe" tests/ -A 5 | head -30

Repository: NVIDIA/Model-Optimizer

Length of output: 2612


🏁 Script executed:

# Search for actual TopKRouter usage in transformers or the codebase with real models
rg "from transformers" modelopt/torch/quantization/plugins/huggingface.py | head -10

Repository: NVIDIA/Model-Optimizer

Length of output: 600


🏁 Script executed:

# Check if there's a way to get the hidden dimension from gate - look for other gate attribute accesses
rg "gate\." modelopt/torch/quantization/plugins/huggingface.py | grep -v "hasattr\|gate\.top_k\|gate\.num_experts"

Repository: NVIDIA/Model-Optimizer

Length of output: 134


🏁 Script executed:

# Look at the full test to understand the context better - what does converted look like
sed -n '238,320p' tests/unit/torch/quantization/plugins/test_sparse_moe.py

Repository: NVIDIA/Model-Optimizer

Length of output: 3163


🏁 Script executed:

# Search for actual TopKRouter implementation or how it's used with hidden_size
rg "class.*TopKRouter|hidden_size" modelopt/ --type py -B 2 -A 5

Repository: NVIDIA/Model-Optimizer

Length of output: 50379


🏁 Script executed:

# Check the _get_moe_block helper to understand what model is being used
rg "_get_moe_block" tests/unit/torch/quantization/plugins/test_sparse_moe.py -A 10

Repository: NVIDIA/Model-Optimizer

Length of output: 3095


🏁 Script executed:

# Look for config attributes that might hold hidden size
rg "config\.|hidden.*dim" tests/unit/torch/quantization/plugins/test_sparse_moe.py

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🏁 Script executed:

# Search for how the hidden size is determined elsewhere when dealing with gates
rg "hidden.*size|in_features" modelopt/torch/quantization/plugins/ --type py -B 3 -A 3

Repository: NVIDIA/Model-Optimizer

Length of output: 11026


🏁 Script executed:

# Let me look at how converted.gate is used in other parts of the test to infer its structure
sed -n '1,100p' tests/unit/torch/quantization/plugins/test_sparse_moe.py | tail -50

Repository: NVIDIA/Model-Optimizer

Length of output: 1929


🏁 Script executed:

# Check what the actual MoE block structure looks like by searching for how hidden_size is obtained elsewhere
rg "moe_block\.|converted\." tests/unit/torch/quantization/plugins/test_sparse_moe.py | grep -v "assert\|hasattr\|if"

Repository: NVIDIA/Model-Optimizer

Length of output: 637


🏁 Script executed:

# Search for how gate input size is determined in the implementation
rg "gate.*\(" modelopt/torch/quantization/plugins/huggingface.py -B 5 | head -40

Repository: NVIDIA/Model-Optimizer

Length of output: 1675


Add TRANSFORMERS_VERSION_GE_5_0 check before accessing gate attributes at line 301.

Line 301 accesses converted.gate.in_features without version checking. For transformers ≥ 5.0, converted.gate is a TopKRouter (not nn.Linear), so in_features is not a direct attribute. The test already uses this version check pattern at lines 264–284 for handling the same gate type difference; apply the same pattern here to get the input size correctly for both versions.

The exact attribute on TopKRouter that holds the input size should be verified from your transformers installation, but the structure should match the pattern already in use for top_k access:

-hidden_size = converted.gate.in_features
+if TRANSFORMERS_VERSION_GE_5_0:
+    hidden_size = converted.gate.linear.in_features  # or appropriate TopKRouter attr
+else:
+    hidden_size = converted.gate.in_features
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/torch/quantization/plugins/test_sparse_moe.py` around lines 295 -
310, The test accesses converted.gate.in_features which breaks for transformers
≥5.0 where converted.gate is a TopKRouter; update the hidden_size/top_k
retrieval to branch on TRANSFORMERS_VERSION_GE_5_0 (same pattern used earlier
lines 264–284): when TRANSFORMERS_VERSION_GE_5_0 use the TopKRouter's input-size
attribute and top_k attribute as used previously, otherwise use
converted.gate.in_features and converted.gate.top_k; ensure you reference
converted.gate and converted.top_k (or converted.gate.top_k) consistently so the
subsequent tensor x creation and top_k calculation work for both versions.
