Conversation
📝 Walkthrough

The PR introduces MoE expert token counting and visualization capabilities by adding a utility to generate HTML reports of per-expert token distribution, integrating token tracking into the MoE quantization pipeline with calibration support, implementing MoE block detection, and providing comprehensive test coverage for the new functionality.
Sequence Diagram

sequenceDiagram
participant Model as Model
participant MoE as _QuantSparseMoe
participant Gate as Gate/Router
participant Hook as Forward Hook
participant Export as HF Export
Model->>MoE: forward(is_calib=True)
MoE->>MoE: Expand gate.top_k to num_experts
MoE->>Gate: forward(all tokens to all experts)
Gate->>Hook: trigger forward hook
Hook->>Hook: accumulate per-expert token counts
Hook-->>MoE: return
MoE->>MoE: Restore original gate.top_k
MoE->>Gate: forward(normal routing with is_calib=False)
Gate-->>MoE: return routed output
MoE-->>Model: return output
Model->>Export: export_hf_checkpoint(model)
Export->>MoE: read expert_token_count
Export->>Export: generate HTML token distribution table
Export->>Export: write .moe.html file
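The counting step in the diagram can be pictured with a small, self-contained toy. This is only a sketch: `ToyMoE`, its linear gate, and the hook body are hypothetical illustrations of the pattern (a forward hook on the gate accumulating per-expert token counts), not the actual `_QuantSparseMoe` implementation.

```python
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    """Hypothetical stand-in for an MoE block whose gate emits router logits."""

    def __init__(self, hidden_size=16, num_experts=4, top_k=2):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts)
        self.top_k = top_k
        self.num_experts = num_experts
        self.expert_token_count = torch.zeros(num_experts, dtype=torch.long)
        self._count_expert_tokens = True
        self.gate.register_forward_hook(self._gate_forward_hook)

    def _gate_forward_hook(self, module, inputs, output):
        if not self._count_expert_tokens:
            return
        # output: router logits of shape (num_tokens, num_experts)
        _, selected = torch.topk(output, self.top_k, dim=-1)
        self.expert_token_count += torch.bincount(
            selected.flatten(), minlength=self.num_experts
        )

moe = ToyMoE()
with torch.no_grad():
    moe.gate(torch.randn(8, 16))
# Each of the 8 tokens is assigned to top_k experts, so counts sum to 8 * top_k.
assert moe.expert_token_count.sum().item() == 8 * moe.top_k
```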
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## main #900 +/- ##
==========================================
- Coverage 73.74% 73.54% -0.21%
==========================================
Files 199 205 +6
Lines 21183 22000 +817
==========================================
+ Hits 15621 16179 +558
- Misses 5562 5821 +259

☔ View full report in Codecov by Sentry.
Force-pushed 1dd5bb1 to 578ede0, then 578ede0 to 919be0f.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/quantization/plugins/huggingface.py (1)
Lines 490-506: ⚠️ Potential issue | 🟠 Major

`assert hasattr(self, "gate")` will fire for fallback-detected blocks on transformers ≥ 5.0.

`_is_sparse_moe_block` has two detection paths:

- Primary: `gate.top_k` + `gate.num_experts` — gate is present.
- Fallback: `block.top_k` + `block.num_experts` (no `gate` required, e.g. `gate` is a plain `nn.Linear`).

When a module is registered via the fallback path and `TRANSFORMERS_VERSION_GE_5_0 = True`, the forward enters the >= 5.0 branch at Line 490 which immediately asserts `hasattr(self, "gate")`. A plain-linear gate module satisfies the assertion, but its `gate.top_k` and `gate.num_experts` attributes won't exist, causing `AttributeError` on Line 494. The >= 5.0 branch should only be taken for modules where `gate` genuinely exposes `top_k`/`num_experts` (the primary detection path).

🛡️ Proposed fix: guard v5.0 branch on gate attribute structure

-if TRANSFORMERS_VERSION_GE_5_0:
-    assert hasattr(self, "gate")
+if TRANSFORMERS_VERSION_GE_5_0 and hasattr(self, "gate") and hasattr(self.gate, "top_k"):
     # Path for transformers >= 5.0

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 490 - 506, The current forward branch for TRANSFORMERS_VERSION_GE_5_0 wrongly assumes self.gate always exposes top_k and num_experts; replace the assert with a conditional that checks hasattr(self, "gate") and hasattr(self.gate, "top_k") and hasattr(self.gate, "num_experts") and only then run the "gate" path (save/modify gate.top_k, call super().forward, restore); otherwise fall back to the legacy path that uses self.top_k/self.num_experts/self.experts (same logic as the else branch). Update the forward method around the TRANSFORMERS_VERSION_GE_5_0 check to guard on the gate attribute structure instead of unconditionally taking the v5 branch.
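The guarded calibration path the reviewer describes can be exercised with a standalone toy: take the "gate owns routing" branch only when the gate actually exposes `top_k`/`num_experts`, and always restore the saved `top_k`. Everything below (`Router`, `CalibratedMoE`, the `is_calib` flag) is a hypothetical sketch, not the actual `_QuantSparseMoe.forward`.

```python
import torch
import torch.nn as nn

class Router(nn.Linear):
    """Toy gate that, like newer HF routers, carries top_k/num_experts itself."""

    def __init__(self, hidden_size, num_experts, top_k):
        super().__init__(hidden_size, num_experts)
        self.top_k = top_k
        self.num_experts = num_experts

class CalibratedMoE(nn.Module):
    def __init__(self, gate):
        super().__init__()
        self.gate = gate

    def forward(self, x, is_calib=False):
        gate_owns_routing = hasattr(self.gate, "top_k") and hasattr(self.gate, "num_experts")
        if is_calib and gate_owns_routing:
            saved_top_k = self.gate.top_k
            self.gate.top_k = self.gate.num_experts  # route every token to every expert
            try:
                self.gate(x)  # calibration pass; counting hooks would fire here
            finally:
                self.gate.top_k = saved_top_k  # always restore normal routing
        return self.gate(x)  # normal forward (routing to experts elided in this toy)

moe = CalibratedMoE(Router(16, 4, 2))        # gate exposes routing attrs -> calibration path
moe_plain = CalibratedMoE(nn.Linear(16, 4))  # plain nn.Linear gate -> guard skips that path
_ = moe(torch.randn(8, 16), is_calib=True)
_ = moe_plain(torch.randn(8, 16), is_calib=True)
```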
🧹 Nitpick comments (1)
modelopt/torch/export/moe_utils.py (1)
Line 73: Specify `encoding` in `write_text` for cross-platform safety.

`Path.write_text` defaults to the locale encoding (which can be non-UTF-8 on Windows). Since the HTML content may contain model-layer names with non-ASCII characters in principle, explicitly specifying `encoding="utf-8"` is the safer choice and aligns with the `<meta charset>` expectation of any browser opening the file.

♻️ Proposed fix

-output_path.write_text(html_content)
+output_path.write_text(html_content, encoding="utf-8")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/moe_utils.py` at line 73, The call to Path.write_text when saving the generated HTML uses the system locale by default; update the write operation that calls output_path.write_text(html_content) to explicitly pass encoding="utf-8" so the saved file matches the HTML meta charset and avoids non-ASCII issues—locate the write in moe_utils.py where html_content is written to output_path and add the encoding argument to the write_text call.
num_experts = rows[0][1].shape[0]
html_parts = [
    "<html><head><style>",
    "table { border-collapse: collapse; font-family: monospace; }",
    "th, td { border: 1px solid #ccc; padding: 4px 8px; text-align: right; }",
    "th { background: #f0f0f0; }",
    "</style></head><body>",
    "<h2>Expert Token Counts (per MoE layer)</h2>",
    "<table><tr><th>Layer/Expert</th>",
]
html_parts.extend(f"<th>{i}</th>" for i in range(num_experts))
html_parts.append("</tr>")
Assumes all MoE layers have the same number of experts.
num_experts = rows[0][1].shape[0] is used to build the table header once, but subsequent rows each emit one <td> per their own expert count. If any two MoE layers in a model have different expert counts (e.g., fine-grained vs. shared experts in DeepSeek-style models, or future heterogeneous architectures), the HTML table will have misaligned columns — rows with fewer experts than the header will be missing cells, and rows with more will overflow.
🛡️ Proposed fix: build header dynamically from max width
-num_experts = rows[0][1].shape[0]
+num_experts = max(counts.shape[0] for _, counts in rows)
html_parts = [
...
"<table><tr><th>Layer/Expert</th>",
]
html_parts.extend(f"<th>{i}</th>" for i in range(num_experts))With this fix, each row loop may also need padding:
for c in counts.tolist():
...
html_parts.append(f"<td{style}>{c}</td>")
+# Pad missing expert columns for layers with fewer experts
+for _ in range(num_experts - len(counts)):
+    html_parts.append("<td>-</td>")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/export/moe_utils.py` around lines 41 - 52, The code assumes a
constant expert count by using num_experts = rows[0][1].shape[0] to build the
table header, which breaks when MoE layers have different expert counts; change
the header generation to compute the maximum expert count across rows (e.g.,
max(r[1].shape[0] for r in rows)) and use that to create html_parts header
cells, and when emitting each row (the loop that writes per-layer <td> cells)
pad shorter rows with empty <td></td> cells (or colspan equivalently) up to that
max so every row has the same number of columns and the table remains aligned.
Ensure you update references to num_experts accordingly and keep
html_parts.extend and row emission logic consistent.
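A self-contained sketch of the padded-table idea from the comment above. `build_expert_count_table` and its argument shape are hypothetical names for illustration; the real code lives in modelopt/torch/export/moe_utils.py and may format things differently.

```python
import torch

def build_expert_count_table(rows: list[tuple[str, torch.Tensor]]) -> str:
    """Render (layer_name, counts) pairs as an HTML table, padding to the widest layer."""
    num_experts = max(counts.shape[0] for _, counts in rows)
    parts = ["<table><tr><th>Layer/Expert</th>"]
    parts.extend(f"<th>{i}</th>" for i in range(num_experts))
    parts.append("</tr>")
    for name, counts in rows:
        parts.append(f"<tr><td>{name}</td>")
        parts.extend(f"<td>{c}</td>" for c in counts.tolist())
        # Layers with fewer experts get placeholder cells so columns stay aligned.
        parts.extend("<td>-</td>" for _ in range(num_experts - counts.shape[0]))
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)

rows = [("layers.0.mlp", torch.tensor([5, 3, 0, 8])), ("layers.1.mlp", torch.tensor([4, 4]))]
html = build_expert_count_table(rows)
```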
try:
    post_state_dict, hf_quant_config = _export_transformers_checkpoint(model, dtype)

    save_expert_token_count_table(model, export_dir)
save_expert_token_count_table failure aborts model export.
The call sits inside the try block alongside _export_transformers_checkpoint. If writing .moe.html fails for any reason (disk full, permission error, unexpected exception for an edge-case module), the exception propagates to the outer except, which warns and re-raises — preventing model.save_pretrained from running. A diagnostic HTML report should not gate the primary export operation.
♻️ Proposed fix: isolate the diagnostic step
+ try:
save_expert_token_count_table(model, export_dir)
+ except Exception as report_err:
+ warnings.warn(
+ f"Failed to save expert token count table: {report_err}. "
+ "Model export will continue."
+    )

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/export/unified_export_hf.py` around lines 1004 - 1007, The
call to save_expert_token_count_table is inside the same try as
_export_transformers_checkpoint so any failure aborts the whole export and
prevents model.save_pretrained from running; move or guard this diagnostic write
so it cannot raise out of the main export flow — either call
save_expert_token_count_table after the try/except that handles
_export_transformers_checkpoint, or wrap
save_expert_token_count_table(model, export_dir) in its own try/except that
catches Exception, logs a non-fatal warning (including the exception), and
continues; ensure post_state_dict and hf_quant_config from
_export_transformers_checkpoint remain unaffected and model.save_pretrained is
always called even if the .moe.html write fails.
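One way to apply the suggestion, sketched with made-up placeholder names (`write_diagnostic_report`, `save_fn`) rather than the actual modelopt API: the diagnostic write gets its own try/except, and the primary save always runs.

```python
import warnings
from pathlib import Path

def write_diagnostic_report(export_dir: str, html: str) -> None:
    # Explicit UTF-8 also addresses the write_text nitpick above.
    (Path(export_dir) / "report.moe.html").write_text(html, encoding="utf-8")

def export_with_optional_report(export_dir: str, html: str, save_fn) -> None:
    try:
        write_diagnostic_report(export_dir, html)
    except Exception as err:
        # Never let a diagnostic artifact block the checkpoint write.
        warnings.warn(f"Failed to write MoE report: {err}. Export continues.")
    save_fn(export_dir)  # the primary export always runs
```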
 def _setup(self):
-    pass
+    num_experts = 0
+    if hasattr(self, "gate") and hasattr(self.gate, "num_experts"):
+        num_experts = self.gate.num_experts
+    elif hasattr(self, "num_experts"):
+        num_experts = self.num_experts
+    elif hasattr(self, "experts") and hasattr(self.experts, "num_experts"):
+        num_experts = self.experts.num_experts
+
+    self.expert_token_count = torch.zeros(num_experts, dtype=torch.long, device="cpu")
+    self._count_expert_tokens = False
+
+    if hasattr(self, "gate"):
+        self.gate.register_forward_hook(self._gate_forward_hook)
Silent expert_token_count of shape (0,) when num_experts cannot be resolved.
If none of the three attribute lookups (gate.num_experts, self.num_experts, experts.num_experts) succeeds, num_experts stays 0 and expert_token_count is silently initialized to an empty tensor. The save_expert_token_count_table function will silently skip this layer (due to the numel() > 0 guard), giving no indication that the layer's routing was not tracked. A warning here would help debugging.
🛡️ Proposed fix
+import warnings
...
self.expert_token_count = torch.zeros(num_experts, dtype=torch.long, device="cpu")
+if num_experts == 0:
+ warnings.warn(
+ f"Could not determine num_experts for {type(self).__name__}; "
+ "expert_token_count will not be tracked.",
+ stacklevel=2,
+    )

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 452 - 465,
The _setup method can silently create an empty expert_token_count when
num_experts remains 0; update _setup (method name: _setup, symbols: self.gate,
gate.num_experts, self.num_experts, self.experts, experts.num_experts,
expert_token_count, save_expert_token_count_table) to detect when num_experts ==
0 after the three lookups and emit a clear warning (e.g., warnings.warn or
self.logger.warning) indicating the layer's routing won't be tracked and naming
the module (use self.__class__.__name__ or similar), then proceed (or skip
registering the hook) so callers know to fix the model instead of silently
losing tracking.
converted = QuantModuleRegistry.convert(moe_block)
assert hasattr(converted, "expert_token_count")
expected_num_experts = moe_block.num_experts if hasattr(moe_block, "num_experts") else 0
assert converted.expert_token_count.shape == (expected_num_experts,)
assert converted.expert_token_count.dtype == torch.long
assert (converted.expert_token_count == 0).all()
expected_num_experts doesn't mirror _setup's fallback chain, creating a weak assertion.
_setup resolves num_experts in this order: gate.num_experts → self.num_experts → experts.num_experts. The test at Line 196 only checks moe_block.num_experts (block-level). For Qwen3 MoE on transformers ≥ 5.0, if num_experts lives on gate.num_experts and not at block level, expected_num_experts becomes 0 and the assertion converted.expert_token_count.shape == (0,) vacuously passes even when _setup correctly initialized a non-zero count.
🛡️ Proposed fix: mirror the _setup resolution order
-expected_num_experts = moe_block.num_experts if hasattr(moe_block, "num_experts") else 0
+if hasattr(moe_block, "gate") and hasattr(moe_block.gate, "num_experts"):
+ expected_num_experts = moe_block.gate.num_experts
+elif hasattr(moe_block, "num_experts"):
+ expected_num_experts = moe_block.num_experts
+elif hasattr(moe_block, "experts") and hasattr(moe_block.experts, "num_experts"):
+ expected_num_experts = moe_block.experts.num_experts
+else:
+    expected_num_experts = 0

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/torch/quantization/plugins/test_sparse_moe.py` around lines 193 -
199, The test's expected_num_experts currently only checks moe_block.num_experts
which doesn't follow the _setup resolution; update the test around
QuantModuleRegistry.convert and expected_num_experts to mirror _setup's fallback
order by resolving num_experts as gate.num_experts → moe_block.num_experts
(self.num_experts) → moe_block.experts.num_experts, then assert
converted.expert_token_count.shape equals (resolved_num_experts,) and
dtype/zero-content as before so the assertion reflects the same resolution logic
used in _setup.
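The fallback chain that both `_setup` and the test should agree on can be factored into a tiny helper. `resolve_num_experts` is a hypothetical name used only for this sketch; it is not part of the PR.

```python
def resolve_num_experts(block) -> int:
    """Mirror the resolution order: gate.num_experts -> block.num_experts -> experts.num_experts."""
    gate = getattr(block, "gate", None)
    if gate is not None and hasattr(gate, "num_experts"):
        return gate.num_experts
    if hasattr(block, "num_experts"):
        return block.num_experts
    experts = getattr(block, "experts", None)
    if experts is not None and hasattr(experts, "num_experts"):
        return experts.num_experts
    return 0
```

In the test, `expected_num_experts = resolve_num_experts(moe_block)` would then track whatever `_setup` actually resolved.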
converted = QuantModuleRegistry.convert(moe_block)

# Reset counts and enable counting
converted.expert_token_count.zero_()
converted._count_expert_tokens = True

hidden_size = converted.gate.in_features
x = torch.randn(8, hidden_size)
with torch.no_grad():
    converted.gate(x)

# After one gate call with counting enabled, total assigned tokens should equal
# num_tokens * top_k
top_k = converted.top_k if hasattr(converted, "top_k") else converted.gate.top_k
total_assigned = converted.expert_token_count.sum().item()
assert total_assigned == 8 * top_k
Add TRANSFORMERS_VERSION_GE_5_0 check before accessing gate attributes at line 301.
Line 301 accesses converted.gate.in_features without version checking. For transformers ≥ 5.0, converted.gate is a TopKRouter (not nn.Linear), so in_features is not a direct attribute. The test already uses this version check pattern at lines 264–284 for handling the same gate type difference; apply the same pattern here to get the input size correctly for both versions.
The exact attribute on TopKRouter that holds the input size should be verified from your transformers installation, but the structure should match the pattern already in use for top_k access:
-hidden_size = converted.gate.in_features
+if TRANSFORMERS_VERSION_GE_5_0:
+ hidden_size = converted.gate.linear.in_features # or appropriate TopKRouter attr
+else:
+    hidden_size = converted.gate.in_features

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/torch/quantization/plugins/test_sparse_moe.py` around lines 295 -
310, The test accesses converted.gate.in_features which breaks for transformers
≥5.0 where converted.gate is a TopKRouter; update the hidden_size/top_k
retrieval to branch on TRANSFORMERS_VERSION_GE_5_0 (same pattern used earlier
lines 264–284): when TRANSFORMERS_VERSION_GE_5_0 use the TopKRouter's input-size
attribute and top_k attribute as used previously, otherwise use
converted.gate.in_features and converted.gate.top_k; ensure you reference
converted.gate and converted.top_k (or converted.gate.top_k) consistently so the
subsequent tensor x creation and top_k calculation work for both versions.
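If the exact TopKRouter attribute turns out to vary across transformers releases, a defensive helper can keep the test readable. The fallbacks below are assumptions (a router may wrap an inner nn.Linear, or hold a raw weight of shape (num_experts, hidden_size)); verify them against the installed transformers version before relying on this sketch.

```python
import torch.nn as nn

def gate_hidden_size(gate: nn.Module) -> int:
    if isinstance(gate, nn.Linear):
        return gate.in_features  # transformers < 5.0: plain linear gate
    inner = getattr(gate, "linear", None)
    if isinstance(inner, nn.Linear):
        return inner.in_features  # router wrapping an nn.Linear (assumed)
    weight = getattr(gate, "weight", None)
    if weight is not None:
        return weight.shape[-1]  # raw (num_experts, hidden_size) weight (assumed)
    raise AttributeError(f"Cannot infer hidden size from {type(gate).__name__}")
```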
What does this PR do?
Type of change: New feature, new tests
Overview: Replace hardcoded per-model MoE class registrations (Mixtral, Qwen2Moe, Qwen3Moe, Qwen3Next, Llama4TextMoe, Qwen3VLMoe, MiniMaxM2, etc.) with a single generic auto-detection mechanism (register_sparse_moe_on_the_fly) that walks the model tree and identifies MoE blocks by their structural attributes (gate + experts with top_k/num_experts). This makes MoE quantization forward-compatible with new HuggingFace MoE architectures without requiring explicit registration for each model family.

Additionally, this PR:

- Saves an HTML report of per-expert token counts at export time (save_expert_token_count_table), highlighting under-utilized experts.
- Handles the topk -> top_k attribute name change for transformers >= 5.0 compatibility.

Usage
Auto-detection is transparent -- no user-facing API changes are needed. Any HuggingFace MoE model with the standard gate/experts pattern is automatically detected and quantized:

import modelopt.torch.quantization as mtq

# Any HuggingFace MoE model (Mixtral, Qwen3Moe, DeepSeek, etc.)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-30B-A3B")
mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

# During export, an .moe.html report with per-expert token counts is saved automatically
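For reference, the structural criterion described above amounts to a check like the sketch below. This is a simplified illustration only; the real register_sparse_moe_on_the_fly lives in the quantization plugins and covers more cases.

```python
import torch.nn as nn

def looks_like_sparse_moe(module: nn.Module) -> bool:
    if not hasattr(module, "experts"):
        return False
    gate = getattr(module, "gate", None)
    if gate is None:
        return False
    # Primary path: the gate itself exposes the routing attributes.
    if hasattr(gate, "top_k") and hasattr(gate, "num_experts"):
        return True
    # Fallback path: routing attributes live on the block (gate may be a plain nn.Linear).
    return hasattr(module, "top_k") and hasattr(module, "num_experts")

def find_moe_blocks(model: nn.Module) -> list[str]:
    return [name for name, m in model.named_modules() if looks_like_sparse_moe(m)]
```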
Testing
Unit tests; also tested exporting a Qwen MoE model.
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
New Features
Tests