
Sync with Microsoft ONNX Runtime - 02/03/2026#955

Closed
Jaswanth51 wants to merge 16 commits into ovep-develop from sync_msft_02032026

Conversation

@Jaswanth51

Synchronizing intel/onnxruntime ovep-develop branch with latest changes from microsoft/onnxruntime master branch.

Rishi-Dave and others added 16 commits February 27, 2026 06:58
…sion (microsoft#27469)

## Summary
- Fix Cast node naming collisions in `convert_float_to_float16` when
nodes have empty names (common in PyTorch exports)
- Fix `ALWAYS_FLOAT_INPUTS` for opset 10 Resize where scales at index 1
was unprotected
- Add dedicated test suite for float16 conversion (`test_float16.py`, 8
tests)

## Motivation
Fixes microsoft#14827

When `convert_float_to_float16` processes models with unnamed nodes
(empty `node.name`, very common in PyTorch/TensorFlow-exported ONNX
models), the generated Cast node names collide. For example, multiple
Resize nodes all produce Cast nodes named `"_input_cast_2"` and output
tensors named `"_input_cast_2"`, corrupting the graph with duplicate
names.

Additionally, the `ALWAYS_FLOAT_INPUTS` dict only protected Resize
scales at index 2 (opset 11+ layout: `[X, roi, scales, sizes]`), but
opset 10 Resize has scales at index 1 (`[X, scales]`), leaving it
unprotected.

## Changes
**`onnxruntime/python/tools/transformers/float16.py`** (11 lines
changed):
- Use unique tensor names (`input_name`/`output`) as the base for
generated Cast node and output names, instead of potentially-empty
`node.name`
- Add index 1 to `ALWAYS_FLOAT_INPUTS["Resize"]` to protect opset 10
scales
- Fix misleading comment ("change current node's input name" → "output
name")
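The renaming idea behind the first bullet can be sketched in a few lines; the helper names below are illustrative, not the actual `float16.py` code:

```python
# Sketch of the naming fix: derive Cast node/output names from the (unique)
# tensor name being cast, not from the (possibly empty) node.name.
# Helper names and the naming scheme are illustrative, not the real code.

def cast_names_old(node_name, index):
    # Old behavior: unnamed nodes all collapse to the same string.
    return f"{node_name}_input_cast_{index}"

def cast_names_new(tensor_name, index):
    # New behavior: tensor names are unique within a valid ONNX graph,
    # so the derived Cast names are unique too.
    return f"{tensor_name}_input_cast_{index}"

# Two unnamed Resize nodes casting different scale tensors:
old = {cast_names_old("", 2), cast_names_old("", 2)}            # collides
new = {cast_names_new("scales_a", 2), cast_names_new("scales_b", 2)}
```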

**`onnxruntime/test/python/transformers/test_float16.py`** (new file, 8
tests):
- `test_resize_opset11_cast_naming_unique` — multiple unnamed Resize
nodes produce unique Cast names
- `test_resize_opset11_scales_initializer_stays_fp32` — scales
initializer preserved as float32
- `test_resize_opset10_scales_initializer_stays_fp32` — opset 10 scales
protected at index 1
- `test_resize_opset10_multiple_unnamed_unique_names` — opset 10 naming
uniqueness
- `test_blocked_node_cast_naming_unique` — blocked op nodes (Upsample)
also get unique Cast names
- `test_resize_with_op_block_list` — Resize in op_block_list still
produces unique names
- `test_data_input_converted_to_fp16` — data tensor correctly converts
to fp16
- `test_force_fp16_initializers` — force flag overrides protection

## Test Plan
- All 8 new tests pass locally (`python -m unittest
test_float16.TestFloat16Conversion -v`)
- Existing `test_gpt2_past_fp16` test passes (no regression in existing
float16 behavior)
- `ruff check` passes on both files
…icrosoft#27458)

## Summary
- Add SDPA-aware pattern matching to `FusionBartAttention` so that BART
attention fusion succeeds on models exported with HuggingFace
Transformers >= 4.49
- Add a synthetic BART SDPA graph generator and unit test

## Motivation
Fixes microsoft#23864

HuggingFace Transformers >= 4.49 replaced `BartAttention` with
`BartSdpaAttention` ([commit
`2c47618`](huggingface/transformers@2c47618)),
changing the ONNX export graph topology in several ways that broke
`FusionBartAttention` pattern matching. Running `optimize_model(...,
model_type="bart")` on these newer exports produces **zero** fused
Attention nodes.

## Changes

### `fusion_bart_attention.py`

The SDPA refactor introduces four structural changes to the attention
subgraph. Each required a new match path:

1. **QKV output path — LayerNormalization anchor fallback**
For SDPA models, symbolic shape inference often fails, which prevents
SkipLayerNormalization fusion. When the anchor node is a plain
`LayerNormalization` instead of `SkipLayerNormalization`, there's an
extra residual `Add` between the LayerNorm and the attention output
projection. Added a fallback match: `["Add", "Add", "MatMul", "Reshape",
"Transpose", "MatMul"]` with `[0, None, 0, 0, 0, 0]`.

2. **QK path — NaN guard (Where + IsNaN)**
SDPA wraps the Softmax output in a NaN guard: `Where(IsNaN(softmax),
0.0, softmax)`. The `Where` node's input[2] is the Softmax output. Added
two new QK paths:
   - No mask: `["Where", "Softmax", "MatMul"]` with `[0, 2, 0]`
   - With mask: `["Where", "Softmax", "Add", "MatMul"]` with `[0, 2, 0, 0]`

3. **Q and K scaling paths**
Instead of a single combined scale on the QK MatMul output, SDPA applies
separate `Mul(1/sqrt(head_dim))` to Q and K before the QK MatMul. Added:
   - Q path: `["Mul", "Transpose", "Reshape", "Add", "MatMul"]` with `[0, 0, 0, 0, None]`
   - K path: `["Mul", "Reshape", "Transpose", "Reshape", "Transpose", "Reshape", "Add", "MatMul"]` with `[1, 0, 0, 0, 0, 0, 0, None]` (K^T uses a `Reshape→Transpose(0,2,1)→Reshape` chain)

4. **num_heads fallback for dynamic shapes**
SDPA models use `-1` in reshape shape tensors for dynamic dimensions,
causing `get_num_heads_and_hidden_size` to return negative values. Added
a fallback to user-specified `num_heads`/`hidden_size` when detected
values are invalid.
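The fallback in point 4 amounts to a small guard; a minimal sketch (function and parameter names are hypothetical, not the actual `fusion_bart_attention.py` code):

```python
# Sketch of the dynamic-shape fallback: when values detected from reshape
# shape tensors are invalid (e.g. -1 for a dynamic dimension), fall back
# to the user-specified hyperparameters instead of using negative values.

def resolve_heads(detected_num_heads, detected_hidden_size,
                  user_num_heads, user_hidden_size):
    if detected_num_heads <= 0 or detected_hidden_size <= 0:
        return user_num_heads, user_hidden_size
    return detected_num_heads, detected_hidden_size
```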

### `bart_model_generator.py` (new)

Synthetic BART SDPA attention graph generator that builds a minimal but
complete attention subgraph matching the SDPA topology. Tests both
`with_mask=True` (decoder self-attention) and `with_mask=False` (encoder
attention) variants.

### `test_attention_fusion.py`

Added `test_bart_attention_sdpa_fusion` that verifies:
- 1 Attention node is produced for each mask variant
- Correct `num_heads` attribute
- Correct `unidirectional` attribute (1 for decoder self-attention with
mask, 0 for encoder)

## Test Plan
- [x] `python -m pytest test_attention_fusion.py -v` — all 10 tests pass
- [x] `lintrunner` on all 3 changed files — no issues
- [x] Verified on real exported BART SDPA model
(`hf-internal-testing/tiny-random-bart`): 2 Attention nodes fused, graph
reduced from 120 → 34 nodes
…osoft#27428)

Replace and reland microsoft#27129

Comparison between this PR approach and inline in softmax

## Tradeoffs

| Category | Pre-conversion (current) | Inline in softmax |
| :--- | :--- | :--- |
| **Memory** | Extra buffer ($num\_elements \times sizeof(T)$) | None —
reads 1-byte bool directly |
| **Kernel launches** | +1 simple elementwise kernel | Zero extra |
| **Code complexity** | 3 files, ~40 lines added | 6+ kernel templates,
macros, dispatch logic, data structs |
| **Risk** | Low — existing softmax path untouched | High — modifying
battle-tested softmax kernels used by MHA + GQA contrib ops |
| **Perf impact** | Negligible — mask is small vs. QKV; conversion is
memory-bound and fast | Slightly better theoretical bandwidth |
| **Maintainability** | Clean separation of concerns | Adds template
dimension across all softmax variants |

---

This pull request enhances the ONNX Runtime CUDA Attention operator to
support boolean attention masks (bool masks) in the Multi-Head Attention
(MHA) path, converting them to additive attention bias on the GPU. It
also improves test coverage to ensure correctness and parity with the
CPU implementation. The main changes include implementing a CUDA kernel
for mask conversion, updating the operator logic to handle bool masks,
clarifying broadcasting rules, and adding comprehensive unit tests.

**CUDA Attention Operator Improvements:**

* Implemented a CUDA kernel (`LaunchConvertBoolMaskToAttentionBias`)
that converts boolean attention masks to additive bias (True → 0.0,
False → mask_filter_value) for the MHA path, ensuring efficient GPU
execution.
* Updated `attention.cc` to use this kernel, correctly handle bool masks
in the MHA path, and clarified the broadcasting logic and mask shape
interpretation for both GQA and MHA.
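The True → 0.0 / False → mask_filter_value conversion is simple to state in plain Python; this is a pure-Python sketch of what the CUDA kernel computes per element, and the default filter value shown is an assumption:

```python
# Sketch of the bool-mask-to-additive-bias conversion done on device by the
# CUDA kernel: True keeps a position (bias 0.0), False masks it out by
# contributing mask_filter_value (a large negative number) before softmax.
# The -10000.0 default here is an assumption for illustration.

def mask_to_bias(mask, mask_filter_value=-10000.0):
    return [0.0 if keep else mask_filter_value for keep in mask]

bias = mask_to_bias([True, True, False])
```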

**Testing and Documentation Enhancements:**

* Added new test cases and a dedicated test class to validate the
correctness of boolean mask handling in the MHA path, ensuring parity
with the CPU implementation for 2D, 3D, and 4D mask shapes.
* Improved comments and documentation in both code and tests to clarify
ONNX broadcasting rules and mask shape expectations for different
attention paths.

**Test Coverage and Reliability:**

* Enabled CUDA-based tests for boolean mask scenarios previously only
tested on CPU, and adjusted test logic to ensure correct handling of
edge cases (e.g., all-false masks).

These changes make the CUDA Attention operator more robust and
feature-complete, aligning its behavior with the CPU implementation and
ONNX specifications.
…icrosoft#27470)

Heap-allocate `WebGpuContextFactory::contexts_` to avoid crashes during
static destruction when dependent DLLs (e.g. dxcompiler.dll) have
already been unloaded. `Cleanup()` explicitly deletes the map; on
abnormal termination it safely leaks.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…27480)

### Description

### Motivation and Context

Resolves ongoing security issues with auditing the jar testing pipeline
in the backend.
…7478)

### Description



### Motivation and Context
…icrosoft#27474)

### Description
Improves the error message returned by ORT when loading a compiled model
in a session that does not have the required execution provider(s).

For example, ORT now returns the following error when creating a session
with a model compiled explicitly for the OpenVINO EP, without adding the
OpenVINO EP to the session:

> EPContext node generated by 'OpenVINOExecutionProvider' is not
compatible with any execution provider added to the session. EPContext
node name: 'EPContextNode0'. Available session execution providers:
[CPUExecutionProvider].

Compare the above message with the more generic message that is
currently returned:

> Could not find an implementation for EPContext(1) node with name
'EPContextNode0'

### Motivation and Context
Improves diagnosability when loading of pre-compiled models fails.
Specifically, the ambiguity of the original message led to many hours
spent debugging an error where a compiled model failed to run because
the expected `OrtEpDevice` was inadvertently not added to a session.
… compatibility (microsoft#27484)

This pull request refactors validation logic for CUDA attention masks
and tensor scatter operations to move error checking from host-side
(CPU) to device-side (GPU) using CUDA kernel assertions
(`CUDA_KERNEL_ASSERT`). This change eliminates synchronous host-device
memory transfers and stream synchronizations, improving performance and
simplifying code. Corresponding test cases are updated to only expect
validation failures on the CPU, as CUDA errors are now asynchronous.

Key changes:

**Attention mask validation (GQA path):**
- Removes host-side validation and memory copies for boolean attention
masks in `attention.cc`; mask validity (right-padding, contiguous
True/False) is now checked asynchronously via `CUDA_KERNEL_ASSERT` in
the CUDA kernel.
- Updates the CUDA kernel and its interface to drop the
`validation_result` buffer and rely on device assertions for mask
validation. Documentation is updated to reflect this asynchronous error
checking.
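The property the device assertion enforces (right-padding: contiguous True followed only by contiguous False) is easy to illustrate host-side; a pure-Python sketch with a hypothetical helper name:

```python
# Sketch of the validity check now done with CUDA_KERNEL_ASSERT on device:
# a GQA bool mask row must be right-padded, i.e. all True values come
# before all False values, with no True after the first False.

def is_right_padded(row):
    seen_false = False
    for keep in row:
        if keep and seen_false:
            return False  # a True after a False => not right-padded
        if not keep:
            seen_false = True
    return True
```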

**TensorScatter write_indices validation:**
- Removes host-side validation and synchronization for `write_indices`
in `tensorscatter.cc`; index bounds checking is now performed
asynchronously inside the CUDA kernel via `CUDA_KERNEL_ASSERT`.

**Test updates:**
- Updates negative test cases for `TensorScatter` to run only on CPU,
since CUDA now validates asynchronously and will not synchronously
return errors to the host.
…crosoft#27136)

### Description
Introduces a backend kernel selector config struct in MLAS that allows
users to configure selection of backend kernels at runtime based on
their preference. The immediate use-case of such a feature is to allow
users to opt-out of using/selecting KleidiAI kernels should they choose
to do so on ARM platforms. This solution should scale to other kernel
implementation backends in the future.

### Motivation and Context
Allow users to opt-out of using/selecting KleidiAI kernels should they
choose to do so on ARM platforms

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
### Description

This PR adds a few headers for supporting building WebGPU EP and CUDA EP
as plugin EPs.

See summary of microsoft#26907
…lation… (microsoft#27359)

### Description
This change optimises the GridSample operator in ONNX Runtime.
1- A fast path is added for GridSample nodes with the characteristics found
in the camera-based 3D object detection model in the MLPerf Automotive
space: 2D, interpolation mode = linear, padding mode = zeros,
align_corners = 0.

Linear interpolation: For each (x, y), the code locates the four
surrounding integer pixel centers:

(x1, y1) = (floor(x), floor(y)) (top-left)
(x2, y2) = (x1 + 1, y1 + 1) (bottom-right)
The interpolation weights reflect the fractional positions:

dx1 = x - x1, dx2 = x2 - x
dy1 = y - y1, dy2 = y2 - y 
The resulting value is the bilinear blend dy2 * (dx2 * p11 + dx1 * p12)
+ dy1 * (dx2 * p21 + dx1 * p22) where p11…p22 are the input pixels at
those four neighbor coordinates.


Padding mode = zeros: Any neighbor index that falls outside [0, W_in-1]
× [0, H_in-1] contributes 0 to the interpolation.
Each output pixel (oy, ox) carries normalized coordinates (nx, ny) in
[-1, 1]. With align_corners=0, nx = -1 corresponds to a location half a
pixel before the leftmost input column (i.e., x = -0.5), and nx = 1
corresponds to half a pixel beyond the rightmost column (x = W_in -
0.5). Same idea vertically for ny.
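The align_corners=0 mapping above can be written as a one-line formula; a minimal sketch (the helper name is illustrative):

```python
# Sketch of align_corners=0 denormalization: map a normalized coordinate
# n in [-1, 1] to input-pixel space so that n = -1 lands half a pixel
# before index 0 and n = 1 lands half a pixel past the last index.

def denormalize(n, size):
    return ((n + 1.0) * size - 1.0) / 2.0
```

For a width of 10, `denormalize(-1.0, 10)` gives -0.5 and `denormalize(1.0, 10)` gives 9.5, matching the half-pixel behaviour described above.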

Fast Path Optimisation : The implementation can precompute all neighbor
indices/weights for each output pixel once (they depend only on the
grid), then reuse them for every channel. Previously indices and weights
were calculated inside the loops which can be as much as
(H_out*W_out like 20,000 per batch element in one case) x 32 Channels.


2- Optional ARM NEON vectorization added:
   - vld1_f32(ptr): loads two contiguous float values into a float32x2_t.
     Used to read the top and bottom neighbor pairs ([p11, p12], [p21, p22]).
   - vcombine_f32(low, high): concatenates two float32x2_t values into one
     float32x4_t, giving [p11, p12, p21, p22].
   - vdup_n_f32(val): duplicates a scalar float into both lanes of a
     float32x2_t.
   - vset_lane_f32(val, vec, lane): writes val into the specified lane of a
     float32x2_t, letting us form [w11, w12] and [w21, w22].
   - vmulq_f32(a, b): multiplies two float32x4_t vectors element-wise
     (neighbor pixels × weights).
   - vget_low_f32(vec) / vget_high_f32(vec): extract the lower or upper 2
     lanes from a float32x4_t as float32x2_t.
   - vadd_f32(a, b): adds two float32x2_t vectors element-wise (forming
     partial sums).
   - vpadd_f32(a, b): performs pairwise adds within and across two
     float32x2_t vectors, collapsing four elements down to two.
   - vget_lane_f32(vec, lane): reads a scalar from a specific lane, giving
     the final interpolated value.
Most of the performance uplift comes from the 1st optimisation; the NEON
intrinsics in the 2nd optimisation also contribute, but to a lesser
extent.
Overall performance improvement :

1 thread :
<img width="902" height="766" alt="image"
src="https://github.com/user-attachments/assets/d1fadc6d-370d-4750-baee-1123c7d18af3"
/>


 2 threads:
<img width="902" height="766" alt="image"
src="https://github.com/user-attachments/assets/69c86fd6-815a-4b52-8f86-615f1c99bf0a"
/>



### Motivation and Context
The fast path handles denormalisation of the linear coordinates and can
handle the derivation of the indices by precomputing a separate plan
entry per output pixel. In PrecomputeBilinearSamplePlan2D, the loop runs
across all H_out * W_out points, using the right nx/ny for each (oy, ox)
and storing that point’s four indices, four weights, and mask in
plans[idx].
During evaluation, EvaluatePlanForChannel iterates through the same
point_count (H_out*W_out) and uses the matching plan entry for each (oy,
ox). So we are not reusing one plan across different spatial positions;
we precompute one plan per output location and reuse it only across
channels (which share the same grid).

---------

Signed-off-by: melkap01 <melike.kaptan@arm.com>
Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
microsoft#27483)

### Description
This PR reduces the size of the memory allocation for expected outputs
from ~4GiB to ~2GiB in the Gather_overflow_check test. The updated test
still verifies that the integer overflow fix from PR
microsoft#27444 is valid. That is,
that the CPU Gather operator correctly handles output tensors with
element counts that exceed INT32_MAX.

Changes:

- Reduced test dimension from 65537 to 46341 (output shape from
65537×65537 to 46341×46341), giving a total element count just over
INT32_MAX, which is required to exercise the bug fix.
  - The peak memory usage is reduced to ~4GiB + overhead.
- Increase Android emulator memory to 5GiB (from 4GiB) to be able to run
the test.
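The arithmetic behind the dimension choice checks out: 46341 is the smallest integer whose square exceeds INT32_MAX, so the shrunken tensor still triggers the overflow path at roughly half the element count of 65537×65537.

```python
# Verify that 46341 is the minimal dimension whose square still
# overflows a signed 32-bit element count, while 46340 does not.
INT32_MAX = 2**31 - 1

assert 46341 * 46341 > INT32_MAX       # still exercises the overflow fix
assert 46340 * 46340 <= INT32_MAX      # one less would not
assert 46341 * 46341 < 65537 * 65537 / 2 + 1  # roughly half the elements
```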

### Motivation
Android CI fails to run the unit test introduced in
microsoft#27444 due to memory usage
that exceeds the Android emulator's default memory of 4GiB. This PR
lowers the peak memory usage of the unit test and increases the Android
emulator's memory by 1GiB.
…() (microsoft#27295)

Remove unnecessary s_kernel_registry_vitisaiep.reset() call in
deinitialize_vitisai_ep() function. The kernel registry will be
repopulated on next initialization, making this reset redundant.
…hains (microsoft#27391)

### Description
Profiling shows that CheckIfSubtreesAreEqual is invoked recursively for
many node pairs for LightGBM models with categorical features. A
significant portion of this work consists of self-comparisons (left_id
== right_id), leading to effectively O(n²) work comparing trees to
themselves during model loading.

This change adds a fast-path for trivial equality, avoiding unnecessary
recursive comparisons.

Example results:
 - model with 7K BRANCH_EQ nodes: 527 ms → 47 ms (~11× faster)
 - model with 106K BRANCH_EQ nodes: 141 s → 80 ms (~1760× faster)
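The fast path itself is a one-line short-circuit before the recursion; a minimal sketch using a simplified, hypothetical tree encoding (nodes as a dict of `(feature, left_id, right_id)` tuples with -1 as the leaf sentinel), not the actual ORT data structures:

```python
# Sketch of the trivial-equality fast path: a subtree is always equal to
# itself, so same-id comparisons return immediately instead of recursing.

def subtrees_equal(nodes, left_id, right_id):
    if left_id == right_id:          # fast path: avoids O(n^2) self-compares
        return True
    if left_id < 0 or right_id < 0:  # leaf sentinel
        return left_id == right_id
    fa, la, ra = nodes[left_id]
    fb, lb, rb = nodes[right_id]
    return (fa == fb
            and subtrees_equal(nodes, la, lb)
            and subtrees_equal(nodes, ra, rb))
```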

### Motivation and Context
We have some LightGBM exported models that make heavy use of categorical
features and exhibit extremely slow load times (minutes for a single
2.5 MB model).

Here's a diagram to illustrate the issue:
<img width="1008" height="1229" alt="image"
src="https://github.com/user-attachments/assets/348e16cb-9eec-448f-ac5c-e1edb60e2a3d"
/>

the 106K model has much longer "member of" chains, with chains that lead
into more chains:

<details>
  <summary>"trees"</summary>
  
<img width="1405" height="593" alt="image"
src="https://github.com/user-attachments/assets/12f0c43f-5987-4b33-9001-2a2b526e537f"
/>

  
</details>

Interestingly, we also tried using the new onnx.ml opset 5 node that has
MEMBER, but it seems even slower, as it recreates these BRANCH_EQ
chains.
)

### Description
Add support for pre-layer normalization (pre-LN) transformer
architectures in the Python attention fusion optimizer.

### Motivation and Context
Fixes microsoft#11684

Pre-LN models (used in GPT-3, ViT variants, and many modern
architectures) apply LayerNormalization **before** attention rather than
after. The first block of a pre-LN model has no `Add` node before its
first `LayerNormalization` — its input comes directly from a graph
input. This caused `FusionAttention.fuse()` to bail out early because it
assumed every `LayerNormalization` anchor has an `Add` parent (the
residual connection from the previous block).

This PR makes four surgical changes to `fusion_attention.py` so that
pre-LN first-block models fuse correctly, while preserving all existing
post-LN behavior:

1. **Allow LN with graph-input parent** — instead of returning early
when no `Add` parent is found, check whether the input is a graph input
and continue
2. **Include graph inputs in residual collection** — the `other_inputs`
loop previously skipped anything not in `output_name_to_node`; graph
inputs are now recognized
3. **Extend child-LN resolution to SkipLN anchors** — after
`fuse_skip_layer_norm()` runs, the anchor becomes
`SkipLayerNormalization`; the redirect from `root_input` (graph input)
to the first LN's output now fires for SkipLN anchors too
4. **Guard `output_name_to_node` lookup** — graph inputs are not node
outputs, so the dictionary access is now guarded
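Changes 1 and 4 combine into a small guarded lookup; a minimal sketch (function name and the dict-based node shape are hypothetical, not the actual `fusion_attention.py` code):

```python
# Sketch of the pre-LN-aware root-input resolution: instead of bailing out
# when the LayerNormalization input has no Add parent, accept the case
# where the input is a graph input (pre-LN first block). The lookup is
# guarded because graph inputs are not node outputs.

def ln_root_input(ln_input, output_name_to_node, graph_input_names):
    parent = output_name_to_node.get(ln_input)   # guarded lookup (change 4)
    if parent is not None and parent["op_type"] == "Add":
        return ln_input                          # post-LN: residual Add parent
    if ln_input in graph_input_names:
        return ln_input                          # pre-LN first block (change 1)
    return None                                  # still unsupported
```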

### Changes
- `onnxruntime/python/tools/transformers/fusion_attention.py` — 4
targeted edits to `FusionAttention.fuse()`
- `onnxruntime/test/python/transformers/bert_model_generator.py` — new
`create_bert_attention_pre_ln()` test graph generator
- `onnxruntime/test/python/transformers/test_attention_fusion.py` — new
`test_attention_fusion_pre_ln()` test

### Test Plan
- [x] New unit test `test_attention_fusion_pre_ln` passes — verifies
`Attention` fused op appears in the optimized graph
- [x] Lintrunner passes on all changed files (no lint issues)
- [x] Changes are minimal and scoped to the pre-LN first-block gap
@ankitm3k ankitm3k closed this Mar 4, 2026
@ankitm3k ankitm3k deleted the sync_msft_02032026 branch March 4, 2026 14:13
