
Sync with Microsoft ONNX Runtime - 04/03/2026 #957

Merged
ankitm3k merged 31 commits into ovep-develop from sync_msft_04032026
Mar 4, 2026

Conversation

@Jaswanth51

Synchronizing intel/onnxruntime ovep-develop branch with latest changes from microsoft/onnxruntime master branch.

Rishi-Dave and others added 30 commits February 27, 2026 06:58
…sion (microsoft#27469)

## Summary
- Fix Cast node naming collisions in `convert_float_to_float16` when
nodes have empty names (common in PyTorch exports)
- Fix `ALWAYS_FLOAT_INPUTS` for opset-10 Resize, where the scales input
at index 1 was unprotected
- Add dedicated test suite for float16 conversion (`test_float16.py`, 8
tests)

## Motivation
Fixes microsoft#14827

When `convert_float_to_float16` processes models with unnamed nodes
(empty `node.name`, very common in PyTorch/TensorFlow-exported ONNX
models), the generated Cast node names collide. For example, multiple
Resize nodes all produce Cast nodes named `"_input_cast_2"` and output
tensors named `"_input_cast_2"`, corrupting the graph with duplicate
names.

Additionally, the `ALWAYS_FLOAT_INPUTS` dict only protected Resize
scales at index 2 (opset 11+ layout: `[X, roi, scales, sizes]`), but
opset 10 Resize has scales at index 1 (`[X, scales]`), leaving it
unprotected.

## Changes
**`onnxruntime/python/tools/transformers/float16.py`** (11 lines
changed):
- Use unique tensor names (`input_name`/`output`) as the base for
generated Cast node and output names, instead of potentially-empty
`node.name`
- Add index 1 to `ALWAYS_FLOAT_INPUTS["Resize"]` to protect opset 10
scales
- Fix misleading comment ("change current node's input name" → "output
name")
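The renaming scheme above can be sketched as follows (a hypothetical helper for illustration only; the actual logic lives in `float16.py`):

```python
import itertools

# Illustrative sketch: derive Cast node names from the (unique) tensor
# name rather than node.name, which is often empty in PyTorch/TensorFlow
# exports. A counter resolves any residual collisions.
_counter = itertools.count()

def make_cast_name(tensor_name: str, used_names: set) -> str:
    base = f"{tensor_name}_input_cast"
    name = base
    while name in used_names:
        name = f"{base}_{next(_counter)}"
    used_names.add(name)
    return name
```

Because tensor names must already be unique within a valid ONNX graph, basing the Cast name on them avoids the `"_input_cast_2"` collisions described above.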

**`onnxruntime/test/python/transformers/test_float16.py`** (new file, 8
tests):
- `test_resize_opset11_cast_naming_unique` — multiple unnamed Resize
nodes produce unique Cast names
- `test_resize_opset11_scales_initializer_stays_fp32` — scales
initializer preserved as float32
- `test_resize_opset10_scales_initializer_stays_fp32` — opset 10 scales
protected at index 1
- `test_resize_opset10_multiple_unnamed_unique_names` — opset 10 naming
uniqueness
- `test_blocked_node_cast_naming_unique` — blocked op nodes (Upsample)
also get unique Cast names
- `test_resize_with_op_block_list` — Resize in op_block_list still
produces unique names
- `test_data_input_converted_to_fp16` — data tensor correctly converts
to fp16
- `test_force_fp16_initializers` — force flag overrides protection

## Test Plan
- All 8 new tests pass locally (`python -m unittest
test_float16.TestFloat16Conversion -v`)
- Existing `test_gpt2_past_fp16` test passes (no regression in existing
float16 behavior)
- `ruff check` passes on both files
…icrosoft#27458)

## Summary
- Add SDPA-aware pattern matching to `FusionBartAttention` so that BART
attention fusion succeeds on models exported with HuggingFace
Transformers >= 4.49
- Add a synthetic BART SDPA graph generator and unit test

## Motivation
Fixes microsoft#23864

HuggingFace Transformers >= 4.49 replaced `BartAttention` with
`BartSdpaAttention` ([commit
`2c47618`](huggingface/transformers@2c47618)),
changing the ONNX export graph topology in several ways that broke
`FusionBartAttention` pattern matching. Running `optimize_model(...,
model_type="bart")` on these newer exports produces **zero** fused
Attention nodes.

## Changes

### `fusion_bart_attention.py`

The SDPA refactor introduces four structural changes to the attention
subgraph. Each required a new match path:

1. **QKV output path — LayerNormalization anchor fallback**
For SDPA models, symbolic shape inference often fails, which prevents
SkipLayerNormalization fusion. When the anchor node is a plain
`LayerNormalization` instead of `SkipLayerNormalization`, there's an
extra residual `Add` between the LayerNorm and the attention output
projection. Added a fallback match: `["Add", "Add", "MatMul", "Reshape",
"Transpose", "MatMul"]` with `[0, None, 0, 0, 0, 0]`.

2. **QK path — NaN guard (Where + IsNaN)**
SDPA wraps the Softmax output in a NaN guard: `Where(IsNaN(softmax),
0.0, softmax)`. The `Where` node's input[2] is the Softmax output. Added
two new QK paths:
   - No mask: `["Where", "Softmax", "MatMul"]` with `[0, 2, 0]`
   - With mask: `["Where", "Softmax", "Add", "MatMul"]` with `[0, 2, 0, 0]`

3. **Q and K scaling paths**
Instead of a single combined scale on the QK MatMul output, SDPA applies
separate `Mul(1/sqrt(head_dim))` to Q and K before the QK MatMul. Added:
- Q path: `["Mul", "Transpose", "Reshape", "Add", "MatMul"]` with `[0,
0, 0, 0, None]`
- K path: `["Mul", "Reshape", "Transpose", "Reshape", "Transpose",
"Reshape", "Add", "MatMul"]` with `[1, 0, 0, 0, 0, 0, 0, None]` (K^T
uses a `Reshape→Transpose(0,2,1)→Reshape` chain)

4. **num_heads fallback for dynamic shapes**
SDPA models use `-1` in reshape shape tensors for dynamic dimensions,
causing `get_num_heads_and_hidden_size` to return negative values. Added
a fallback to user-specified `num_heads`/`hidden_size` when detected
values are invalid.
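The match paths above pair an op-type sequence with an input-index sequence (None = search all inputs). A simplified standalone sketch of that parent-path walk, not the actual Fusion API:

```python
# Hypothetical re-implementation of the parent-path matching idea used by
# the transformer fusions. Nodes are dicts; `producers` maps a tensor
# name to the node that produces it.
def match_parent_path(node, op_types, input_indices, producers):
    """Walk upward from `node`: at each step follow the given input index
    (None = try all inputs) and require the parent's op type to match."""
    path = []
    current = node
    for op_type, idx in zip(op_types, input_indices):
        candidates = current["inputs"] if idx is None else [current["inputs"][idx]]
        parent = next(
            (producers[t] for t in candidates
             if t in producers and producers[t]["op"] == op_type),
            None,
        )
        if parent is None:
            return None  # pattern does not match this subgraph
        path.append(parent)
        current = parent
    return path
```

For example, the no-mask QK path `["Where", "Softmax", "MatMul"]` with `[0, 2, 0]` walks from an anchor through its input 0 to a `Where`, through the `Where`'s input 2 (the NaN-guarded Softmax output) to a `Softmax`, then to the QK `MatMul`.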

### `bart_model_generator.py` (new)

Synthetic BART SDPA attention graph generator that builds a minimal but
complete attention subgraph matching the SDPA topology. Tests both
`with_mask=True` (decoder self-attention) and `with_mask=False` (encoder
attention) variants.

### `test_attention_fusion.py`

Added `test_bart_attention_sdpa_fusion` that verifies:
- 1 Attention node is produced for each mask variant
- Correct `num_heads` attribute
- Correct `unidirectional` attribute (1 for decoder self-attention with
mask, 0 for encoder)

## Test Plan
- [x] `python -m pytest test_attention_fusion.py -v` — all 10 tests pass
- [x] `lintrunner` on all 3 changed files — no issues
- [x] Verified on real exported BART SDPA model
(`hf-internal-testing/tiny-random-bart`): 2 Attention nodes fused, graph
reduced from 120 → 34 nodes
…osoft#27428)

Replace and reland microsoft#27129

A comparison between this PR's approach (pre-converting the mask) and an inline-in-softmax approach:

## Tradeoffs

| Category | Pre-conversion (current) | Inline in softmax |
| :--- | :--- | :--- |
| **Memory** | Extra buffer ($num\_elements \times sizeof(T)$) | None —
reads 1-byte bool directly |
| **Kernel launches** | +1 simple elementwise kernel | Zero extra |
| **Code complexity** | 3 files, ~40 lines added | 6+ kernel templates,
macros, dispatch logic, data structs |
| **Risk** | Low — softmax path untouched | High — modifying
battle-tested softmax kernels used by MHA + GQA contrib ops |
| **Perf impact** | Negligible — mask is small vs. QKV; conversion is
memory-bound and fast | Slightly better theoretical bandwidth |
| **Maintainability** | Clean separation of concerns | Adds template
dimension across all softmax variants |

---

This pull request enhances the ONNX Runtime CUDA Attention operator to
support boolean attention masks (bool masks) in the Multi-Head Attention
(MHA) path, converting them to additive attention bias on the GPU. It
also improves test coverage to ensure correctness and parity with the
CPU implementation. The main changes include implementing a CUDA kernel
for mask conversion, updating the operator logic to handle bool masks,
clarifying broadcasting rules, and adding comprehensive unit tests.

**CUDA Attention Operator Improvements:**

* Implemented a CUDA kernel (`LaunchConvertBoolMaskToAttentionBias`)
that converts boolean attention masks to additive bias (True → 0.0,
False → mask_filter_value) for the MHA path, ensuring efficient GPU
execution.
[[1]](diffhunk://#diff-00f7d49ccee44f1573357c07633bd03f21b9c2e1b1617c7a6a878a79ee6a6e49R148-R187)
[[2]](diffhunk://#diff-8aa9a15a92d7dc138346dce5de055911895d940ba2183b4ba45bd95ac0e5bfc9R55-R66)
* Updated `attention.cc` to use this kernel, correctly handle bool masks
in the MHA path, and clarified the broadcasting logic and mask shape
interpretation for both GQA and MHA.
[[1]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffR6)
[[2]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffR380-R383)
[[3]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL514-L522)
[[4]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL549-R557)
[[5]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffR595-R616)
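The conversion semantics (True → 0.0, False → mask_filter_value) can be emulated on the host in NumPy; this is a sketch of what the CUDA kernel computes, not the kernel itself, and the −10000.0 default is assumed from the operator's usual `mask_filter_value`:

```python
import numpy as np

# NumPy emulation of the bool-mask -> additive-bias conversion performed
# by LaunchConvertBoolMaskToAttentionBias on the GPU (for clarity only).
def bool_mask_to_attention_bias(mask: np.ndarray,
                                mask_filter_value: float = -10000.0) -> np.ndarray:
    # True  -> 0.0 (position attended to)
    # False -> mask_filter_value (position masked out)
    return np.where(mask, 0.0, mask_filter_value).astype(np.float32)
```

The resulting bias is simply added to the QK logits before softmax, so masked positions receive a large negative score.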

**Testing and Documentation Enhancements:**

* Added new test cases and a dedicated test class to validate the
correctness of boolean mask handling in the MHA path, ensuring parity
with the CPU implementation for 2D, 3D, and 4D mask shapes.
[[1]](diffhunk://#diff-801fbbcf2537e8e13a0202e6a0f7e88c56ab5aa72d17d949a5556355694b2b2dR563-R725)
[[2]](diffhunk://#diff-801fbbcf2537e8e13a0202e6a0f7e88c56ab5aa72d17d949a5556355694b2b2dR893-R922)
* Improved comments and documentation in both code and tests to clarify
ONNX broadcasting rules and mask shape expectations for different
attention paths.
[[1]](diffhunk://#diff-4ed1461afda0d3804a61ba95a64b2a84d0c1395f9c887d1a3fdfed914ade22c1L208-R221)
[[2]](diffhunk://#diff-801fbbcf2537e8e13a0202e6a0f7e88c56ab5aa72d17d949a5556355694b2b2dR35)

**Test Coverage and Reliability:**

* Enabled CUDA-based tests for boolean mask scenarios previously only
tested on CPU, and adjusted test logic to ensure correct handling of
edge cases (e.g., all-false masks).
[[1]](diffhunk://#diff-3ff6dfa2ce407ae0073009174c37d1756509e8bbc434dee7c44cd55a996bb777L477-R480)
[[2]](diffhunk://#diff-3ff6dfa2ce407ae0073009174c37d1756509e8bbc434dee7c44cd55a996bb777L620-R623)

These changes make the CUDA Attention operator more robust and
feature-complete, aligning its behavior with the CPU implementation and
ONNX specifications.
…icrosoft#27470)

Heap-allocate `WebGpuContextFactory::contexts_` to avoid crashes during
static destruction when dependent DLLs (e.g. dxcompiler.dll) have
already been unloaded. `Cleanup()` explicitly deletes the map; on
abnormal termination it safely leaks.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…27480)

### Description

### Motivation and Context

Resolves ongoing security issues with auditing the jar testing pipeline
in the backend.
…7478)

### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
…icrosoft#27474)

### Description
Improves the error message returned by ORT when loading a compiled model
in a session that does not have the required execution provider(s).

For example, ORT now returns the following error when creating session
with a model compiled explicitly for OpenVINO EP without adding OpenVINO
EP to the session:

> EPContext node generated by 'OpenVINOExecutionProvider' is not
compatible with any execution provider added to the session. EPContext
node name: 'EPContextNode0'. Available session execution providers:
[CPUExecutionProvider].

Compare the above message with the more generic message that is
currently returned:

>Could not find an implementation for EPContext(1) node with name
'EPContextNode0'

### Motivation and Context
Improves diagnosability when loading of pre-compiled models fails.
Specifically, the ambiguity of the original message led to many hours
spent debugging an error where a compiled model failed to run because
the expected `OrtEpDevice` was inadvertently not added to a session.
… compatibility (microsoft#27484)

This pull request refactors validation logic for CUDA attention masks
and tensor scatter operations to move error checking from host-side
(CPU) to device-side (GPU) using CUDA kernel assertions
(`CUDA_KERNEL_ASSERT`). This change eliminates synchronous host-device
memory transfers and stream synchronizations, improving performance and
simplifying code. Corresponding test cases are updated to only expect
validation failures on the CPU, as CUDA errors are now asynchronous.

Key changes:

**Attention mask validation (GQA path):**
- Removes host-side validation and memory copies for boolean attention
masks in `attention.cc`; mask validity (right-padding, contiguous
True/False) is now checked asynchronously via `CUDA_KERNEL_ASSERT` in
the CUDA kernel.
[[1]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL385-L387)
[[2]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL414-L418)
[[3]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL427-L448)
- Updates the CUDA kernel and its interface to drop the
`validation_result` buffer and rely on device assertions for mask
validation. Documentation is updated to reflect this asynchronous error
checking.
[[1]](diffhunk://#diff-00f7d49ccee44f1573357c07633bd03f21b9c2e1b1617c7a6a878a79ee6a6e49L10-R17)
[[2]](diffhunk://#diff-00f7d49ccee44f1573357c07633bd03f21b9c2e1b1617c7a6a878a79ee6a6e49L34)
[[3]](diffhunk://#diff-00f7d49ccee44f1573357c07633bd03f21b9c2e1b1617c7a6a878a79ee6a6e49L81-R76)
[[4]](diffhunk://#diff-00f7d49ccee44f1573357c07633bd03f21b9c2e1b1617c7a6a878a79ee6a6e49L104-R92)
[[5]](diffhunk://#diff-00f7d49ccee44f1573357c07633bd03f21b9c2e1b1617c7a6a878a79ee6a6e49L118)
[[6]](diffhunk://#diff-00f7d49ccee44f1573357c07633bd03f21b9c2e1b1617c7a6a878a79ee6a6e49L137)
[[7]](diffhunk://#diff-8aa9a15a92d7dc138346dce5de055911895d940ba2183b4ba45bd95ac0e5bfc9L37-L45)

**TensorScatter write_indices validation:**
- Removes host-side validation and synchronization for `write_indices`
in `tensorscatter.cc`; index bounds checking is now performed
asynchronously inside the CUDA kernel via `CUDA_KERNEL_ASSERT`.
[[1]](diffhunk://#diff-d69233ff3987fe3093132a31710b6b64cc0a32140e2a5a415a2f1f0907bd22d2L75-R76)
[[2]](diffhunk://#diff-1694a04b8ba9963cc06d651ec6a3be8aa9cb2bcb73c2438dc251ca8cdcb2eb41L31-R37)

**Test updates:**
- Updates negative test cases for `TensorScatter` to run only on CPU,
since CUDA now validates asynchronously and will not synchronously
return errors to the host.
[[1]](diffhunk://#diff-8c90e642cc0cf4e68b2f3d4e4b3f1e21bf6d07f01663d424bc52c75ad0db2dfeR300)
[[2]](diffhunk://#diff-8c90e642cc0cf4e68b2f3d4e4b3f1e21bf6d07f01663d424bc52c75ad0db2dfeL311-R319)
[[3]](diffhunk://#diff-8c90e642cc0cf4e68b2f3d4e4b3f1e21bf6d07f01663d424bc52c75ad0db2dfeL327-R339)
[[4]](diffhunk://#diff-8c90e642cc0cf4e68b2f3d4e4b3f1e21bf6d07f01663d424bc52c75ad0db2dfeL342-R354)
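The invariant the kernel now asserts on-device (each mask row is "right-padded": a contiguous run of True followed only by False) can be expressed as a host-side sketch:

```python
import numpy as np

# Host-side sketch of the invariant CUDA_KERNEL_ASSERT now checks
# on-device for boolean key-padding masks in the GQA path.
def is_right_padded(mask: np.ndarray) -> bool:
    for row in np.atleast_2d(mask):
        seen_false = False
        for v in row:
            if v and seen_false:
                return False  # a True after a False breaks contiguity
            seen_false = seen_false or not v
    return True
```

On the GPU this check runs asynchronously, which is why the negative tests were moved to CPU: a violated assertion no longer surfaces as a synchronous host-side error.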
…crosoft#27136)

### Description
Introduces a backend kernel selector config struct in MLAS that allows
users to configure selection of backend kernels at runtime based on
their preference. The immediate use-case of such a feature is to allow
users to opt-out of using/selecting KleidiAI kernels should they choose
to do so on ARM platforms. This solution should scale to other kernel
implementation backends in the future.

### Motivation and Context
Allow users to opt-out of using/selecting KleidiAI kernels should they
choose to do so on ARM platforms

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
### Description

This PR adds a few headers for supporting building WebGPU EP and CUDA EP
as plugin EPs.

See summary of microsoft#26907
…lation… (microsoft#27359)

### Description
This change optimises the GridSample operator in ONNX Runtime.
1. Added a fast path for GridSample nodes with the characteristics found
in the camera-based 3D object detection model in the MLPerf Automotive
suite: coordinates are transformed from input to output with 2D
interpolation mode = linear, padding mode = zeros, and align_corners = 0.
Linear interpolation: For each (x, y), the code locates the four
surrounding integer pixel centers:

(x1, y1) = (floor(x), floor(y)) (top-left)
(x2, y2) = (x1 + 1, y1 + 1) (bottom-right)
The interpolation weights reflect the fractional positions:

dx1 = x - x1, dx2 = x2 - x
dy1 = y - y1, dy2 = y2 - y 
The resulting value is the bilinear blend dy2 * (dx2 * p11 + dx1 * p12)
+ dy1 * (dx2 * p21 + dx1 * p22) where p11…p22 are the input pixels at
those four neighbor coordinates.


Padding mode = zeros: Any neighbor index that falls outside [0, W_in-1]
× [0, H_in-1] contributes 0 to the interpolation.
Each output pixel (oy, ox) carries normalized coordinates (nx, ny) in
[-1, 1]. With align_corners=0, nx = -1 corresponds to a location half a
pixel before the leftmost input column (i.e., x = -0.5), and nx = 1
corresponds to half a pixel beyond the rightmost column (x = W_in -
0.5). Same idea vertically for ny.

Fast path optimisation: the implementation precomputes all neighbor
indices/weights for each output pixel once (they depend only on the
grid), then reuses them for every channel. Previously, indices and
weights were recalculated inside the loops, which can amount to
H_out*W_out (about 20,000 per batch element in one case) × 32 channels
evaluations.
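The precompute-then-reuse scheme can be sketched in NumPy (hypothetical function names; align_corners=0 denormalization and zeros padding as described above):

```python
import numpy as np

# Sketch of the fast path: precompute the four neighbor indices and
# fractional weights per output pixel once, then reuse the plan for
# every channel (the plan depends only on the grid).
def precompute_plan(grid, h_in, w_in):
    # grid: (H_out, W_out, 2) holding normalized (nx, ny) in [-1, 1]
    nx, ny = grid[..., 0], grid[..., 1]
    # align_corners=0 denormalization: -1 -> -0.5, +1 -> size - 0.5
    x = ((nx + 1.0) * w_in - 1.0) / 2.0
    y = ((ny + 1.0) * h_in - 1.0) / 2.0
    x1, y1 = np.floor(x).astype(int), np.floor(y).astype(int)
    dx1, dy1 = x - x1, y - y1  # fractional offsets toward x2/y2
    return x1, y1, dx1, dy1

def sample_channel(ch, x1, y1, dx1, dy1):
    h_in, w_in = ch.shape
    def fetch(yi, xi):  # padding_mode=zeros: out-of-range reads give 0
        valid = (xi >= 0) & (xi < w_in) & (yi >= 0) & (yi < h_in)
        return np.where(valid, ch[np.clip(yi, 0, h_in - 1),
                                  np.clip(xi, 0, w_in - 1)], 0.0)
    p11, p12 = fetch(y1, x1), fetch(y1, x1 + 1)
    p21, p22 = fetch(y1 + 1, x1), fetch(y1 + 1, x1 + 1)
    dx2, dy2 = 1.0 - dx1, 1.0 - dy1
    # bilinear blend from the description above
    return dy2 * (dx2 * p11 + dx1 * p12) + dy1 * (dx2 * p21 + dx1 * p22)
```

`precompute_plan` runs once per grid; `sample_channel` is then called once per channel with the same plan, mirroring the per-output-location plan entries described in the Motivation section.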


2. Optional ARM NEON vectorization added:
- vld1_f32(ptr): loads two contiguous float values into a float32x2_t.
Used to read the top and bottom neighbor
    pairs ([p11, p12], [p21, p22]).
  - vcombine_f32(low, high): concatenates two float32x2_t values into
one float32x4_t, giving [p11, p12, p21, p22].
  - vdup_n_f32(val): duplicates a scalar float into both lanes of a
float32x2_t.
  - vset_lane_f32(val, vec, lane): writes val into the specified lane of
a float32x2_t, letting us form [w11, w12] and
    [w21, w22].
  - vmulq_f32(a, b): multiplies two float32x4_t vectors element-wise
(neighbor pixels × weights).
  - vget_low_f32(vec) / vget_high_f32(vec): extract the lower or upper 2
lanes from a float32x4_t as float32x2_t.
  - vadd_f32(a, b): adds two float32x2_t vectors element-wise (forming
partial sums).
  - vpadd_f32(a, b): performs pairwise adds within and across two
float32x2_t vectors, collapsing four elements down
    to two.
  - vget_lane_f32(vec, lane): reads a scalar from a specific lane,
giving the final interpolated value.
Most of the performance uplift comes from the first optimisation; the
NEON vectorization also contributes, but less.
Overall performance improvement :

1 thread :
<img width="902" height="766" alt="image"
src="https://github.com/user-attachments/assets/d1fadc6d-370d-4750-baee-1123c7d18af3"
/>


 2 threads:
<img width="902" height="766" alt="image"
src="https://github.com/user-attachments/assets/69c86fd6-815a-4b52-8f86-615f1c99bf0a"
/>



### Motivation and Context
The fast path handles denormalisation of the linear coordinates and can
handle the derivation of the indices by precomputing a separate plan
entry per output pixel. In PrecomputeBilinearSamplePlan2D, the loop runs
across all H_out * W_out points, using the right nx/ny for each (oy, ox)
and storing that point’s four indices, four weights, and mask in
plans[idx].
During evaluation, EvaluatePlanForChannel iterates through the same
point_count(H_out*W_out) and uses the matching plan entry for each (oy,
ox). So we are not reusing one plan across different spatial positions;
we precompute one plan per output location and reuse it only across
channels (which share the same grid).

---------

Signed-off-by: melkap01 <melike.kaptan@arm.com>
Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
microsoft#27483)

### Description
This PR reduces the size of the memory allocation for expected outputs
from ~4GiB to ~2GiB in the Gather_overflow_check test. The updated test
still verifies that the integer overflow fix from PR
microsoft#27444 is valid. That is,
that the CPU Gather operator correctly handles output tensors with
element counts that exceed INT32_MAX.

Changes:

- Reduced the test dimension from 65537 to 46341 (output shape from
65537×65537 to 46341×46341), giving a total element count just over
INT32_MAX, as required to exercise the bug fix.
  - The peak memory usage is reduced to ~4GiB + overhead.
- Increase Android emulator memory to 5GiB (from 4GiB) to be able to run
the test.

### Motivation
Android CI fails to run the unit test introduced in
microsoft#27444 due to memory usage
that exceeds the Android emulator's default memory of 4GiB. This PR
lowers the peak memory usage of the unit test and increases the Android
emulator's memory by 1GiB.
…() (microsoft#27295)

Remove unnecessary s_kernel_registry_vitisaiep.reset() call in
deinitialize_vitisai_ep() function. The kernel registry will be
repopulated on next initialization, making this reset redundant.
…hains (microsoft#27391)

### Description
Profiling shows that CheckIfSubtreesAreEqual is invoked recursively for
many node pairs for LightGBM models with categorical features. A
significant portion of this work consists of self-comparisons (left_id
== right_id), leading to effectively O(n²) comparing trees to themselves
during model loading.

This change adds a fast-path for trivial equality, avoiding unnecessary
recursive comparisons.

Example results:
 - model with 7K BRANCH_EQ nodes: 527 ms → 47 ms (~11× faster)
 - model with 106K BRANCH_EQ nodes: 141 s → 80 ms (~1760× faster)
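The fast path is a one-line identity check before the structural recursion; a sketch with a hypothetical node layout (the real code is the TreeEnsemble model loader):

```python
# Sketch of the fast path: identical node ids trivially denote equal
# subtrees, so the recursive structural comparison is skipped entirely.
def subtrees_equal(left_id, right_id, nodes):
    if left_id == right_id:  # fast path added by this change
        return True
    l, r = nodes[left_id], nodes[right_id]
    if (l["mode"], l["value"]) != (r["mode"], r["value"]):
        return False
    if l["mode"] == "LEAF":
        return True
    return (subtrees_equal(l["true"], r["true"], nodes)
            and subtrees_equal(l["false"], r["false"], nodes))
```

Since the profiled hot path was dominated by self-comparisons (left_id == right_id), the identity check alone removes most of the O(n²) work.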

### Motivation and Context
We have some LightGBM-exported models that make heavy use of categorical
features and exhibit extremely slow load times (minutes for a single
2.5 MB model).

Here's a diagram to illustrate the issue:
<img width="1008" height="1229" alt="image"
src="https://github.com/user-attachments/assets/348e16cb-9eec-448f-ac5c-e1edb60e2a3d"
/>

The 106K model has much longer "member of" chains, with chains that lead
into more chains:

<details>
  <summary>"trees"</summary>
  
<img width="1405" height="593" alt="image"
src="https://github.com/user-attachments/assets/12f0c43f-5987-4b33-9001-2a2b526e537f"
/>

  
</details>

Interestingly, we also tried the new onnx.ml opset 5 node that has
MEMBER, but it seemed even slower, as it recreates these BRANCH_EQ
chains.
)

### Description
Add support for pre-layer normalization (pre-LN) transformer
architectures in the Python attention fusion optimizer.

### Motivation and Context
Fixes microsoft#11684

Pre-LN models (used in GPT-3, ViT variants, and many modern
architectures) apply LayerNormalization **before** attention rather than
after. The first block of a pre-LN model has no `Add` node before its
first `LayerNormalization` — its input comes directly from a graph
input. This caused `FusionAttention.fuse()` to bail out early because it
assumed every `LayerNormalization` anchor has an `Add` parent (the
residual connection from the previous block).

This PR makes four surgical changes to `fusion_attention.py` so that
pre-LN first-block models fuse correctly, while preserving all existing
post-LN behavior:

1. **Allow LN with graph-input parent** — instead of returning early
when no `Add` parent is found, check whether the input is a graph input
and continue
2. **Include graph inputs in residual collection** — the `other_inputs`
loop previously skipped anything not in `output_name_to_node`; graph
inputs are now recognized
3. **Extend child-LN resolution to SkipLN anchors** — after
`fuse_skip_layer_norm()` runs, the anchor becomes
`SkipLayerNormalization`; the redirect from `root_input` (graph input)
to the first LN's output now fires for SkipLN anchors too
4. **Guard `output_name_to_node` lookup** — graph inputs are not node
outputs, so the dictionary access is now guarded
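Change 1 and the guarded lookup of change 4 can be sketched together (hypothetical helper name; not the actual Fusion API):

```python
# Sketch: accept a LayerNormalization anchor whose input is a graph input
# (pre-LN first block) instead of requiring an Add parent (post-LN).
def resolve_root_input(ln_node, output_name_to_node, graph_input_names):
    ln_input = ln_node["inputs"][0]
    parent = output_name_to_node.get(ln_input)  # guarded lookup (change 4)
    if parent is not None and parent["op"] == "Add":
        return ln_input  # post-LN: residual Add parent, as before
    if ln_input in graph_input_names:
        return ln_input  # pre-LN first block: input comes from the graph
    return None          # genuinely unsupported anchor
```

A graph input is never a key of `output_name_to_node` (it is not produced by any node), which is why the unguarded dictionary access failed for pre-LN first blocks.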

### Changes
- `onnxruntime/python/tools/transformers/fusion_attention.py` — 4
targeted edits to `FusionAttention.fuse()`
- `onnxruntime/test/python/transformers/bert_model_generator.py` — new
`create_bert_attention_pre_ln()` test graph generator
- `onnxruntime/test/python/transformers/test_attention_fusion.py` — new
`test_attention_fusion_pre_ln()` test

### Test Plan
- [x] New unit test `test_attention_fusion_pre_ln` passes — verifies
`Attention` fused op appears in the optimized graph
- [x] Lintrunner passes on all changed files (no lint issues)
- [x] Changes are minimal and scoped to the pre-LN first-block gap
…osoft#27490)

### Description
<!-- Describe your changes. -->

Split out Linux CUDA Python package builds into separate stages.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Reduce overall packaging pipeline time by running Python version builds
in separate stages, allowing them to run in parallel.

Example build:
https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=1102253&view=results
Reduced time from ~3h30m to 1h38m.
…ompiler on Linux builds (microsoft#27454)

### Description
Suppress spurious Array Out of Bounds warnings produced by GCC 14.2
compiler on Linux builds

### Motivation and Context
Linux build fails when compiled with GCC 14.2 due to spurious Array Out
of Bounds warnings (Warnings Treated as Errors)
…ft#27471)

### Description

Apply the same double-free fix from NvTensorRtRtx EP ([PR
microsoft#27192](microsoft#27192)) to the TRT
EP.

`CreateTensorRTCustomOpDomainList()` owns domains/ops via static
`unique_ptr`s, but `ReleaseTensorRTCustomOpDomain()` was manually
`delete`-ing the same objects through raw pointers — double-free at
program exit.

- `ReleaseTensorRTCustomOpDomain()` → no-op (static `unique_ptr`s own
the lifetime)
- `ReleaseTensorRTCustomOpDomainList()` → `clear()` the reference vector
only
- Added ownership comments to static members matching NvTensorRtRtx EP
style

### Motivation and Context

PR microsoft#27192 review
([thread](microsoft#27192 (comment)))
identified TRT EP has the identical bug pattern that was fixed in
NvTensorRtRtx EP. The TRT EP code was the original source this pattern
was borrowed from. @tianleiwu noted a follow-up PR was needed.


---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
### Description

The `-Warray-bounds` suppression pragma in
`sqnbitgemm_kernel_avx2_int8_blklen32.h` was gated on
`defined(HAS_ARRAY_BOUNDS)`, which is set in `onnxruntime_config.h`.
MLAS never includes that header, so the guard was dead code and the
pragma never fired.

Changed the guard to `#ifdef __clang__`:

```cpp
// Before: HAS_ARRAY_BOUNDS never defined in MLAS TU
#if defined(__clang__) && defined(HAS_ARRAY_BOUNDS)

// After
#ifdef __clang__
```

Note: `__has_warning("-Warray-bounds")` was considered but the C
preprocessor does not short-circuit `&&`, so GCC fails to parse it even
behind `defined(__clang__)`.

### Motivation and Context

Build fails on Intel Mac with Apple Clang 17.0.0
(`-Werror,-Warray-bounds`). Clang raises a false-positive array-bounds
warning on `acc[4..7]` inside an `if constexpr (NCols4 == 8)` branch
that is dead when `NCols4 == 4`.




<details>

<summary>Original prompt</summary>

> 
> ----
> 
> *This section details on the original issue you should resolve*
> 
> <issue_title>[Build] error: array index 4 is past the end of the array
(that has type '__m256[4]') [-Werror,-Warray-bounds]</issue_title>
> <issue_description>### Describe the issue
> 
> Unable to build from main branch
(0768f42 as of time writing this issue)
on Intel Mac
> 
> ```
> /usr/bin/c++ --version
> Apple clang version 17.0.0 (clang-1700.0.13.5)
> Target: x86_64-apple-darwin24.5.0
> Thread model: posix
> InstalledDir: /Library/Developer/CommandLineTools/usr/bin
> ```
> 
> 
> ### Urgency
> 
> _No response_
> 
> ### Target platform
> 
> MacOS
> 
> ### Build script
> 
> ./build.sh --config RelWithDebInfo --build_shared_lib --parallel
--cmake_extra_defines CMAKE_OSX_ARCHITECTURES=x86_64
> 
> ### Error / output
> 
> [ 18%] Building CXX object
CMakeFiles/onnxruntime_mlas.dir/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2.cpp.o
> In file included from
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2.cpp:26:
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1677:49:
error: array index 4 is past the end of the array (that has type
'__m256[4]') [-Werror,-Warray-bounds]
> 1677 | __m128 acc_1 = FoldAccumulators(acc[4], acc[5], acc[6],
acc[7]);
>       |                                                 ^   ~
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1531:13:
note: array 'acc' declared here
>  1531 |             __m256 acc[NCols4];
>       |             ^
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1677:57:
error: array index 5 is past the end of the array (that has type
'__m256[4]') [-Werror,-Warray-bounds]
> 1677 | __m128 acc_1 = FoldAccumulators(acc[4], acc[5], acc[6],
acc[7]);
>       |                                                         ^   ~
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1531:13:
note: array 'acc' declared here
>  1531 |             __m256 acc[NCols4];
>       |             ^
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1677:65:
error: array index 6 is past the end of the array (that has type
'__m256[4]') [-Werror,-Warray-bounds]
> 1677 | __m128 acc_1 = FoldAccumulators(acc[4], acc[5], acc[6],
acc[7]);
> | ^ ~
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1531:13:
note: array 'acc' declared here
>  1531 |             __m256 acc[NCols4];
>       |             ^
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1677:73:
error: array index 7 is past the end of the array (that has type
'__m256[4]') [-Werror,-Warray-bounds]
> 1677 | __m128 acc_1 = FoldAccumulators(acc[4], acc[5], acc[6],
acc[7]);
> | ^ ~
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1531:13:
note: array 'acc' declared here
>  1531 |             __m256 acc[NCols4];
>       |             ^
> 4 errors generated.
> 
> ### Visual Studio Version
> 
> _No response_
> 
> ### GCC / Compiler Version
> 
> Apple clang version 17.0.0 (clang-1700.0.13.5)</issue_description>
> 
> <agent_instructions>Please investigate the build error. If code need
fix, create a pull requests. Otherwise, suggest ways to avoid the build
errors.</agent_instructions>
> 
> ## Comments on the Issue (you are @copilot in this section)
> 
> <comments>
> </comments>
> 


</details>




- Fixes microsoft#27497


---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
There is a build error when using `--use_vcpkg` without
`--use_vcpkg_ms_internal_asset_cache`; the error looks like:
```
C:\code\onnxruntime\cmake\./vcpkg-ports\pybind11: info: installing overlay port from here
Downloading https://github.com/pybind/pybind11/archive/v3.0.2.tar.gz -> pybind-pybind11-v3.0.2.tar.gz
pybind-pybind11-v3.0.2.tar.gz.33772.part: error: download from https://github.com/pybind/pybind11/archive/v3.0.2.tar.gz had an unexpected hash
note: Expected: 786b1bf534ac67a8d5669f8babf67bb13e48b3a3da1b6344e43ae10a84b80bbc8fea5f12a65fd18739c341fefef5622c5dc096db964dff33cc62ea4259b2e2c1
note: Actual  : 19bee2c76320e25202ee078b5680ff8a7acfb33494dec29dad984ab04de8bcb01340d9fec37c8cc5ac9015dfc367e60312dcd8506e66ce8f0af4c49db562ddef
CMake Error at scripts/cmake/vcpkg_download_distfile.cmake:136 (message):
  Download failed, halting portfile.
```

The root cause is that I uploaded a zip file to the cache server. Without
`--use_vcpkg_ms_internal_asset_cache`, vcpkg tries to download the tar.gz
file from GitHub, and its SHA differs from that of the zip file.

In this PR, I configure the portfile to download the zip file instead,
which avoids the issue.
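The mismatch is easy to reproduce outside vcpkg: the hash check simply compares the SHA512 of the downloaded bytes against the recorded value, so the same source packaged in two archive formats can never share a hash. A minimal Python sketch of that idea (the function name and payloads are illustrative, not vcpkg internals):

```python
import hashlib

def verify_distfile(payload: bytes, expected_sha512: str) -> bool:
    # Mirrors the idea behind vcpkg_download_distfile's check: compare the
    # SHA512 of the downloaded bytes against the recorded hash.
    return hashlib.sha512(payload).hexdigest() == expected_sha512

# The same source tree packaged as .zip vs .tar.gz yields different byte
# streams, so a hash recorded for one format never matches the other.
zip_payload = b"PK\x03\x04 pretend zip bytes"      # hypothetical archive
targz_payload = b"\x1f\x8b pretend tar.gz bytes"   # hypothetical archive

zip_sha = hashlib.sha512(zip_payload).hexdigest()
print(verify_distfile(zip_payload, zip_sha))    # True
print(verify_distfile(targz_payload, zip_sha))  # False -> vcpkg halts
```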
### Description
`setCudaGraphStrategy(kWHOLE_GRAPH_CAPTURE)` was present in the dynamic
engine build path (`CreateNodeComputeInfoFromGraph`) but missing from the
precompiled/AOT engine path (`CreateNodeComputeInfoFromEPContext`). Since
TRT RTX defaults the CUDA Graph strategy to kDISABLED, CUDA Graph
capture never occurred when loading precompiled engines. This change applies
the same `setCudaGraphStrategy` call (guarded by the existing TRT_MAJOR_RTX >=
1.3 version check) to the precompiled path so it matches the dynamic path's
behavior.



### Motivation and Context
Fixes [microsoft#27329](microsoft#27329) —
users reported that cudaGraphLaunch was not occurring when using
precompiled (AOT-built) TensorRT-RTX engines, causing individual kernel
launches and unnecessary CPU overhead instead of batched graph
execution.
…7489)

## Summary
- Allow `SkipLayerNormalization` fusion when Add inputs have
broadcast-compatible shapes
- Add `get_skip_index()` to identify skip tensor and ensure correct
input ordering
- Add tests for 2D `(S, H)` and 3D `(1, S, H)` broadcast skip shapes

## Motivation
Fixes microsoft#27488

The Python optimizer's `FusionSkipLayerNormalization` rejects the `Add →
LayerNormalization` fusion when the two Add inputs have different but
broadcast-compatible shapes. The C++ `SkipLayerNormalization` kernel
already supports broadcasting (skip can be 2D or 3D with last two dims
matching input), but the Python fusion used an exact shape equality
check that blocked these cases. This resolves the existing TODO comments
from @tianleiwu.

## Changes
- **`fusion_skiplayernorm.py`**: Added `_is_broadcast_skip()` helper and
`get_skip_index()` method (following the pattern from
`FusionSkipGroupNorm`). Replaced strict `compare_shape()` equality check
with broadcast-aware logic. Used `skip_index` to ensure the full-sized
input goes to position 0 and the skip to position 1 in the fused node.
- **`test_skip_layer_norm_fusion.py`**: Added
`create_broadcast_test_model()` and 4 new test cases covering 2D/3D
broadcast skip shapes in both Add input positions.

## Test Plan
- [x] All 6 existing `test_skip_layer_norm_fusion` tests pass (no
regressions)
- [x] All 4 new broadcast tests pass
- [x] `ruff check` passes on changed files
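
The broadcast rule described above can be sketched in a few lines of Python (the helper name mirrors `_is_broadcast_skip`, but the body here is an illustrative approximation, not the optimizer's exact code):

```python
def is_broadcast_skip(input_shape, skip_shape):
    # Skip may equal the full (B, S, H) input, or be a 2D (S, H) or
    # 3D (1, S, H) tensor whose trailing dims match the input's.
    if list(input_shape) == list(skip_shape):
        return True
    if len(skip_shape) == 2:
        return list(skip_shape) == list(input_shape[-2:])
    if len(skip_shape) == 3:
        return skip_shape[0] == 1 and list(skip_shape[1:]) == list(input_shape[-2:])
    return False

print(is_broadcast_skip([2, 128, 768], [128, 768]))     # True:  2D (S, H)
print(is_broadcast_skip([2, 128, 768], [1, 128, 768]))  # True:  3D (1, S, H)
print(is_broadcast_skip([2, 128, 768], [2, 64, 768]))   # False: S mismatch
```

The fused node then places the full-sized input at position 0 and the skip at position 1, matching what `get_skip_index()` enforces.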
…icrosoft#27518)

### Description
<!-- Describe your changes. -->
Detect and test for a mismatch between the raw data size and the declared
data type and shape of a LoRA adapter parameter.

### Motivation and Context
Disallow maliciously crafted LoRA adapters that lead to heap out-of-bounds
access.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
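The size check itself reduces to one comparison: the raw byte count must equal the element count times the element size. A hedged numpy-based sketch (`validate_param` is an illustrative name, not the actual ORT API):

```python
import numpy as np

def validate_param(raw: bytes, dtype, shape) -> None:
    # Reject a parameter whose raw byte size disagrees with its declared
    # dtype and shape -- the heap-OOB vector this kind of check closes.
    expected = int(np.prod(shape, dtype=np.int64)) * np.dtype(dtype).itemsize
    if len(raw) != expected:
        raise ValueError(f"raw data is {len(raw)} bytes, expected {expected}")

weights = np.zeros((4, 8), dtype=np.float16)        # 32 elements * 2 bytes
validate_param(weights.tobytes(), np.float16, (4, 8))        # ok: 64 bytes
try:
    validate_param(weights.tobytes()[:-2], np.float16, (4, 8))  # truncated
except ValueError as e:
    print("rejected:", e)
```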
…ngs (microsoft#27288)

### Description
Compilation with Clang toolchains on Linux currently fails due to the
warning below (among others), since ONNX Runtime compiles with `-Werror` by
default. This PR addresses `-Winconsistent-missing-override` in the
TRT NV EP.

```
/home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/nv_tensorrt_rtx/nv_execution_provider.h:309:7: error: 'GetDeviceId' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
  309 |   int GetDeviceId() const { return device_id_; }
      |       ^
/home/stephan/projects/onnxruntime-winai/include/onnxruntime/core/framework/execution_provider.h:183:15: note: overridden virtual function is here
  183 |   virtual int GetDeviceId() const { return default_device_.Id(); }
      |               ^
In file included from /home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/nv_tensorrt_rtx/nv_provider_factory.cc:18:
/home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/nv_tensorrt_rtx/nv_execution_provider.h:310:10: error: 'Sync' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
  310 |   Status Sync() const;
      |          ^
/home/stephan/projects/onnxruntime-winai/include/onnxruntime/core/framework/execution_provider.h:231:26: note: overridden virtual function is here
  231 |   virtual common::Status Sync() const { return Status::OK(); }
      |                          ^
/home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/nv_tensorrt_rtx/nv_provider_factory.cc:63:39: error: 'CreateProvider' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
   63 |   std::unique_ptr<IExecutionProvider> CreateProvider(const OrtSessionOptions& session_options,
      |                                       ^
/home/stephan/projects/onnxruntime-winai/include/onnxruntime/core/providers/providers.h:29:47: note: overridden virtual function is here
   29 |   virtual std::unique_ptr<IExecutionProvider> CreateProvider(const OrtSessionOptions& session_options,
      |                                               ^
/home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/nv_tensorrt_rtx/nv_provider_factory.cc:112:46: error: 'CreateExecutionProviderFactory' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
  112 |   std::shared_ptr<IExecutionProviderFactory> CreateExecutionProviderFactory(const void* param) {
      |                                              ^
/home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/shared_library/provider_host_api.h:19:54: note: overridden virtual function is here
   19 |   virtual std::shared_ptr<IExecutionProviderFactory> CreateExecutionProviderFactory(const void* /*provider_options*/) { return nullptr;
/home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/nv_tensorrt_rtx/nv_execution_provider.h:309:7: error: 'GetDeviceId' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
  309 |   int GetDeviceId() const { return device_id_; }
      |       ^
/home/stephan/projects/onnxruntime-winai/include/onnxruntime/core/framework/execution_provider.h:183:15: note: overridden virtual function is here
  183 |   virtual int GetDeviceId() const { return default_device_.Id(); }
```

### Motivation and Context
Fixing these Clang warnings enables builds with Clang on Linux, since
`-Werror` enforces warning-free builds.
… and provider bridge EPs (microsoft#27522)

### Description
Set "library_path" metadata entry in OrtEpDevice instances for plugin
and provider bridge EPs.

### Motivation and Context
Makes the library path available everywhere. Required by GenAI to load the
custom ops library.

microsoft#27496
…rs on Windows (microsoft#27039)

### Description
Disable the cmake prologue in the pch file, which was resulting in warnings
being unexpectedly suppressed in the unit test projects that use precompiled
headers.

For example, C4834 was suppressed, so there was no warning for a nodiscard
`Status` return value not being checked in a Windows build, even though the
same code generates a warning on other platforms.

Update a few tests to resolve warnings that now get triggered.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Warnings were causing build failures for non-Windows CIs; those warnings
should also have been generated for Windows builds.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
Removes YAML that's no longer used.
### Description
Non-required builds fail because the LoRA tests use `ASSERT_THROW` while
RTTI is disabled.

Follow-up to microsoft#27518.
### Motivation and Context
This change is needed because, for a 32B model like Qwen2.5-Coder-32B on the
TensorRT-RTX EP, GenAI produces a config string longer than the current
limit:


https://github.com/microsoft/onnxruntime-genai/blob/3c47932e9d7afa0d44db0b3918e479bbdd4c5353/src/models/model.cpp#L516

Example

```
AddConfigEntry: ep.nvtensorrtrtxexecutionprovider.nv_profile_min_shapes (length=4364) = input_ids:1x1,attention_mask:1x1,past_key_values.0.key:1x8x0x128,past_key_values.0.value:1x8x0x128,past_key_values.1.key:1x8x0x128,past_key_values.1.value:1x8x0x128,past_key_values.2.key:1x8x0x128,past_key_values.2.value:1x8x0x128,past_key_values.3.key:1x8x0x128,past_key_values.3.value:1x8x0x128,past_key_values.4.key:1x8x0x128,past_key_values.4.value:1x8x0x128,past_key_values.5.key:1x8x0x128,past_key_values.5.value:1x8x0x128,past_key_values.6.key:1x8x0x128,past_key_values.6.value:1x8x0x128,past_key_values.7.key:1x8x0x128,past_key_values.7.value:1x8x0x128,past_key_values.8.key:1x8x0x128,past_key_values.8.value:1x8x0x128,past_key_values.9.key:1x8x0x128,past_key_values.9.value:1x8x0x128,past_key_values.10.key:1x8x0x128,past_key_values.10.value:1x8x0x128,past_key_values.11.key:1x8x0x128,past_key_values.11.value:1x8x0x128,past_key_values.12.key:1x8x0x128,past_key_values.12.value:1x8x0x128,past_key_values.13.key:1x8x0x128,past_key_values.13.value:1x8x0x128,past_key_values.14.key:1x8x0x128,past_key_values.14.value:1x8x0x128,past_key_values.15.key:1x8x0x128,past_key_values.15.value:1x8x0x128,past_key_values.16.key:1x8x0x128,past_key_values.16.value:1x8x0x128,past_key_values.17.key:1x8x0x128,past_key_values.17.value:1x8x0x128,past_key_values.18.key:1x8x0x128,past_key_values.18.value:1x8x0x128,past_key_values.19.key:1x8x0x128,past_key_values.19.value:1x8x0x128,past_key_values.20.key:1x8x0x128,past_key_values.20.value:1x8x0x128,past_key_values.21.key:1x8x0x128,past_key_values.21.value:1x8x0x128,past_key_values.22.key:1x8x0x128,past_key_values.22.value:1x8x0x128,past_key_values.23.key:1x8x0x128,past_key_values.23.value:1x8x0x128,past_key_values.24.key:1x8x0x128,past_key_values.24.value:1x8x0x128,past_key_values.25.key:1x8x0x128,past_key_values.25.value:1x8x0x128,past_key_values.26.key:1x8x0x128,past_key_values.26.value:1x8x0x128,past_key_values.27.key:1x8x0x128,past_key_values.27.value:1x8x0
x128,past_key_values.28.key:1x8x0x128,past_key_values.28.value:1x8x0x128,past_key_values.29.key:1x8x0x128,past_key_values.29.value:1x8x0x128,past_key_values.30.key:1x8x0x128,past_key_values.30.value:1x8x0x128,past_key_values.31.key:1x8x0x128,past_key_values.31.value:1x8x0x128,past_key_values.32.key:1x8x0x128,past_key_values.32.value:1x8x0x128,past_key_values.33.key:1x8x0x128,past_key_values.33.value:1x8x0x128,past_key_values.34.key:1x8x0x128,past_key_values.34.value:1x8x0x128,past_key_values.35.key:1x8x0x128,past_key_values.35.value:1x8x0x128,past_key_values.36.key:1x8x0x128,past_key_values.36.value:1x8x0x128,past_key_values.37.key:1x8x0x128,past_key_values.37.value:1x8x0x128,past_key_values.38.key:1x8x0x128,past_key_values.38.value:1x8x0x128,past_key_values.39.key:1x8x0x128,past_key_values.39.value:1x8x0x128,past_key_values.40.key:1x8x0x128,past_key_values.40.value:1x8x0x128,past_key_values.41.key:1x8x0x128,past_key_values.41.value:1x8x0x128,past_key_values.42.key:1x8x0x128,past_key_values.42.value:1x8x0x128,past_key_values.43.key:1x8x0x128,past_key_values.43.value:1x8x0x128,past_key_values.44.key:1x8x0x128,past_key_values.44.value:1x8x0x128,past_key_values.45.key:1x8x0x128,past_key_values.45.value:1x8x0x128,past_key_values.46.key:1x8x0x128,past_key_values.46.value:1x8x0x128,past_key_values.47.key:1x8x0x128,past_key_values.47.value:1x8x0x128,past_key_values.48.key:1x8x0x128,past_key_values.48.value:1x8x0x128,past_key_values.49.key:1x8x0x128,past_key_values.49.value:1x8x0x128,past_key_values.50.key:1x8x0x128,past_key_values.50.value:1x8x0x128,past_key_values.51.key:1x8x0x128,past_key_values.51.value:1x8x0x128,past_key_values.52.key:1x8x0x128,past_key_values.52.value:1x8x0x128,past_key_values.53.key:1x8x0x128,past_key_values.53.value:1x8x0x128,past_key_values.54.key:1x8x0x128,past_key_values.54.value:1x8x0x128,past_key_values.55.key:1x8x0x128,past_key_values.55.value:1x8x0x128,past_key_values.56.key:1x8x0x128,past_key_values.56.value:1x8x0x128,past_key_values.57.key:
1x8x0x128,past_key_values.57.value:1x8x0x128,past_key_values.58.key:1x8x0x128,past_key_values.58.value:1x8x0x128,past_key_values.59.key:1x8x0x128,past_key_values.59.value:1x8x0x128,past_key_values.60.key:1x8x0x128,past_key_values.60.value:1x8x0x128,past_key_values.61.key:1x8x0x128,past_key_values.61.value:1x8x0x128,past_key_values.62.key:1x8x0x128,past_key_values.62.value:1x8x0x128,past_key_values.63.key:1x8x0x128,past_key_values.63.value:1x8x0x128
Traceback (most recent call last):
  File "Convert to NVIDIA TRT for RTX_32B\test_config.py", line 2, in <module>
    model = og.Model("Convert to NVIDIA TRT for RTX_32B\\model")
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Config value is longer than maximum length: 4096
```

---------

Co-authored-by: hualxie <hualxie@microsoft.com>
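The reported length of 4364 characters is exactly what 64 KV-cache layers produce. A quick Python sketch reconstructing the string from the log above (the parameter names follow the log; the helper itself is illustrative):

```python
def build_min_shapes(num_layers: int = 64, num_kv_heads: int = 8,
                     head_dim: int = 128) -> str:
    # Reconstruct the nv_profile_min_shapes value from the log above:
    # two scalar inputs plus one key and one value entry per layer.
    parts = ["input_ids:1x1", "attention_mask:1x1"]
    for i in range(num_layers):
        parts.append(f"past_key_values.{i}.key:1x{num_kv_heads}x0x{head_dim}")
        parts.append(f"past_key_values.{i}.value:1x{num_kv_heads}x0x{head_dim}")
    return ",".join(parts)

s = build_min_shapes()
print(len(s))  # 4364 -- well past the previous 4096-character limit
```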
…t#27541)

### Description

Returns a ConfigOptions object instead of a const reference.
@Jaswanth51 Jaswanth51 requested a review from ankitm3k March 4, 2026 09:59
@ankitm3k ankitm3k merged commit 584726d into ovep-develop Mar 4, 2026
7 of 8 checks passed
@ankitm3k ankitm3k deleted the sync_msft_04032026 branch March 4, 2026 14:11