
Sync with Microsoft ONNX Runtime - 04/03/2026 #957

Merged
ankitm3k merged 31 commits into ovep-develop from sync_msft_04032026
Mar 4, 2026

Conversation

@Jaswanth51

Synchronizing intel/onnxruntime ovep-develop branch with latest changes from microsoft/onnxruntime master branch.

Rishi-Dave and others added 30 commits February 27, 2026 06:58
…sion (microsoft#27469)

## Summary
- Fix Cast node naming collisions in `convert_float_to_float16` when
nodes have empty names (common in PyTorch exports)
- Fix `ALWAYS_FLOAT_INPUTS` for opset-10 Resize, where the scales input
at index 1 was unprotected
- Add dedicated test suite for float16 conversion (`test_float16.py`, 8
tests)

## Motivation
Fixes microsoft#14827

When `convert_float_to_float16` processes models with unnamed nodes
(empty `node.name`, very common in PyTorch/TensorFlow-exported ONNX
models), the generated Cast node names collide. For example, multiple
Resize nodes all produce Cast nodes named `"_input_cast_2"` and output
tensors named `"_input_cast_2"`, corrupting the graph with duplicate
names.

Additionally, the `ALWAYS_FLOAT_INPUTS` dict only protected Resize
scales at index 2 (opset 11+ layout: `[X, roi, scales, sizes]`), but
opset 10 Resize has scales at index 1 (`[X, scales]`), leaving it
unprotected.

## Changes
**`onnxruntime/python/tools/transformers/float16.py`** (11 lines
changed):
- Use unique tensor names (`input_name`/`output`) as the base for
generated Cast node and output names, instead of potentially-empty
`node.name`
- Add index 1 to `ALWAYS_FLOAT_INPUTS["Resize"]` to protect opset 10
scales
- Fix misleading comment ("change current node's input name" → "output
name")
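The renaming scheme above can be sketched as follows (a hypothetical helper for illustration only; the actual logic lives in `float16.py`):

```python
import itertools

# Illustrative sketch: derive Cast node names from the (unique) tensor
# name rather than node.name, which is often empty in PyTorch/TensorFlow
# exports. A counter resolves any residual collisions.
_counter = itertools.count()

def make_cast_name(tensor_name: str, used_names: set) -> str:
    base = f"{tensor_name}_input_cast"
    name = base
    while name in used_names:
        name = f"{base}_{next(_counter)}"
    used_names.add(name)
    return name
```

Because tensor names must already be unique within a valid ONNX graph, basing the Cast name on them avoids the `"_input_cast_2"` collisions described above.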

**`onnxruntime/test/python/transformers/test_float16.py`** (new file, 8
tests):
- `test_resize_opset11_cast_naming_unique` — multiple unnamed Resize
nodes produce unique Cast names
- `test_resize_opset11_scales_initializer_stays_fp32` — scales
initializer preserved as float32
- `test_resize_opset10_scales_initializer_stays_fp32` — opset 10 scales
protected at index 1
- `test_resize_opset10_multiple_unnamed_unique_names` — opset 10 naming
uniqueness
- `test_blocked_node_cast_naming_unique` — blocked op nodes (Upsample)
also get unique Cast names
- `test_resize_with_op_block_list` — Resize in op_block_list still
produces unique names
- `test_data_input_converted_to_fp16` — data tensor correctly converts
to fp16
- `test_force_fp16_initializers` — force flag overrides protection

## Test Plan
- All 8 new tests pass locally (`python -m unittest
test_float16.TestFloat16Conversion -v`)
- Existing `test_gpt2_past_fp16` test passes (no regression in existing
float16 behavior)
- `ruff check` passes on both files
…icrosoft#27458)

## Summary
- Add SDPA-aware pattern matching to `FusionBartAttention` so that BART
attention fusion succeeds on models exported with HuggingFace
Transformers >= 4.49
- Add a synthetic BART SDPA graph generator and unit test

## Motivation
Fixes microsoft#23864

HuggingFace Transformers >= 4.49 replaced `BartAttention` with
`BartSdpaAttention` ([commit
`2c47618`](huggingface/transformers@2c47618)),
changing the ONNX export graph topology in several ways that broke
`FusionBartAttention` pattern matching. Running `optimize_model(...,
model_type="bart")` on these newer exports produces **zero** fused
Attention nodes.

## Changes

### `fusion_bart_attention.py`

The SDPA refactor introduces four structural changes to the attention
subgraph. Each required a new match path:

1. **QKV output path — LayerNormalization anchor fallback**
For SDPA models, symbolic shape inference often fails, which prevents
SkipLayerNormalization fusion. When the anchor node is a plain
`LayerNormalization` instead of `SkipLayerNormalization`, there's an
extra residual `Add` between the LayerNorm and the attention output
projection. Added a fallback match: `["Add", "Add", "MatMul", "Reshape",
"Transpose", "MatMul"]` with `[0, None, 0, 0, 0, 0]`.

2. **QK path — NaN guard (Where + IsNaN)**
SDPA wraps the Softmax output in a NaN guard: `Where(IsNaN(softmax),
0.0, softmax)`. The `Where` node's input[2] is the Softmax output. Added
two new QK paths:
   - No mask: `["Where", "Softmax", "MatMul"]` with `[0, 2, 0]`
   - With mask: `["Where", "Softmax", "Add", "MatMul"]` with `[0, 2, 0, 0]`

3. **Q and K scaling paths**
Instead of a single combined scale on the QK MatMul output, SDPA applies
separate `Mul(1/sqrt(head_dim))` to Q and K before the QK MatMul. Added:
- Q path: `["Mul", "Transpose", "Reshape", "Add", "MatMul"]` with `[0,
0, 0, 0, None]`
- K path: `["Mul", "Reshape", "Transpose", "Reshape", "Transpose",
"Reshape", "Add", "MatMul"]` with `[1, 0, 0, 0, 0, 0, 0, None]` (K^T
uses a `Reshape→Transpose(0,2,1)→Reshape` chain)

4. **num_heads fallback for dynamic shapes**
SDPA models use `-1` in reshape shape tensors for dynamic dimensions,
causing `get_num_heads_and_hidden_size` to return negative values. Added
a fallback to user-specified `num_heads`/`hidden_size` when detected
values are invalid.
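The match paths above pair an op-type sequence with an input-index sequence (None = search all inputs). A simplified standalone sketch of that parent-path walk, not the actual Fusion API:

```python
# Hypothetical re-implementation of the parent-path matching idea used by
# the transformer fusions. Nodes are dicts; `producers` maps a tensor
# name to the node that produces it.
def match_parent_path(node, op_types, input_indices, producers):
    """Walk upward from `node`: at each step follow the given input index
    (None = try all inputs) and require the parent's op type to match."""
    path = []
    current = node
    for op_type, idx in zip(op_types, input_indices):
        candidates = current["inputs"] if idx is None else [current["inputs"][idx]]
        parent = next(
            (producers[t] for t in candidates
             if t in producers and producers[t]["op"] == op_type),
            None,
        )
        if parent is None:
            return None  # pattern does not match this subgraph
        path.append(parent)
        current = parent
    return path
```

For example, the no-mask QK path `["Where", "Softmax", "MatMul"]` with `[0, 2, 0]` walks from an anchor through its input 0 to a `Where`, through the `Where`'s input 2 (the NaN-guarded Softmax output) to a `Softmax`, then to the QK `MatMul`.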

### `bart_model_generator.py` (new)

Synthetic BART SDPA attention graph generator that builds a minimal but
complete attention subgraph matching the SDPA topology. Tests both
`with_mask=True` (decoder self-attention) and `with_mask=False` (encoder
attention) variants.

### `test_attention_fusion.py`

Added `test_bart_attention_sdpa_fusion` that verifies:
- 1 Attention node is produced for each mask variant
- Correct `num_heads` attribute
- Correct `unidirectional` attribute (1 for decoder self-attention with
mask, 0 for encoder)

## Test Plan
- [x] `python -m pytest test_attention_fusion.py -v` — all 10 tests pass
- [x] `lintrunner` on all 3 changed files — no issues
- [x] Verified on real exported BART SDPA model
(`hf-internal-testing/tiny-random-bart`): 2 Attention nodes fused, graph
reduced from 120 → 34 nodes
…osoft#27428)

Replace and reland microsoft#27129

A comparison between this PR's approach (pre-converting the mask) and an inline-in-softmax approach:

## Tradeoffs

| Category | Pre-conversion (current) | Inline in softmax |
| :--- | :--- | :--- |
| **Memory** | Extra buffer ($num\_elements \times sizeof(T)$) | None —
reads 1-byte bool directly |
| **Kernel launches** | +1 simple elementwise kernel | Zero extra |
| **Code complexity** | 3 files, ~40 lines added | 6+ kernel templates,
macros, dispatch logic, data structs |
| **Risk** | Low — softmax path untouched | High — modifying
battle-tested softmax kernels used by MHA + GQA contrib ops |
| **Perf impact** | Negligible — mask is small vs. QKV; conversion is
memory-bound and fast | Slightly better theoretical bandwidth |
| **Maintainability** | Clean separation of concerns | Adds template
dimension across all softmax variants |

---

This pull request enhances the ONNX Runtime CUDA Attention operator to
support boolean attention masks (bool masks) in the Multi-Head Attention
(MHA) path, converting them to additive attention bias on the GPU. It
also improves test coverage to ensure correctness and parity with the
CPU implementation. The main changes include implementing a CUDA kernel
for mask conversion, updating the operator logic to handle bool masks,
clarifying broadcasting rules, and adding comprehensive unit tests.

**CUDA Attention Operator Improvements:**

* Implemented a CUDA kernel (`LaunchConvertBoolMaskToAttentionBias`)
that converts boolean attention masks to additive bias (True → 0.0,
False → mask_filter_value) for the MHA path, ensuring efficient GPU
execution.
[[1]](diffhunk://#diff-00f7d49ccee44f1573357c07633bd03f21b9c2e1b1617c7a6a878a79ee6a6e49R148-R187)
[[2]](diffhunk://#diff-8aa9a15a92d7dc138346dce5de055911895d940ba2183b4ba45bd95ac0e5bfc9R55-R66)
* Updated `attention.cc` to use this kernel, correctly handle bool masks
in the MHA path, and clarified the broadcasting logic and mask shape
interpretation for both GQA and MHA.
[[1]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffR6)
[[2]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffR380-R383)
[[3]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL514-L522)
[[4]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL549-R557)
[[5]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffR595-R616)
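The conversion semantics (True → 0.0, False → mask_filter_value) can be emulated on the host in NumPy; this is a sketch of what the CUDA kernel computes, not the kernel itself, and the −10000.0 default is assumed from the operator's usual `mask_filter_value`:

```python
import numpy as np

# NumPy emulation of the bool-mask -> additive-bias conversion performed
# by LaunchConvertBoolMaskToAttentionBias on the GPU (for clarity only).
def bool_mask_to_attention_bias(mask: np.ndarray,
                                mask_filter_value: float = -10000.0) -> np.ndarray:
    # True  -> 0.0 (position attended to)
    # False -> mask_filter_value (position masked out)
    return np.where(mask, 0.0, mask_filter_value).astype(np.float32)
```

The resulting bias is simply added to the QK logits before softmax, so masked positions receive a large negative score.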

**Testing and Documentation Enhancements:**

* Added new test cases and a dedicated test class to validate the
correctness of boolean mask handling in the MHA path, ensuring parity
with the CPU implementation for 2D, 3D, and 4D mask shapes.
[[1]](diffhunk://#diff-801fbbcf2537e8e13a0202e6a0f7e88c56ab5aa72d17d949a5556355694b2b2dR563-R725)
[[2]](diffhunk://#diff-801fbbcf2537e8e13a0202e6a0f7e88c56ab5aa72d17d949a5556355694b2b2dR893-R922)
* Improved comments and documentation in both code and tests to clarify
ONNX broadcasting rules and mask shape expectations for different
attention paths.
[[1]](diffhunk://#diff-4ed1461afda0d3804a61ba95a64b2a84d0c1395f9c887d1a3fdfed914ade22c1L208-R221)
[[2]](diffhunk://#diff-801fbbcf2537e8e13a0202e6a0f7e88c56ab5aa72d17d949a5556355694b2b2dR35)

**Test Coverage and Reliability:**

* Enabled CUDA-based tests for boolean mask scenarios previously only
tested on CPU, and adjusted test logic to ensure correct handling of
edge cases (e.g., all-false masks).
[[1]](diffhunk://#diff-3ff6dfa2ce407ae0073009174c37d1756509e8bbc434dee7c44cd55a996bb777L477-R480)
[[2]](diffhunk://#diff-3ff6dfa2ce407ae0073009174c37d1756509e8bbc434dee7c44cd55a996bb777L620-R623)

These changes make the CUDA Attention operator more robust and
feature-complete, aligning its behavior with the CPU implementation and
ONNX specifications.
…icrosoft#27470)

Heap-allocate `WebGpuContextFactory::contexts_` to avoid crashes during
static destruction when dependent DLLs (e.g. dxcompiler.dll) have
already been unloaded. `Cleanup()` explicitly deletes the map; on
abnormal termination it safely leaks.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…27480)

### Description

### Motivation and Context

Resolves ongoing security issues with auditing the jar testing pipeline
in the backend.
…7478)

### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
…icrosoft#27474)

### Description
Improves the error message returned by ORT when loading a compiled model
in a session that does not have the required execution provider(s).

For example, ORT now returns the following error when creating session
with a model compiled explicitly for OpenVINO EP without adding OpenVINO
EP to the session:

> EPContext node generated by 'OpenVINOExecutionProvider' is not
compatible with any execution provider added to the session. EPContext
node name: 'EPContextNode0'. Available session execution providers:
[CPUExecutionProvider].

Compare the above message with the more generic message that is
currently returned:

>Could not find an implementation for EPContext(1) node with name
'EPContextNode0'

### Motivation and Context
Improves diagnosability when loading of pre-compiled models fails.
Specifically, the ambiguity of the original message led to many hours
spent debugging an error where a compiled model failed to run because
the expected `OrtEpDevice` was inadvertently not added to a session.
… compatibility (microsoft#27484)

This pull request refactors validation logic for CUDA attention masks
and tensor scatter operations to move error checking from host-side
(CPU) to device-side (GPU) using CUDA kernel assertions
(`CUDA_KERNEL_ASSERT`). This change eliminates synchronous host-device
memory transfers and stream synchronizations, improving performance and
simplifying code. Corresponding test cases are updated to only expect
validation failures on the CPU, as CUDA errors are now asynchronous.

Key changes:

**Attention mask validation (GQA path):**
- Removes host-side validation and memory copies for boolean attention
masks in `attention.cc`; mask validity (right-padding, contiguous
True/False) is now checked asynchronously via `CUDA_KERNEL_ASSERT` in
the CUDA kernel.
[[1]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL385-L387)
[[2]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL414-L418)
[[3]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL427-L448)
- Updates the CUDA kernel and its interface to drop the
`validation_result` buffer and rely on device assertions for mask
validation. Documentation is updated to reflect this asynchronous error
checking.
[[1]](diffhunk://#diff-00f7d49ccee44f1573357c07633bd03f21b9c2e1b1617c7a6a878a79ee6a6e49L10-R17)
[[2]](diffhunk://#diff-00f7d49ccee44f1573357c07633bd03f21b9c2e1b1617c7a6a878a79ee6a6e49L34)
[[3]](diffhunk://#diff-00f7d49ccee44f1573357c07633bd03f21b9c2e1b1617c7a6a878a79ee6a6e49L81-R76)
[[4]](diffhunk://#diff-00f7d49ccee44f1573357c07633bd03f21b9c2e1b1617c7a6a878a79ee6a6e49L104-R92)
[[5]](diffhunk://#diff-00f7d49ccee44f1573357c07633bd03f21b9c2e1b1617c7a6a878a79ee6a6e49L118)
[[6]](diffhunk://#diff-00f7d49ccee44f1573357c07633bd03f21b9c2e1b1617c7a6a878a79ee6a6e49L137)
[[7]](diffhunk://#diff-8aa9a15a92d7dc138346dce5de055911895d940ba2183b4ba45bd95ac0e5bfc9L37-L45)

**TensorScatter write_indices validation:**
- Removes host-side validation and synchronization for `write_indices`
in `tensorscatter.cc`; index bounds checking is now performed
asynchronously inside the CUDA kernel via `CUDA_KERNEL_ASSERT`.
[[1]](diffhunk://#diff-d69233ff3987fe3093132a31710b6b64cc0a32140e2a5a415a2f1f0907bd22d2L75-R76)
[[2]](diffhunk://#diff-1694a04b8ba9963cc06d651ec6a3be8aa9cb2bcb73c2438dc251ca8cdcb2eb41L31-R37)

**Test updates:**
- Updates negative test cases for `TensorScatter` to run only on CPU,
since CUDA now validates asynchronously and will not synchronously
return errors to the host.
[[1]](diffhunk://#diff-8c90e642cc0cf4e68b2f3d4e4b3f1e21bf6d07f01663d424bc52c75ad0db2dfeR300)
[[2]](diffhunk://#diff-8c90e642cc0cf4e68b2f3d4e4b3f1e21bf6d07f01663d424bc52c75ad0db2dfeL311-R319)
[[3]](diffhunk://#diff-8c90e642cc0cf4e68b2f3d4e4b3f1e21bf6d07f01663d424bc52c75ad0db2dfeL327-R339)
[[4]](diffhunk://#diff-8c90e642cc0cf4e68b2f3d4e4b3f1e21bf6d07f01663d424bc52c75ad0db2dfeL342-R354)
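The invariant the kernel now asserts on-device (each mask row is "right-padded": a contiguous run of True followed only by False) can be expressed as a host-side sketch:

```python
import numpy as np

# Host-side sketch of the invariant CUDA_KERNEL_ASSERT now checks
# on-device for boolean key-padding masks in the GQA path.
def is_right_padded(mask: np.ndarray) -> bool:
    for row in np.atleast_2d(mask):
        seen_false = False
        for v in row:
            if v and seen_false:
                return False  # a True after a False breaks contiguity
            seen_false = seen_false or not v
    return True
```

On the GPU this check runs asynchronously, which is why the negative tests were moved to CPU: a violated assertion no longer surfaces as a synchronous host-side error.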
…crosoft#27136)

### Description
Introduces a backend kernel selector config struct in MLAS that allows
users to configure selection of backend kernels at runtime based on
their preference. The immediate use-case of such a feature is to allow
users to opt-out of using/selecting KleidiAI kernels should they choose
to do so on ARM platforms. This solution should scale to other kernel
implementation backends in the future.

### Motivation and Context
Allow users to opt-out of using/selecting KleidiAI kernels should they
choose to do so on ARM platforms

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
### Description

This PR adds a few headers for supporting building WebGPU EP and CUDA EP
as plugin EPs.

See summary of microsoft#26907
…lation… (microsoft#27359)

### Description
This change optimises the GridSample operator in ONNX Runtime.
1. Added a fast path for GridSample nodes with the characteristics found
in the camera-based 3D object detection model in the MLPerf Automotive
suite: coordinates are transformed from input to output with 2D
interpolation mode = linear, padding mode = zeros, and align_corners = 0.
Linear interpolation: For each (x, y), the code locates the four
surrounding integer pixel centers:

(x1, y1) = (floor(x), floor(y)) (top-left)
(x2, y2) = (x1 + 1, y1 + 1) (bottom-right)
The interpolation weights reflect the fractional positions:

dx1 = x - x1, dx2 = x2 - x
dy1 = y - y1, dy2 = y2 - y 
The resulting value is the bilinear blend dy2 * (dx2 * p11 + dx1 * p12)
+ dy1 * (dx2 * p21 + dx1 * p22) where p11…p22 are the input pixels at
those four neighbor coordinates.


Padding mode = zeros: Any neighbor index that falls outside [0, W_in-1]
× [0, H_in-1] contributes 0 to the interpolation.
Each output pixel (oy, ox) carries normalized coordinates (nx, ny) in
[-1, 1]. With align_corners=0, nx = -1 corresponds to a location half a
pixel before the leftmost input column (i.e., x = -0.5), and nx = 1
corresponds to half a pixel beyond the rightmost column (x = W_in -
0.5). Same idea vertically for ny.

Fast path optimisation: the implementation precomputes all neighbor
indices/weights for each output pixel once (they depend only on the
grid), then reuses them for every channel. Previously, indices and
weights were recalculated inside the loops, which can amount to
H_out*W_out (about 20,000 per batch element in one case) × 32 channels
evaluations.
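The precompute-then-reuse scheme can be sketched in NumPy (hypothetical function names; align_corners=0 denormalization and zeros padding as described above):

```python
import numpy as np

# Sketch of the fast path: precompute the four neighbor indices and
# fractional weights per output pixel once, then reuse the plan for
# every channel (the plan depends only on the grid).
def precompute_plan(grid, h_in, w_in):
    # grid: (H_out, W_out, 2) holding normalized (nx, ny) in [-1, 1]
    nx, ny = grid[..., 0], grid[..., 1]
    # align_corners=0 denormalization: -1 -> -0.5, +1 -> size - 0.5
    x = ((nx + 1.0) * w_in - 1.0) / 2.0
    y = ((ny + 1.0) * h_in - 1.0) / 2.0
    x1, y1 = np.floor(x).astype(int), np.floor(y).astype(int)
    dx1, dy1 = x - x1, y - y1  # fractional offsets toward x2/y2
    return x1, y1, dx1, dy1

def sample_channel(ch, x1, y1, dx1, dy1):
    h_in, w_in = ch.shape
    def fetch(yi, xi):  # padding_mode=zeros: out-of-range reads give 0
        valid = (xi >= 0) & (xi < w_in) & (yi >= 0) & (yi < h_in)
        return np.where(valid, ch[np.clip(yi, 0, h_in - 1),
                                  np.clip(xi, 0, w_in - 1)], 0.0)
    p11, p12 = fetch(y1, x1), fetch(y1, x1 + 1)
    p21, p22 = fetch(y1 + 1, x1), fetch(y1 + 1, x1 + 1)
    dx2, dy2 = 1.0 - dx1, 1.0 - dy1
    # bilinear blend from the description above
    return dy2 * (dx2 * p11 + dx1 * p12) + dy1 * (dx2 * p21 + dx1 * p22)
```

`precompute_plan` runs once per grid; `sample_channel` is then called once per channel with the same plan, mirroring the per-output-location plan entries described in the Motivation section.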


2. Optional ARM NEON vectorization added:
- vld1_f32(ptr): loads two contiguous float values into a float32x2_t.
Used to read the top and bottom neighbor
    pairs ([p11, p12], [p21, p22]).
  - vcombine_f32(low, high): concatenates two float32x2_t values into
one float32x4_t, giving [p11, p12, p21, p22].
  - vdup_n_f32(val): duplicates a scalar float into both lanes of a
float32x2_t.
  - vset_lane_f32(val, vec, lane): writes val into the specified lane of
a float32x2_t, letting us form [w11, w12] and
    [w21, w22].
  - vmulq_f32(a, b): multiplies two float32x4_t vectors element-wise
(neighbor pixels × weights).
  - vget_low_f32(vec) / vget_high_f32(vec): extract the lower or upper 2
lanes from a float32x4_t as float32x2_t.
  - vadd_f32(a, b): adds two float32x2_t vectors element-wise (forming
partial sums).
  - vpadd_f32(a, b): performs pairwise adds within and across two
float32x2_t vectors, collapsing four elements down
    to two.
  - vget_lane_f32(vec, lane): reads a scalar from a specific lane,
giving the final interpolated value.
Most of the performance uplift comes from the first optimisation; the
NEON vectorization also contributes, but less.
Overall performance improvement :

1 thread :
<img width="902" height="766" alt="image"
src="https://github.com/user-attachments/assets/d1fadc6d-370d-4750-baee-1123c7d18af3"
/>


 2 threads:
<img width="902" height="766" alt="image"
src="https://github.com/user-attachments/assets/69c86fd6-815a-4b52-8f86-615f1c99bf0a"
/>



### Motivation and Context
The fast path handles denormalisation of the linear coordinates and can
handle the derivation of the indices by precomputing a separate plan
entry per output pixel. In PrecomputeBilinearSamplePlan2D, the loop runs
across all H_out * W_out points, using the right nx/ny for each (oy, ox)
and storing that point’s four indices, four weights, and mask in
plans[idx].
During evaluation, EvaluatePlanForChannel iterates through the same
point_count(H_out*W_out) and uses the matching plan entry for each (oy,
ox). So we are not reusing one plan across different spatial positions;
we precompute one plan per output location and reuse it only across
channels (which share the same grid).

---------

Signed-off-by: melkap01 <melike.kaptan@arm.com>
Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
microsoft#27483)

### Description
This PR reduces the size of the memory allocation for expected outputs
from ~4GiB to ~2GiB in the Gather_overflow_check test. The updated test
still verifies that the integer overflow fix from PR
microsoft#27444 is valid. That is,
that the CPU Gather operator correctly handles output tensors with
element counts that exceed INT32_MAX.

Changes:

- Reduced the test dimension from 65537 to 46341 (output shape from
65537×65537 to 46341×46341), giving a total element count just over
INT32_MAX, as required to exercise the bug fix.
  - The peak memory usage is reduced to ~4GiB + overhead.
- Increase Android emulator memory to 5GiB (from 4GiB) to be able to run
the test.

### Motivation
Android CI fails to run the unit test introduced in
microsoft#27444 due to memory usage
that exceeds the Android emulator's default memory of 4GiB. This PR
lowers the peak memory usage of the unit test and increases the Android
emulator's memory by 1GiB.
…() (microsoft#27295)

Remove unnecessary s_kernel_registry_vitisaiep.reset() call in
deinitialize_vitisai_ep() function. The kernel registry will be
repopulated on next initialization, making this reset redundant.
…hains (microsoft#27391)

### Description
Profiling shows that CheckIfSubtreesAreEqual is invoked recursively for
many node pairs for LightGBM models with categorical features. A
significant portion of this work consists of self-comparisons (left_id
== right_id), leading to effectively O(n²) comparing trees to themselves
during model loading.

This change adds a fast-path for trivial equality, avoiding unnecessary
recursive comparisons.

Example results:
 - model with 7K BRANCH_EQ nodes: 527 ms → 47 ms (~11× faster)
 - model with 106K BRANCH_EQ nodes: 141 s → 80 ms (~1760× faster)
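The fast path is a one-line identity check before the structural recursion; a sketch with a hypothetical node layout (the real code is the TreeEnsemble model loader):

```python
# Sketch of the fast path: identical node ids trivially denote equal
# subtrees, so the recursive structural comparison is skipped entirely.
def subtrees_equal(left_id, right_id, nodes):
    if left_id == right_id:  # fast path added by this change
        return True
    l, r = nodes[left_id], nodes[right_id]
    if (l["mode"], l["value"]) != (r["mode"], r["value"]):
        return False
    if l["mode"] == "LEAF":
        return True
    return (subtrees_equal(l["true"], r["true"], nodes)
            and subtrees_equal(l["false"], r["false"], nodes))
```

Since the profiled hot path was dominated by self-comparisons (left_id == right_id), the identity check alone removes most of the O(n²) work.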

### Motivation and Context
We have some LightGBM-exported models that make heavy use of categorical
features and exhibit extremely slow load times (minutes for a single
2.5 MB model).

Here's a diagram to illustrate the issue:
<img width="1008" height="1229" alt="image"
src="https://github.com/user-attachments/assets/348e16cb-9eec-448f-ac5c-e1edb60e2a3d"
/>

The 106K model has much longer "member of" chains, with chains that lead
into more chains:

<details>
  <summary>"trees"</summary>
  
<img width="1405" height="593" alt="image"
src="https://github.com/user-attachments/assets/12f0c43f-5987-4b33-9001-2a2b526e537f"
/>

  
</details>

Interestingly, we also tried the new onnx.ml opset 5 node that has
MEMBER, but it seemed even slower, as it recreates these BRANCH_EQ
chains.
)

### Description
Add support for pre-layer normalization (pre-LN) transformer
architectures in the Python attention fusion optimizer.

### Motivation and Context
Fixes microsoft#11684

Pre-LN models (used in GPT-3, ViT variants, and many modern
architectures) apply LayerNormalization **before** attention rather than
after. The first block of a pre-LN model has no `Add` node before its
first `LayerNormalization` — its input comes directly from a graph
input. This caused `FusionAttention.fuse()` to bail out early because it
assumed every `LayerNormalization` anchor has an `Add` parent (the
residual connection from the previous block).

This PR makes four surgical changes to `fusion_attention.py` so that
pre-LN first-block models fuse correctly, while preserving all existing
post-LN behavior:

1. **Allow LN with graph-input parent** — instead of returning early
when no `Add` parent is found, check whether the input is a graph input
and continue
2. **Include graph inputs in residual collection** — the `other_inputs`
loop previously skipped anything not in `output_name_to_node`; graph
inputs are now recognized
3. **Extend child-LN resolution to SkipLN anchors** — after
`fuse_skip_layer_norm()` runs, the anchor becomes
`SkipLayerNormalization`; the redirect from `root_input` (graph input)
to the first LN's output now fires for SkipLN anchors too
4. **Guard `output_name_to_node` lookup** — graph inputs are not node
outputs, so the dictionary access is now guarded
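Change 1 and the guarded lookup of change 4 can be sketched together (hypothetical helper name; not the actual Fusion API):

```python
# Sketch: accept a LayerNormalization anchor whose input is a graph input
# (pre-LN first block) instead of requiring an Add parent (post-LN).
def resolve_root_input(ln_node, output_name_to_node, graph_input_names):
    ln_input = ln_node["inputs"][0]
    parent = output_name_to_node.get(ln_input)  # guarded lookup (change 4)
    if parent is not None and parent["op"] == "Add":
        return ln_input  # post-LN: residual Add parent, as before
    if ln_input in graph_input_names:
        return ln_input  # pre-LN first block: input comes from the graph
    return None          # genuinely unsupported anchor
```

A graph input is never a key of `output_name_to_node` (it is not produced by any node), which is why the unguarded dictionary access failed for pre-LN first blocks.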

### Changes
- `onnxruntime/python/tools/transformers/fusion_attention.py` — 4
targeted edits to `FusionAttention.fuse()`
- `onnxruntime/test/python/transformers/bert_model_generator.py` — new
`create_bert_attention_pre_ln()` test graph generator
- `onnxruntime/test/python/transformers/test_attention_fusion.py` — new
`test_attention_fusion_pre_ln()` test

### Test Plan
- [x] New unit test `test_attention_fusion_pre_ln` passes — verifies
`Attention` fused op appears in the optimized graph
- [x] Lintrunner passes on all changed files (no lint issues)
- [x] Changes are minimal and scoped to the pre-LN first-block gap
…osoft#27490)

### Description
<!-- Describe your changes. -->

Split out Linux CUDA Python package builds into separate stages.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Reduce overall packaging pipeline time by running Python version builds
in separate stages, allowing them to run in parallel.

Example build:
https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=1102253&view=results
Reduced time from ~3h30m to 1h38m.
…ompiler on Linux builds (microsoft#27454)

### Description
Suppress spurious Array Out of Bounds warnings produced by GCC 14.2
compiler on Linux builds

### Motivation and Context
Linux build fails when compiled with GCC 14.2 due to spurious Array Out
of Bounds warnings (Warnings Treated as Errors)
…ft#27471)

### Description

Apply the same double-free fix from NvTensorRtRtx EP ([PR
microsoft#27192](microsoft#27192)) to the TRT
EP.

`CreateTensorRTCustomOpDomainList()` owns domains/ops via static
`unique_ptr`s, but `ReleaseTensorRTCustomOpDomain()` was manually
`delete`-ing the same objects through raw pointers — double-free at
program exit.

- `ReleaseTensorRTCustomOpDomain()` → no-op (static `unique_ptr`s own
the lifetime)
- `ReleaseTensorRTCustomOpDomainList()` → `clear()` the reference vector
only
- Added ownership comments to static members matching NvTensorRtRtx EP
style

### Motivation and Context

PR microsoft#27192 review
([thread](microsoft#27192 (comment)))
identified TRT EP has the identical bug pattern that was fixed in
NvTensorRtRtx EP. The TRT EP code was the original source this pattern
was borrowed from. @tianleiwu noted a follow-up PR was needed.


---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
### Description

The `-Warray-bounds` suppression pragma in
`sqnbitgemm_kernel_avx2_int8_blklen32.h` was gated on
`defined(HAS_ARRAY_BOUNDS)`, which is set in `onnxruntime_config.h`.
MLAS never includes that header, so the guard was dead code and the
pragma never fired.

Changed the guard to `#ifdef __clang__`:

```cpp
// Before: HAS_ARRAY_BOUNDS never defined in MLAS TU
#if defined(__clang__) && defined(HAS_ARRAY_BOUNDS)

// After
#ifdef __clang__
```

Note: `__has_warning("-Warray-bounds")` was considered but the C
preprocessor does not short-circuit `&&`, so GCC fails to parse it even
behind `defined(__clang__)`.

### Motivation and Context

Build fails on Intel Mac with Apple Clang 17.0.0
(`-Werror,-Warray-bounds`). Clang raises a false-positive array-bounds
warning on `acc[4..7]` inside an `if constexpr (NCols4 == 8)` branch
that is dead when `NCols4 == 4`.




<details>

<summary>Original prompt</summary>

> 
> ----
> 
> *This section details on the original issue you should resolve*
> 
> <issue_title>[Build] error: array index 4 is past the end of the array
(that has type '__m256[4]') [-Werror,-Warray-bounds]</issue_title>
> <issue_description>### Describe the issue
> 
> Unable to build from main branch
(0768f42 as of time writing this issue)
on Intel Mac
> 
> ```
> /usr/bin/c++ --version
> Apple clang version 17.0.0 (clang-1700.0.13.5)
> Target: x86_64-apple-darwin24.5.0
> Thread model: posix
> InstalledDir: /Library/Developer/CommandLineTools/usr/bin
> ```
> 
> 
> ### Urgency
> 
> _No response_
> 
> ### Target platform
> 
> MacOS
> 
> ### Build script
> 
> ./build.sh --config RelWithDebInfo --build_shared_lib --parallel
--cmake_extra_defines CMAKE_OSX_ARCHITECTURES=x86_64
> 
> ### Error / output
> 
> [ 18%] Building CXX object
CMakeFiles/onnxruntime_mlas.dir/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2.cpp.o
> In file included from
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2.cpp:26:
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1677:49:
error: array index 4 is past the end of the array (that has type
'__m256[4]') [-Werror,-Warray-bounds]
> 1677 | __m128 acc_1 = FoldAccumulators(acc[4], acc[5], acc[6],
acc[7]);
>       |                                                 ^   ~
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1531:13:
note: array 'acc' declared here
>  1531 |             __m256 acc[NCols4];
>       |             ^
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1677:57:
error: array index 5 is past the end of the array (that has type
'__m256[4]') [-Werror,-Warray-bounds]
> 1677 | __m128 acc_1 = FoldAccumulators(acc[4], acc[5], acc[6],
acc[7]);
>       |                                                         ^   ~
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1531:13:
note: array 'acc' declared here
>  1531 |             __m256 acc[NCols4];
>       |             ^
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1677:65:
error: array index 6 is past the end of the array (that has type
'__m256[4]') [-Werror,-Warray-bounds]
> 1677 | __m128 acc_1 = FoldAccumulators(acc[4], acc[5], acc[6],
acc[7]);
> | ^ ~
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1531:13:
note: array 'acc' declared here
>  1531 |             __m256 acc[NCols4];
>       |             ^
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1677:73:
error: array index 7 is past the end of the array (that has type
'__m256[4]') [-Werror,-Warray-bounds]
> 1677 | __m128 acc_1 = FoldAccumulators(acc[4], acc[5], acc[6],
acc[7]);
> | ^ ~
>
/onnxruntime/onnxruntime/core/mlas/lib/sqnbitgemm_kernel_avx2_int8_blklen32.h:1531:13:
note: array 'acc' declared here
>  1531 |             __m256 acc[NCols4];
>       |             ^
> 4 errors generated.
> 
> ### Visual Studio Version
> 
> _No response_
> 
> ### GCC / Compiler Version
> 
> Apple clang version 17.0.0 (clang-1700.0.13.5)</issue_description>
> 
> <agent_instructions>Please investigate the build error. If code need
fix, create a pull requests. Otherwise, suggest ways to avoid the build
errors.</agent_instructions>
> 
> ## Comments on the Issue (you are @copilot in this section)
> 
> <comments>
> </comments>
> 


</details>




- Fixes microsoft#27497


---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
There is a build error when using `--use_vcpkg` without
`--use_vcpkg_ms_internal_asset_cache`; the error looks like:
```
C:\code\onnxruntime\cmake\./vcpkg-ports\pybind11: info: installing overlay port from here
Downloading https://github.com/pybind/pybind11/archive/v3.0.2.tar.gz -> pybind-pybind11-v3.0.2.tar.gz
pybind-pybind11-v3.0.2.tar.gz.33772.part: error: download from https://github.com/pybind/pybind11/archive/v3.0.2.tar.gz had an unexpected hash
note: Expected: 786b1bf534ac67a8d5669f8babf67bb13e48b3a3da1b6344e43ae10a84b80bbc8fea5f12a65fd18739c341fefef5622c5dc096db964dff33cc62ea4259b2e2c1
note: Actual  : 19bee2c76320e25202ee078b5680ff8a7acfb33494dec29dad984ab04de8bcb01340d9fec37c8cc5ac9015dfc367e60312dcd8506e66ce8f0af4c49db562ddef
CMake Error at scripts/cmake/vcpkg_download_distfile.cmake:136 (message):
  Download failed, halting portfile.
```

The root cause is that I uploaded a zip file to the cache server. Without
`--use_vcpkg_ms_internal_asset_cache`, vcpkg tries to download the tar.gz
file from GitHub, and its SHA differs from that of the zip file.

In this PR, I configure the portfile to download the zip file instead,
which avoids the issue.
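The mismatch is easy to reproduce outside vcpkg: the hash check simply compares the SHA512 of the downloaded bytes against the recorded value, so the same source packaged in two archive formats can never share a hash. A minimal Python sketch of that idea (the function name and payloads are illustrative, not vcpkg internals):

```python
import hashlib

def verify_distfile(payload: bytes, expected_sha512: str) -> bool:
    # Mirrors the idea behind vcpkg_download_distfile's check: compare the
    # SHA512 of the downloaded bytes against the recorded hash.
    return hashlib.sha512(payload).hexdigest() == expected_sha512

# The same source tree packaged as .zip vs .tar.gz yields different byte
# streams, so a hash recorded for one format never matches the other.
zip_payload = b"PK\x03\x04 pretend zip bytes"      # hypothetical archive
targz_payload = b"\x1f\x8b pretend tar.gz bytes"   # hypothetical archive

zip_sha = hashlib.sha512(zip_payload).hexdigest()
print(verify_distfile(zip_payload, zip_sha))    # True
print(verify_distfile(targz_payload, zip_sha))  # False -> vcpkg halts
```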
### Description
`setCudaGraphStrategy(kWHOLE_GRAPH_CAPTURE)` was present in the dynamic
engine build path (`CreateNodeComputeInfoFromGraph`) but missing from the
precompiled/AOT engine path (`CreateNodeComputeInfoFromEPContext`). Since
TRT RTX defaults the CUDA Graph strategy to kDISABLED, CUDA Graph
capture never occurred when loading precompiled engines. This change applies
the same `setCudaGraphStrategy` call (guarded by the existing TRT_MAJOR_RTX >=
1.3 version check) to the precompiled path so it matches the dynamic path's
behavior.



### Motivation and Context
Fixes [microsoft#27329](microsoft#27329) —
users reported that cudaGraphLaunch was not occurring when using
precompiled (AOT-built) TensorRT-RTX engines, causing individual kernel
launches and unnecessary CPU overhead instead of batched graph
execution.
…7489)

## Summary
- Allow `SkipLayerNormalization` fusion when Add inputs have
broadcast-compatible shapes
- Add `get_skip_index()` to identify skip tensor and ensure correct
input ordering
- Add tests for 2D `(S, H)` and 3D `(1, S, H)` broadcast skip shapes

## Motivation
Fixes microsoft#27488

The Python optimizer's `FusionSkipLayerNormalization` rejects the `Add →
LayerNormalization` fusion when the two Add inputs have different but
broadcast-compatible shapes. The C++ `SkipLayerNormalization` kernel
already supports broadcasting (skip can be 2D or 3D with last two dims
matching input), but the Python fusion used an exact shape equality
check that blocked these cases. This resolves the existing TODO comments
from @tianleiwu.

## Changes
- **`fusion_skiplayernorm.py`**: Added `_is_broadcast_skip()` helper and
`get_skip_index()` method (following the pattern from
`FusionSkipGroupNorm`). Replaced strict `compare_shape()` equality check
with broadcast-aware logic. Used `skip_index` to ensure the full-sized
input goes to position 0 and the skip to position 1 in the fused node.
- **`test_skip_layer_norm_fusion.py`**: Added
`create_broadcast_test_model()` and 4 new test cases covering 2D/3D
broadcast skip shapes in both Add input positions.

## Test Plan
- [x] All 6 existing `test_skip_layer_norm_fusion` tests pass (no
regressions)
- [x] All 4 new broadcast tests pass
- [x] `ruff check` passes on changed files
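
The broadcast rule described above can be sketched in a few lines of Python (the helper name mirrors `_is_broadcast_skip`, but the body here is an illustrative approximation, not the optimizer's exact code):

```python
def is_broadcast_skip(input_shape, skip_shape):
    # Skip may equal the full (B, S, H) input, or be a 2D (S, H) or
    # 3D (1, S, H) tensor whose trailing dims match the input's.
    if list(input_shape) == list(skip_shape):
        return True
    if len(skip_shape) == 2:
        return list(skip_shape) == list(input_shape[-2:])
    if len(skip_shape) == 3:
        return skip_shape[0] == 1 and list(skip_shape[1:]) == list(input_shape[-2:])
    return False

print(is_broadcast_skip([2, 128, 768], [128, 768]))     # True:  2D (S, H)
print(is_broadcast_skip([2, 128, 768], [1, 128, 768]))  # True:  3D (1, S, H)
print(is_broadcast_skip([2, 128, 768], [2, 64, 768]))   # False: S mismatch
```

The fused node then places the full-sized input at position 0 and the skip at position 1, matching what `get_skip_index()` enforces.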
…icrosoft#27518)

### Description
<!-- Describe your changes. -->
Detect and test for a mismatch between the raw data size and the declared
data type and shape of a LoRA adapter parameter.

### Motivation and Context
Disallow maliciously crafted LoRA adapters that lead to heap out-of-bounds
access.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
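The size check itself reduces to one comparison: the raw byte count must equal the element count times the element size. A hedged numpy-based sketch (`validate_param` is an illustrative name, not the actual ORT API):

```python
import numpy as np

def validate_param(raw: bytes, dtype, shape) -> None:
    # Reject a parameter whose raw byte size disagrees with its declared
    # dtype and shape -- the heap-OOB vector this kind of check closes.
    expected = int(np.prod(shape, dtype=np.int64)) * np.dtype(dtype).itemsize
    if len(raw) != expected:
        raise ValueError(f"raw data is {len(raw)} bytes, expected {expected}")

weights = np.zeros((4, 8), dtype=np.float16)        # 32 elements * 2 bytes
validate_param(weights.tobytes(), np.float16, (4, 8))        # ok: 64 bytes
try:
    validate_param(weights.tobytes()[:-2], np.float16, (4, 8))  # truncated
except ValueError as e:
    print("rejected:", e)
```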
…ngs (microsoft#27288)

### Description
Compilation with Clang toolchains on Linux currently fails due to the
warning below (among others), since ONNX Runtime compiles with `-Werror` by
default. This PR addresses `-Winconsistent-missing-override` in the
TRT NV EP.

```
/home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/nv_tensorrt_rtx/nv_execution_provider.h:309:7: error: 'GetDeviceId' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
  309 |   int GetDeviceId() const { return device_id_; }
      |       ^
/home/stephan/projects/onnxruntime-winai/include/onnxruntime/core/framework/execution_provider.h:183:15: note: overridden virtual function is here
  183 |   virtual int GetDeviceId() const { return default_device_.Id(); }
      |               ^
In file included from /home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/nv_tensorrt_rtx/nv_provider_factory.cc:18:
/home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/nv_tensorrt_rtx/nv_execution_provider.h:310:10: error: 'Sync' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
  310 |   Status Sync() const;
      |          ^
/home/stephan/projects/onnxruntime-winai/include/onnxruntime/core/framework/execution_provider.h:231:26: note: overridden virtual function is here
  231 |   virtual common::Status Sync() const { return Status::OK(); }
      |                          ^
/home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/nv_tensorrt_rtx/nv_provider_factory.cc:63:39: error: 'CreateProvider' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
   63 |   std::unique_ptr<IExecutionProvider> CreateProvider(const OrtSessionOptions& session_options,
      |                                       ^
/home/stephan/projects/onnxruntime-winai/include/onnxruntime/core/providers/providers.h:29:47: note: overridden virtual function is here
   29 |   virtual std::unique_ptr<IExecutionProvider> CreateProvider(const OrtSessionOptions& session_options,
      |                                               ^
/home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/nv_tensorrt_rtx/nv_provider_factory.cc:112:46: error: 'CreateExecutionProviderFactory' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
  112 |   std::shared_ptr<IExecutionProviderFactory> CreateExecutionProviderFactory(const void* param) {
      |                                              ^
/home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/shared_library/provider_host_api.h:19:54: note: overridden virtual function is here
   19 |   virtual std::shared_ptr<IExecutionProviderFactory> CreateExecutionProviderFactory(const void* /*provider_options*/) { return nullptr;
/home/stephan/projects/onnxruntime-winai/onnxruntime/core/providers/nv_tensorrt_rtx/nv_execution_provider.h:309:7: error: 'GetDeviceId' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
  309 |   int GetDeviceId() const { return device_id_; }
      |       ^
/home/stephan/projects/onnxruntime-winai/include/onnxruntime/core/framework/execution_provider.h:183:15: note: overridden virtual function is here
  183 |   virtual int GetDeviceId() const { return default_device_.Id(); }
```

### Motivation and Context
Fixing these Clang warnings enables builds with Clang on Linux, since
`-Werror` enforces warning-free builds.
… and provider bridge EPs (microsoft#27522)

### Description
Set "library_path" metadata entry in OrtEpDevice instances for plugin
and provider bridge EPs.

### Motivation and Context
Makes the library path available everywhere. Required by GenAI to load the
custom ops library.

microsoft#27496
…rs on Windows (microsoft#27039)

### Description
Disable the cmake prologue in the pch file, which was resulting in warnings
being unexpectedly suppressed in the unit test projects that use precompiled
headers.

For example, C4834 was suppressed, so there was no warning for a nodiscard
`Status` return value not being checked in a Windows build, even though the
same code generates a warning on other platforms.

Update a few tests to resolve warnings that now get triggered.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Warnings were causing build failures for non-Windows CIs; those warnings
should also have been generated for Windows builds.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
Removes YAML that's no longer used.
### Description
Non-required builds fail because the LoRA tests use `ASSERT_THROW` while
RTTI is disabled.

Follow-up to microsoft#27518.
### Motivation and Context
This change is needed because, for a 32B model like Qwen2.5-Coder-32B on the
TensorRT-RTX EP, GenAI produces a config string longer than the current
limit:


https://github.com/microsoft/onnxruntime-genai/blob/3c47932e9d7afa0d44db0b3918e479bbdd4c5353/src/models/model.cpp#L516

Example

```
AddConfigEntry: ep.nvtensorrtrtxexecutionprovider.nv_profile_min_shapes (length=4364) = input_ids:1x1,attention_mask:1x1,past_key_values.0.key:1x8x0x128,past_key_values.0.value:1x8x0x128,past_key_values.1.key:1x8x0x128,past_key_values.1.value:1x8x0x128,past_key_values.2.key:1x8x0x128,past_key_values.2.value:1x8x0x128,past_key_values.3.key:1x8x0x128,past_key_values.3.value:1x8x0x128,past_key_values.4.key:1x8x0x128,past_key_values.4.value:1x8x0x128,past_key_values.5.key:1x8x0x128,past_key_values.5.value:1x8x0x128,past_key_values.6.key:1x8x0x128,past_key_values.6.value:1x8x0x128,past_key_values.7.key:1x8x0x128,past_key_values.7.value:1x8x0x128,past_key_values.8.key:1x8x0x128,past_key_values.8.value:1x8x0x128,past_key_values.9.key:1x8x0x128,past_key_values.9.value:1x8x0x128,past_key_values.10.key:1x8x0x128,past_key_values.10.value:1x8x0x128,past_key_values.11.key:1x8x0x128,past_key_values.11.value:1x8x0x128,past_key_values.12.key:1x8x0x128,past_key_values.12.value:1x8x0x128,past_key_values.13.key:1x8x0x128,past_key_values.13.value:1x8x0x128,past_key_values.14.key:1x8x0x128,past_key_values.14.value:1x8x0x128,past_key_values.15.key:1x8x0x128,past_key_values.15.value:1x8x0x128,past_key_values.16.key:1x8x0x128,past_key_values.16.value:1x8x0x128,past_key_values.17.key:1x8x0x128,past_key_values.17.value:1x8x0x128,past_key_values.18.key:1x8x0x128,past_key_values.18.value:1x8x0x128,past_key_values.19.key:1x8x0x128,past_key_values.19.value:1x8x0x128,past_key_values.20.key:1x8x0x128,past_key_values.20.value:1x8x0x128,past_key_values.21.key:1x8x0x128,past_key_values.21.value:1x8x0x128,past_key_values.22.key:1x8x0x128,past_key_values.22.value:1x8x0x128,past_key_values.23.key:1x8x0x128,past_key_values.23.value:1x8x0x128,past_key_values.24.key:1x8x0x128,past_key_values.24.value:1x8x0x128,past_key_values.25.key:1x8x0x128,past_key_values.25.value:1x8x0x128,past_key_values.26.key:1x8x0x128,past_key_values.26.value:1x8x0x128,past_key_values.27.key:1x8x0x128,past_key_values.27.value:1x8x0
x128,past_key_values.28.key:1x8x0x128,past_key_values.28.value:1x8x0x128,past_key_values.29.key:1x8x0x128,past_key_values.29.value:1x8x0x128,past_key_values.30.key:1x8x0x128,past_key_values.30.value:1x8x0x128,past_key_values.31.key:1x8x0x128,past_key_values.31.value:1x8x0x128,past_key_values.32.key:1x8x0x128,past_key_values.32.value:1x8x0x128,past_key_values.33.key:1x8x0x128,past_key_values.33.value:1x8x0x128,past_key_values.34.key:1x8x0x128,past_key_values.34.value:1x8x0x128,past_key_values.35.key:1x8x0x128,past_key_values.35.value:1x8x0x128,past_key_values.36.key:1x8x0x128,past_key_values.36.value:1x8x0x128,past_key_values.37.key:1x8x0x128,past_key_values.37.value:1x8x0x128,past_key_values.38.key:1x8x0x128,past_key_values.38.value:1x8x0x128,past_key_values.39.key:1x8x0x128,past_key_values.39.value:1x8x0x128,past_key_values.40.key:1x8x0x128,past_key_values.40.value:1x8x0x128,past_key_values.41.key:1x8x0x128,past_key_values.41.value:1x8x0x128,past_key_values.42.key:1x8x0x128,past_key_values.42.value:1x8x0x128,past_key_values.43.key:1x8x0x128,past_key_values.43.value:1x8x0x128,past_key_values.44.key:1x8x0x128,past_key_values.44.value:1x8x0x128,past_key_values.45.key:1x8x0x128,past_key_values.45.value:1x8x0x128,past_key_values.46.key:1x8x0x128,past_key_values.46.value:1x8x0x128,past_key_values.47.key:1x8x0x128,past_key_values.47.value:1x8x0x128,past_key_values.48.key:1x8x0x128,past_key_values.48.value:1x8x0x128,past_key_values.49.key:1x8x0x128,past_key_values.49.value:1x8x0x128,past_key_values.50.key:1x8x0x128,past_key_values.50.value:1x8x0x128,past_key_values.51.key:1x8x0x128,past_key_values.51.value:1x8x0x128,past_key_values.52.key:1x8x0x128,past_key_values.52.value:1x8x0x128,past_key_values.53.key:1x8x0x128,past_key_values.53.value:1x8x0x128,past_key_values.54.key:1x8x0x128,past_key_values.54.value:1x8x0x128,past_key_values.55.key:1x8x0x128,past_key_values.55.value:1x8x0x128,past_key_values.56.key:1x8x0x128,past_key_values.56.value:1x8x0x128,past_key_values.57.key:
1x8x0x128,past_key_values.57.value:1x8x0x128,past_key_values.58.key:1x8x0x128,past_key_values.58.value:1x8x0x128,past_key_values.59.key:1x8x0x128,past_key_values.59.value:1x8x0x128,past_key_values.60.key:1x8x0x128,past_key_values.60.value:1x8x0x128,past_key_values.61.key:1x8x0x128,past_key_values.61.value:1x8x0x128,past_key_values.62.key:1x8x0x128,past_key_values.62.value:1x8x0x128,past_key_values.63.key:1x8x0x128,past_key_values.63.value:1x8x0x128
Traceback (most recent call last):
  File "Convert to NVIDIA TRT for RTX_32B\test_config.py", line 2, in <module>
    model = og.Model("Convert to NVIDIA TRT for RTX_32B\\model")
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Config value is longer than maximum length: 4096
```

---------

Co-authored-by: hualxie <hualxie@microsoft.com>
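The reported length of 4364 characters is exactly what 64 KV-cache layers produce. A quick Python sketch reconstructing the string from the log above (the parameter names follow the log; the helper itself is illustrative):

```python
def build_min_shapes(num_layers: int = 64, num_kv_heads: int = 8,
                     head_dim: int = 128) -> str:
    # Reconstruct the nv_profile_min_shapes value from the log above:
    # two scalar inputs plus one key and one value entry per layer.
    parts = ["input_ids:1x1", "attention_mask:1x1"]
    for i in range(num_layers):
        parts.append(f"past_key_values.{i}.key:1x{num_kv_heads}x0x{head_dim}")
        parts.append(f"past_key_values.{i}.value:1x{num_kv_heads}x0x{head_dim}")
    return ",".join(parts)

s = build_min_shapes()
print(len(s))  # 4364 -- well past the previous 4096-character limit
```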
…t#27541)

### Description

Returns a ConfigOptions object instead of a const reference.
@Jaswanth51 Jaswanth51 requested a review from ankitm3k March 4, 2026 09:59
@ankitm3k ankitm3k merged commit 584726d into ovep-develop Mar 4, 2026
7 of 8 checks passed
@ankitm3k ankitm3k deleted the sync_msft_04032026 branch March 4, 2026 14:11