Sync with Microsoft ONNX Runtime - 02/03/2026 #955
Closed
Jaswanth51 wants to merge 16 commits into ovep-develop from
Conversation
…sion (microsoft#27469)

## Summary
- Fix Cast node naming collisions in `convert_float_to_float16` when nodes have empty names (common in PyTorch exports)
- Fix `ALWAYS_FLOAT_INPUTS` for opset 10 Resize, where scales at index 1 was unprotected
- Add a dedicated test suite for float16 conversion (`test_float16.py`, 8 tests)

## Motivation
Fixes microsoft#14827

When `convert_float_to_float16` processes models with unnamed nodes (empty `node.name`, very common in PyTorch/TensorFlow-exported ONNX models), the generated Cast node names collide. For example, multiple Resize nodes all produce Cast nodes named `"_input_cast_2"` and output tensors named `"_input_cast_2"`, corrupting the graph with duplicate names.

Additionally, the `ALWAYS_FLOAT_INPUTS` dict only protected Resize scales at index 2 (opset 11+ layout: `[X, roi, scales, sizes]`), but opset 10 Resize has scales at index 1 (`[X, scales]`), leaving it unprotected.

## Changes
**`onnxruntime/python/tools/transformers/float16.py`** (11 lines changed):
- Use unique tensor names (`input_name`/`output`) as the base for generated Cast node and output names, instead of the potentially empty `node.name`
- Add index 1 to `ALWAYS_FLOAT_INPUTS["Resize"]` to protect opset 10 scales
- Fix a misleading comment ("change current node's input name" → "output name")

**`onnxruntime/test/python/transformers/test_float16.py`** (new file, 8 tests):
- `test_resize_opset11_cast_naming_unique` — multiple unnamed Resize nodes produce unique Cast names
- `test_resize_opset11_scales_initializer_stays_fp32` — scales initializer preserved as float32
- `test_resize_opset10_scales_initializer_stays_fp32` — opset 10 scales protected at index 1
- `test_resize_opset10_multiple_unnamed_unique_names` — opset 10 naming uniqueness
- `test_blocked_node_cast_naming_unique` — blocked op nodes (Upsample) also get unique Cast names
- `test_resize_with_op_block_list` — Resize in op_block_list still produces unique names
- `test_data_input_converted_to_fp16` — data tensor correctly converts to fp16
- `test_force_fp16_initializers` — force flag overrides protection

## Test Plan
- All 8 new tests pass locally (`python -m unittest test_float16.TestFloat16Conversion -v`)
- Existing `test_gpt2_past_fp16` test passes (no regression in existing float16 behavior)
- `ruff check` passes on both files
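The naming fix can be illustrated with a small sketch: a hypothetical helper (not the actual `float16.py` code) that derives Cast names from the unique tensor name rather than the possibly empty `node.name`, deduplicating with a counter:

```python
# Hypothetical sketch of the naming idea. Basing generated names on the
# tensor being cast (which is unique in a valid ONNX graph) avoids the
# collisions produced when many unnamed nodes all yield "_input_cast_N".
def make_cast_name(tensor_name, used_names):
    """Return a unique name for an inserted Cast node/output."""
    base = tensor_name + "_cast"
    name = base
    suffix = 0
    while name in used_names:
        suffix += 1
        name = f"{base}_{suffix}"
    used_names.add(name)
    return name
```

With empty `node.name` as the base, every unnamed node would race for the same `"_cast"` slots; with the tensor name, collisions only occur when the same tensor is cast more than once, and the counter resolves those.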
…icrosoft#27458)

## Summary
- Add SDPA-aware pattern matching to `FusionBartAttention` so that BART attention fusion succeeds on models exported with HuggingFace Transformers >= 4.49
- Add a synthetic BART SDPA graph generator and unit test

## Motivation
Fixes microsoft#23864

HuggingFace Transformers >= 4.49 replaced `BartAttention` with `BartSdpaAttention` ([commit `2c47618`](huggingface/transformers@2c47618)), changing the ONNX export graph topology in several ways that broke `FusionBartAttention` pattern matching. Running `optimize_model(..., model_type="bart")` on these newer exports produces **zero** fused Attention nodes.

## Changes

### `fusion_bart_attention.py`
The SDPA refactor introduces four structural changes to the attention subgraph. Each required a new match path:

1. **QKV output path — LayerNormalization anchor fallback**
   For SDPA models, symbolic shape inference often fails, which prevents SkipLayerNormalization fusion. When the anchor node is a plain `LayerNormalization` instead of `SkipLayerNormalization`, there's an extra residual `Add` between the LayerNorm and the attention output projection. Added a fallback match: `["Add", "Add", "MatMul", "Reshape", "Transpose", "MatMul"]` with `[0, None, 0, 0, 0, 0]`.
2. **QK path — NaN guard (Where + IsNaN)**
   SDPA wraps the Softmax output in a NaN guard: `Where(IsNaN(softmax), 0.0, softmax)`. The `Where` node's input[2] is the Softmax output. Added two new QK paths:
   - No mask: `["Where", "Softmax", "MatMul"]` with `[0, 2, 0]`
   - With mask: `["Where", "Softmax", "Add", "MatMul"]` with `[0, 2, 0, 0]`
3. **Q and K scaling paths**
   Instead of a single combined scale on the QK MatMul output, SDPA applies a separate `Mul(1/sqrt(head_dim))` to Q and K before the QK MatMul. Added:
   - Q path: `["Mul", "Transpose", "Reshape", "Add", "MatMul"]` with `[0, 0, 0, 0, None]`
   - K path: `["Mul", "Reshape", "Transpose", "Reshape", "Transpose", "Reshape", "Add", "MatMul"]` with `[1, 0, 0, 0, 0, 0, 0, None]` (K^T uses a `Reshape→Transpose(0,2,1)→Reshape` chain)
4. **num_heads fallback for dynamic shapes**
   SDPA models use `-1` in reshape shape tensors for dynamic dimensions, causing `get_num_heads_and_hidden_size` to return negative values. Added a fallback to user-specified `num_heads`/`hidden_size` when the detected values are invalid.

### `bart_model_generator.py` (new)
Synthetic BART SDPA attention graph generator that builds a minimal but complete attention subgraph matching the SDPA topology. Tests both `with_mask=True` (decoder self-attention) and `with_mask=False` (encoder attention) variants.

### `test_attention_fusion.py`
Added `test_bart_attention_sdpa_fusion`, which verifies:
- 1 Attention node is produced for each mask variant
- Correct `num_heads` attribute
- Correct `unidirectional` attribute (1 for decoder self-attention with mask, 0 for encoder)

## Test Plan
- [x] `python -m pytest test_attention_fusion.py -v` — all 10 tests pass
- [x] `lintrunner` on all 3 changed files — no issues
- [x] Verified on a real exported BART SDPA model (`hf-internal-testing/tiny-random-bart`): 2 Attention nodes fused, graph reduced from 120 → 34 nodes
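The NaN-guard subgraph from the QK path is easy to model in NumPy. This is a rough illustration of the exported pattern, not code from the PR; it shows one situation the guard handles: a fully masked score row makes Softmax produce NaN, which `Where(IsNaN(...), 0.0, ...)` zeroes out.

```python
import numpy as np

def sdpa_softmax_with_nan_guard(scores):
    # NumPy model of the subgraph the new QK match path targets:
    # Softmax followed by Where(IsNaN(softmax), 0.0, softmax).
    e = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    probs = e / np.sum(e, axis=-1, keepdims=True)
    # The Where node: NaN entries (e.g. from an all -inf row) become 0.0.
    return np.where(np.isnan(probs), 0.0, probs)
```

An all `-inf` row (every position masked) yields `nan` throughout the softmax, so the guarded output for that row is all zeros, while unmasked rows still sum to 1.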
…osoft#27428)

Replace and reland microsoft#27129.

Comparison between this PR's approach and inlining the conversion in softmax:

## Tradeoffs

| Category | Pre-conversion (current) | Inline in softmax |
| :--- | :--- | :--- |
| **Memory** | Extra buffer ($num\_elements \times sizeof(T)$) | None — reads 1-byte bool directly |
| **Kernel launches** | +1 simple elementwise kernel | Zero extra |
| **Code complexity** | 3 files, ~40 lines added | 6+ kernel templates, macros, dispatch logic, data structs |
| **Risk** | Low — softmax path untouched | High — modifying battle-tested softmax kernels used by MHA + GQA contrib ops |
| **Perf impact** | Negligible — mask is small vs. QKV; conversion is memory-bound and fast | Slightly better theoretical bandwidth |
| **Maintainability** | Clean separation of concerns | Adds a template dimension across all softmax variants |

---

This pull request enhances the ONNX Runtime CUDA Attention operator to support boolean attention masks in the Multi-Head Attention (MHA) path, converting them to an additive attention bias on the GPU. It also improves test coverage to ensure correctness and parity with the CPU implementation. The main changes include implementing a CUDA kernel for mask conversion, updating the operator logic to handle bool masks, clarifying broadcasting rules, and adding comprehensive unit tests.

**CUDA Attention Operator Improvements:**
- Implemented a CUDA kernel (`LaunchConvertBoolMaskToAttentionBias`) that converts boolean attention masks to additive bias (True → 0.0, False → mask_filter_value) for the MHA path, ensuring efficient GPU execution.
- Updated `attention.cc` to use this kernel, correctly handle bool masks in the MHA path, and clarified the broadcasting logic and mask shape interpretation for both GQA and MHA.

**Testing and Documentation Enhancements:**
- Added new test cases and a dedicated test class to validate the correctness of boolean mask handling in the MHA path, ensuring parity with the CPU implementation for 2D, 3D, and 4D mask shapes.
- Improved comments and documentation in both code and tests to clarify ONNX broadcasting rules and mask shape expectations for different attention paths.

**Test Coverage and Reliability:**
- Enabled CUDA-based tests for boolean mask scenarios previously only tested on CPU, and adjusted test logic to ensure correct handling of edge cases (e.g., all-false masks).

These changes make the CUDA Attention operator more robust and feature-complete, aligning its behavior with the CPU implementation and ONNX specifications.
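The True → 0.0, False → mask_filter_value conversion can be modeled in a few lines of NumPy. This is an illustrative sketch of the kernel's semantics, not the CUDA source, and the `-10000.0` default here is an assumption:

```python
import numpy as np

def bool_mask_to_attention_bias(mask, mask_filter_value=-10000.0):
    # Sketch of what a bool-mask-to-bias conversion computes: True (attend)
    # becomes 0.0 and False (masked) becomes a large negative value, so the
    # result can simply be added to the QK scores before softmax.
    return np.where(mask, np.float32(0.0), np.float32(mask_filter_value))
```

Pre-converting keeps the softmax kernels untouched: they always consume an additive float bias, regardless of whether the user supplied a bool or float mask.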
…icrosoft#27470)

Heap-allocate `WebGpuContextFactory::contexts_` to avoid crashes during static destruction when dependent DLLs (e.g. dxcompiler.dll) have already been unloaded. `Cleanup()` explicitly deletes the map; on abnormal termination it safely leaks.

---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…27480)

### Description

### Motivation and Context
Resolves ongoing security issues with auditing the jar testing pipeline in the backend.
…7478)
…icrosoft#27474)

### Description
Improves the error message returned by ORT when loading a compiled model in a session that does not have the required execution provider(s). For example, ORT now returns the following error when creating a session with a model compiled explicitly for the OpenVINO EP without adding the OpenVINO EP to the session:

> EPContext node generated by 'OpenVINOExecutionProvider' is not compatible with any execution provider added to the session. EPContext node name: 'EPContextNode0'. Available session execution providers: [CPUExecutionProvider].

Compare the above message with the more generic message that is currently returned:

> Could not find an implementation for EPContext(1) node with name 'EPContextNode0'

### Motivation and Context
Improves diagnosability when loading of pre-compiled models fails. Specifically, the ambiguity of the original message led to many hours spent debugging an error where a compiled model failed to run because the expected `OrtEpDevice` was inadvertently not added to a session.
… compatibility (microsoft#27484)

This pull request refactors validation logic for CUDA attention masks and tensor scatter operations to move error checking from host-side (CPU) to device-side (GPU) using CUDA kernel assertions (`CUDA_KERNEL_ASSERT`). This change eliminates synchronous host-device memory transfers and stream synchronizations, improving performance and simplifying code. Corresponding test cases are updated to expect validation failures only on the CPU, as CUDA errors are now asynchronous.

Key changes:

**Attention mask validation (GQA path):**
- Removes host-side validation and memory copies for boolean attention masks in `attention.cc`; mask validity (right-padding, contiguous True/False) is now checked asynchronously via `CUDA_KERNEL_ASSERT` in the CUDA kernel.
- Updates the CUDA kernel and its interface to drop the `validation_result` buffer and rely on device assertions for mask validation. Documentation is updated to reflect this asynchronous error checking.

**TensorScatter write_indices validation:**
- Removes host-side validation and synchronization for `write_indices` in `tensorscatter.cc`; index bounds checking is now performed asynchronously inside the CUDA kernel via `CUDA_KERNEL_ASSERT`.

**Test updates:**
- Updates negative test cases for `TensorScatter` to run only on CPU, since CUDA now validates asynchronously and will not synchronously return errors to the host.
…crosoft#27136)

### Description
Introduces a backend kernel selector config struct in MLAS that allows users to configure selection of backend kernels at runtime based on their preference. The immediate use case of such a feature is to allow users to opt out of using/selecting KleidiAI kernels should they choose to do so on ARM platforms. This solution should scale to other kernel implementation backends in the future.

### Motivation and Context
Allow users to opt out of using/selecting KleidiAI kernels should they choose to do so on ARM platforms.

---------
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
### Description This PR adds a few headers for supporting building WebGPU EP and CUDA EP as plugin EPs. See summary of microsoft#26907
…lation… (microsoft#27359)

### Description
This change optimises the GridSample operator in onnxrt for GridSample nodes with the characteristics found in the camera-based 3D object detection model in the MLPerf Automotive space: 2D interpolation mode = linear, padding mode = zeros, align_corners = 0.

1. **Fast path for linear interpolation with zeros padding.**

   Linear interpolation: for each (x, y), the code locates the four surrounding integer pixel centers:
   - (x1, y1) = (floor(x), floor(y)) (top-left)
   - (x2, y2) = (x1 + 1, y1 + 1) (bottom-right)

   The interpolation weights reflect the fractional positions:
   - dx1 = x - x1, dx2 = x2 - x
   - dy1 = y - y1, dy2 = y2 - y

   The resulting value is the bilinear blend

   dy2 * (dx2 * p11 + dx1 * p12) + dy1 * (dx2 * p21 + dx1 * p22)

   where p11…p22 are the input pixels at those four neighbor coordinates.

   Padding mode = zeros: any neighbor index that falls outside [0, W_in-1] × [0, H_in-1] contributes 0 to the interpolation.

   Each output pixel (oy, ox) carries normalized coordinates (nx, ny) in [-1, 1]. With align_corners=0, nx = -1 corresponds to a location half a pixel before the leftmost input column (i.e., x = -0.5), and nx = 1 corresponds to half a pixel beyond the rightmost column (x = W_in - 0.5). The same idea applies vertically for ny.

   Fast path optimisation: the implementation precomputes all neighbor indices/weights for each output pixel once (they depend only on the grid), then reuses them for every channel. Previously, indices and weights were calculated inside the loops, which can run as many as H_out*W_out (e.g. 20,000 per batch element in one case) × 32 channels times.

2. **Optional ARM NEON vectorization.**
   - `vld1_f32(ptr)`: loads two contiguous float values into a `float32x2_t`. Used to read the top and bottom neighbor pairs (`[p11, p12]`, `[p21, p22]`).
   - `vcombine_f32(low, high)`: concatenates two `float32x2_t` values into one `float32x4_t`, giving `[p11, p12, p21, p22]`.
   - `vdup_n_f32(val)`: duplicates a scalar float into both lanes of a `float32x2_t`.
   - `vset_lane_f32(val, vec, lane)`: writes `val` into the specified lane of a `float32x2_t`, letting us form `[w11, w12]` and `[w21, w22]`.
   - `vmulq_f32(a, b)`: multiplies two `float32x4_t` vectors element-wise (neighbor pixels × weights).
   - `vget_low_f32(vec)` / `vget_high_f32(vec)`: extract the lower or upper 2 lanes from a `float32x4_t` as `float32x2_t`.
   - `vadd_f32(a, b)`: adds two `float32x2_t` vectors element-wise (forming partial sums).
   - `vpadd_f32(a, b)`: performs pairwise adds within and across two `float32x2_t` vectors, collapsing four elements down to two.
   - `vget_lane_f32(vec, lane)`: reads a scalar from a specific lane, giving the final interpolated value.

Most of the performance uplift comes from the first optimisation; the NEON intrinsics still contribute, but not as much.

Overall performance improvement:

1 thread:
<img width="902" height="766" alt="image" src="https://github.com/user-attachments/assets/d1fadc6d-370d-4750-baee-1123c7d18af3" />

2 threads:
<img width="902" height="766" alt="image" src="https://github.com/user-attachments/assets/69c86fd6-815a-4b52-8f86-615f1c99bf0a" />

### Motivation and Context
The fast path handles denormalisation of the linear coordinates and derives the indices by precomputing a separate plan entry per output pixel. In `PrecomputeBilinearSamplePlan2D`, the loop runs across all H_out * W_out points, using the right nx/ny for each (oy, ox) and storing that point's four indices, four weights, and mask in `plans[idx]`. During evaluation, `EvaluatePlanForChannel` iterates through the same point_count (H_out*W_out) and uses the matching plan entry for each (oy, ox). So we are not reusing one plan across different spatial positions; we precompute one plan per output location and reuse it only across channels (which share the same grid).
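The precompute-once, reuse-per-channel idea can be sketched in NumPy. This is a loose model of the plan-based fast path under the stated assumptions (linear mode, zeros padding, align_corners=0); function names here are illustrative, not the actual C++ identifiers:

```python
import numpy as np

def precompute_plan(grid, h_in, w_in):
    # One plan entry per output pixel: top-left neighbor indices plus the
    # four bilinear weights. Depends only on the grid, so it is computed
    # once and reused for every channel.
    # Denormalize with align_corners=0: nx=-1 -> x=-0.5, nx=1 -> x=w_in-0.5.
    x = (grid[..., 0] + 1.0) * w_in / 2.0 - 0.5
    y = (grid[..., 1] + 1.0) * h_in / 2.0 - 0.5
    x1 = np.floor(x).astype(np.int64)
    y1 = np.floor(y).astype(np.int64)
    dx1, dy1 = x - x1, y - y1
    weights = np.stack([(1 - dx1) * (1 - dy1),   # w11 for (x1,   y1)
                        dx1 * (1 - dy1),         # w12 for (x1+1, y1)
                        (1 - dx1) * dy1,         # w21 for (x1,   y1+1)
                        dx1 * dy1], axis=-1)     # w22 for (x1+1, y1+1)
    return x1, y1, weights

def evaluate_plan(img, plan):
    # Reuse the precomputed plan for one channel; out-of-range neighbors
    # contribute zero (padding mode = zeros).
    x1, y1, w = plan
    h_in, w_in = img.shape
    out = np.zeros(x1.shape, dtype=img.dtype)
    for k, (dy, dx) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
        xi, yi = x1 + dx, y1 + dy
        valid = (xi >= 0) & (xi < w_in) & (yi >= 0) & (yi < h_in)
        pix = img[np.clip(yi, 0, h_in - 1), np.clip(xi, 0, w_in - 1)]
        out += np.where(valid, w[..., k] * pix, 0.0)
    return out
```

Calling `evaluate_plan` once per channel with the same plan mirrors the reuse that removes the per-channel index/weight recomputation.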
--------- Signed-off-by: melkap01 <melike.kaptan@arm.com> Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
microsoft#27483)

### Description
This PR reduces the size of the memory allocation for expected outputs from ~4 GiB to ~2 GiB in the Gather_overflow_check test. The updated test still verifies that the integer overflow fix from PR microsoft#27444 is valid. That is, the CPU Gather operator correctly handles output tensors with element counts that exceed INT32_MAX.

Changes:
- Reduced the test dimension from 65537 to 46341 (output shape from 65537×65537 to 46341×46341), which results in a total element count just over INT32_MAX, as required to exercise the bug fix.
- The peak memory usage is reduced to ~4 GiB + overhead.
- Increased the Android emulator memory from 4 GiB to 5 GiB to be able to run the test.

### Motivation
Android CI fails to run the unit test introduced in microsoft#27444 due to memory usage that exceeds the Android emulator's default memory of 4 GiB. This PR lowers the peak memory usage of the unit test and increases the Android emulator's memory by 1 GiB.
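The choice of 46341 checks out arithmetically; a quick sanity sketch (not project code):

```python
INT32_MAX = 2**31 - 1  # 2147483647

# 46341 is the smallest dimension whose square exceeds INT32_MAX, so a
# 46341 x 46341 output still overflows a 32-bit element count (the
# condition the overflow fix guards against) while needing roughly half
# the elements of the previous 65537 x 65537 shape.
assert 46340 * 46340 <= INT32_MAX
assert 46341 * 46341 > INT32_MAX
assert 46341 * 46341 < 65537 * 65537 // 2
```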
…() (microsoft#27295) Remove unnecessary s_kernel_registry_vitisaiep.reset() call in deinitialize_vitisai_ep() function. The kernel registry will be repopulated on next initialization, making this reset redundant.
…hains (microsoft#27391)

### Description
Profiling shows that CheckIfSubtreesAreEqual is invoked recursively for many node pairs for LightGBM models with categorical features. A significant portion of this work consists of self-comparisons (left_id == right_id), leading to effectively O(n²) work comparing trees to themselves during model loading. This change adds a fast path for trivial equality, avoiding unnecessary recursive comparisons.

Example results:
- model with 7K BRANCH_EQ nodes: 527 ms → 47 ms (~11× faster)
- model with 106K BRANCH_EQ nodes: 141 s → 80 ms (~1760× faster)

### Motivation and Context
We have some LightGBM-exported models that make heavy use of categorical features and exhibit extremely slow load times (minutes for a single 2.5 MB model). Here's a diagram to illustrate the issue:

<img width="1008" height="1229" alt="image" src="https://github.com/user-attachments/assets/348e16cb-9eec-448f-ac5c-e1edb60e2a3d" />

The 106K model has much longer "member of" chains, with chains that lead into more chains:

<details>
<summary>"trees"</summary>
<img width="1405" height="593" alt="image" src="https://github.com/user-attachments/assets/12f0c43f-5987-4b33-9001-2a2b526e537f" />
</details>

Interestingly, we also tried using the new onnx.ml opset 5 node that has MEMBER, but it seems even slower, as it recreates these BRANCH_EQ chains.
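The fast path is simple to model. Below is a hypothetical sketch of the idea; ONNX Runtime's actual `CheckIfSubtreesAreEqual` operates on TreeEnsemble node attributes, not Python dicts:

```python
def subtrees_equal(a, b, kind, children):
    # kind: node id -> node type; children: node id -> (left, right) or
    # None for a leaf. Comparing a node with itself is trivially equal,
    # which removes the O(n^2) self-comparison blow-up during loading.
    if a == b:
        return True  # fast path: identical ids imply identical subtrees
    if kind[a] != kind[b]:
        return False
    ca, cb = children[a], children[b]
    if (ca is None) != (cb is None):
        return False
    if ca is None:
        return True
    return (subtrees_equal(ca[0], cb[0], kind, children)
            and subtrees_equal(ca[1], cb[1], kind, children))
```

Without the first check, every self-comparison walks the full subtree; with it, the dominant case in the profiled models returns in constant time.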
)

### Description
Add support for pre-layer normalization (pre-LN) transformer architectures in the Python attention fusion optimizer.

### Motivation and Context
Fixes microsoft#11684

Pre-LN models (used in GPT-3, ViT variants, and many modern architectures) apply LayerNormalization **before** attention rather than after. The first block of a pre-LN model has no `Add` node before its first `LayerNormalization` — its input comes directly from a graph input. This caused `FusionAttention.fuse()` to bail out early because it assumed every `LayerNormalization` anchor has an `Add` parent (the residual connection from the previous block).

This PR makes four surgical changes to `fusion_attention.py` so that pre-LN first-block models fuse correctly, while preserving all existing post-LN behavior:

1. **Allow LN with graph-input parent** — instead of returning early when no `Add` parent is found, check whether the input is a graph input and continue
2. **Include graph inputs in residual collection** — the `other_inputs` loop previously skipped anything not in `output_name_to_node`; graph inputs are now recognized
3. **Extend child-LN resolution to SkipLN anchors** — after `fuse_skip_layer_norm()` runs, the anchor becomes `SkipLayerNormalization`; the redirect from `root_input` (graph input) to the first LN's output now fires for SkipLN anchors too
4. **Guard `output_name_to_node` lookup** — graph inputs are not node outputs, so the dictionary access is now guarded

### Changes
- `onnxruntime/python/tools/transformers/fusion_attention.py` — 4 targeted edits to `FusionAttention.fuse()`
- `onnxruntime/test/python/transformers/bert_model_generator.py` — new `create_bert_attention_pre_ln()` test graph generator
- `onnxruntime/test/python/transformers/test_attention_fusion.py` — new `test_attention_fusion_pre_ln()` test

### Test Plan
- [x] New unit test `test_attention_fusion_pre_ln` passes — verifies the `Attention` fused op appears in the optimized graph
- [x] Lintrunner passes on all changed files (no lint issues)
- [x] Changes are minimal and scoped to the pre-LN first-block gap
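Changes 1 and 4 can be sketched together. This is an illustrative model with hypothetical names (not the actual `fusion_attention.py` code) of how a guarded lookup plus a graph-input check replaces the early bail-out:

```python
def resolve_ln_residual(ln_input, output_name_to_node, graph_input_names):
    # Guarded lookup: graph inputs are not node outputs, so .get() avoids
    # a KeyError where direct indexing previously required a producer node.
    producer = output_name_to_node.get(ln_input)
    if producer is not None and producer["op_type"] == "Add":
        return "residual_add"     # post-LN: residual Add from previous block
    if ln_input in graph_input_names:
        return "graph_input"      # pre-LN first block: continue fusion
    return "bail_out"             # unsupported topology: skip this anchor
```

Before the change, only the `"residual_add"` branch existed, so pre-LN first blocks (whose LayerNorm input is a graph input) always hit the bail-out.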
ankitm3k approved these changes on Mar 2, 2026
Synchronizing intel/onnxruntime ovep-develop branch with latest changes from microsoft/onnxruntime master branch.