[SPARK-55848][SQL][4.0] Fix incorrect dedup results with SPJ partial clustering#54851
[SPARK-55848][SQL][4.0] Fix incorrect dedup results with SPJ partial clustering#54851naveenp2708 wants to merge 2 commits intoapache:branch-4.0from
Conversation
|
@peter-toth Backport to branch-4.0 as requested. Same fix as #54751. Branch-3.5 PR coming next. |
| required match { | ||
| case c @ ClusteredDistribution(requiredClustering, requireAllClusterKeys, _) => | ||
| if (requireAllClusterKeys) { | ||
| // Checks whether this partitioning is partitioned on exactly same clustering keys of |
There was a problem hiding this comment.
I just noticed that you deleted 3 comments, but this change should be just a reorder of conditions. Can you please put those comments back?
Also, it seems those comments were deleted in #54751 as well, can you please restore them on branch-4.1 in a follow-up PR?
peter-toth
left a comment
There was a problem hiding this comment.
LGTM, just minor request.
…clustering When SPJ partial clustering splits a partition across multiple tasks, post-join dedup operators (dropDuplicates, Window row_number) produce incorrect results because KeyGroupedPartitioning.satisfies0() incorrectly reports satisfaction of ClusteredDistribution via super.satisfies0() short-circuiting the isPartiallyClustered guard. This fix adds an isPartiallyClustered flag to KeyGroupedPartitioning and restructures satisfies0() to check ClusteredDistribution first, returning false when partially clustered. EnsureRequirements then inserts the necessary Exchange. Plain SPJ joins without dedup are unaffected. Closes apache#54378
bbd585d to
ef11958
Compare
|
@peter-toth Thanks for catching that! Restored all 3 deleted comments — the change is now a reorder only. Will open a follow-up PR for branch-4.1 to restore the comments there as well. |
dongjoon-hyun
left a comment
There was a problem hiding this comment.
+1, LGTM (Pending CIs).
|
@naveenp2708, can you please regtrigger this one as well? |
…clustering ### What changes were proposed in this pull request? Backport fix for SPARK-55848 to branch-3.5. Same fix as merged in branch-4.1 via #54751 and branch-4.0 via #54851. The fix adds an `isPartiallyClustered` flag to `KeyGroupedPartitioning` and restructures `satisfies0()` to check `ClusteredDistribution` first, returning `false` when partially clustered. `EnsureRequirements` then inserts the necessary Exchange. ### Why are the changes needed? SPJ with partial clustering produces incorrect results for post-join dedup operations (dropDuplicates, Window row_number). ### Does this PR introduce any user-facing change? Yes. Queries using SPJ with partial clustering followed by dedup operations will now return correct results. ### How was this patch tested? Three regression tests added. All 53 tests pass locally. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #54852 from naveenp2708/spark-55848-fix-branch-3.5. Authored-by: Naveen Kumar Puppala <naveenp2708@gmail.com> Signed-off-by: Peter Toth <peter.toth@gmail.com>
|
@peter-toth Retriggered. Thanks! |
### What changes were proposed in this pull request? Follow-up to #54751. Restores 3 comments in `satisfies0()` that were accidentally deleted during the restructuring. Comments only, no logic changes. ### Why are the changes needed? Per peter-toth's review in #54851, the original comments should be preserved. ### Does this PR introduce any user-facing change? No. Comments only. ### How was this patch tested? Compilation verified. No logic changes. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #54875 from naveenp2708/spark-55848-restore-comments-4.1. Authored-by: Naveen Kumar Puppala <naveenp2708@gmail.com> Signed-off-by: Peter Toth <peter.toth@gmail.com>
|
Hi, @naveenp2708 . Could you double-check the CI failures? We cannot merge your PR if there is a failure. |
What changes were proposed in this pull request?
Backport fix for SPARK-55848 to branch-4.0. Same fix as merged in branch-4.1 via #54751.
The fix adds an
isPartiallyClusteredflag toKeyGroupedPartitioningand restructuressatisfies0()to checkClusteredDistributionfirst, returningfalsewhen partially clustered.EnsureRequirementsthen inserts the necessary Exchange.Why are the changes needed?
SPJ with partial clustering produces incorrect results for post-join dedup operations (dropDuplicates, Window row_number).
Does this PR introduce any user-facing change?
Yes. Queries using SPJ with partial clustering followed by dedup operations will now return correct results.
How was this patch tested?
Three regression tests added. All tests pass locally.
Was this patch authored or co-authored using generative AI tooling?
No.