[SPARK-55848][SQL][TESTS] Add regression tests for SPJ partial clustering dedup#54714
[SPARK-55848][SQL][TESTS] Add regression tests for SPJ partial clustering dedup#54714naveenp2708 wants to merge 5 commits intoapache:masterfrom
Conversation
…dup correctness Add regression tests verifying that dropDuplicates, Window dedup, and checkpointed scans produce correct results when SPJ partial clustering is active. These tests cover the correctness issue fixed by SPARK-55848. Closes apache#54378
|
@peter-toth Per your suggestion — tests-only PR for master. Branch-4.1 PR with fix + tests coming next. |
|
Seems like check didn't run on this PR yet, most likely because you haven't enabled Github Actions in your forked repo. |
| |FROM testcat.ns.$items i | ||
| |JOIN testcat.ns.$purchases p ON i.id = p.item_id | ||
| |""".stripMargin) | ||
| checkAnswer(df, Seq(Row(1), Row(2), Row(3))) |
There was a problem hiding this comment.
Can you please check if we have the expected GroupPartitionExecs in the plan? I believe one should be under the join to align partitions to the other, partially clustered side of the join; and one should be above the join to group partitions for the aggregate.
Also please assert that we don't have any exchanges in the plan.
| // Checkpoint the items scan, then join with purchases under partial clustering, | ||
| // and finally dedup. The checkpointed scan preserves the KeyedPartitioning | ||
| // with non-grouped keys; the dedup must still produce correct distinct ids. | ||
| val itemsDf = spark.read.table(s"testcat.ns.$items").checkpoint() |
There was a problem hiding this comment.
I mean in this test we should checkpoint the partially clustered result of a join and put an aggregate on top of that checkpoint. The aggregate should require a GroupPartitionsExec on master or a shuffle on previous branches (before the refactor).
fb9cadc to
1b1b1cb
Compare
|
@peter-toth Added GroupPartitionsExec assertions to all 3 tests:
|
|
@peter-toth Enabled the workflows on my fork — "Build" and "Report test results" both passed. |
|
@naveenp2708, can you please take a look at #54751 (comment) and adjust the test case there and here as well? Actually, probably we should agree on the fix in #54751 first and then update this PR. |
| } | ||
| } | ||
|
|
||
| test("SPARK-55848: Window dedup after SPJ with partial clustering should produce " + |
There was a problem hiding this comment.
nit: 'should produce correct results' is redundant :)
| // SPJ. The WINDOW operator requires ClusteredDistribution on i.id; with partial | ||
| // clustering the plan must insert the right exchange/group so that the window | ||
| // produces exactly one row per id. | ||
| val df = sql( |
There was a problem hiding this comment.
should we use 'selectWithMergeJoinHint' for consistency?
1b1b1cb to
ad6cf20
Compare
|
@szehon-ho Fixed both — shortened test names and switched to selectWithMergeJoinHint. Thanks for the review! |
|
@peter-toth Makes sense — PR #54751 now has approvals from you and @szehon-ho. Once it's merged, I'll update this PR to align the tests. Already addressed szehon-ho's nits (shortened test names + selectWithMergeJoinHint). |
|
@naveenp2708 , I merged #54751, can you please update this PR? Also, let's add |
ad6cf20 to
2442dea
Compare
|
@peter-toth Updated title with [TESTS] tag. Tests aligned with merged #54751 — checkpointed test checkpoints join result, AQE disabled, GroupPartitionsExec assertions, shortened names, selectWithMergeJoinHint. Will open backport PRs for branch-4.0 and branch-3.5 next. |
|
@naveenp2708 , can you please double check the PR title? It looks incomplete and I don't see |
|
@peter-toth Fixed the title to include [TESTS] tag. Retriggered the checks. |
Thank you @naveenp2708. Could you please retrigger the tests again? |
|
@peter-toth Retriggered. Thanks for the approval! |
What changes were proposed in this pull request?
Test-only PR. Adds regression tests for SPARK-55848 (SPJ partial clustering produces incorrect results for post-join dedup operations).
Three tests added to KeyGroupedPartitioningSuite:
Why are the changes needed?
The fix was merged via #54330, but regression tests for the correctness issue (SPARK-55848 / #54378) were not included. These tests ensure the issue does not regress.
Does this PR introduce any user-facing change?
No. Test-only change.
How was this patch tested?
All 73 tests in KeyGroupedPartitioningSuite pass.
Was this patch authored or co-authored using generative AI tooling?
No.