Skip to content

[SPARK-55848][SQL][TESTS] Add regression tests for SPJ partial clustering dedup#54714

Open
naveenp2708 wants to merge 5 commits intoapache:masterfrom
naveenp2708:spark-55848-tests-master
Open

[SPARK-55848][SQL][TESTS] Add regression tests for SPJ partial clustering dedup#54714
naveenp2708 wants to merge 5 commits intoapache:masterfrom
naveenp2708:spark-55848-tests-master

Conversation

@naveenp2708
Copy link

What changes were proposed in this pull request?

Test-only PR. Adds regression tests for SPARK-55848 (SPJ partial clustering produces incorrect results for post-join dedup operations).

Three tests added to KeyGroupedPartitioningSuite:

  1. SPARK-55848: dropDuplicates after SPJ with partial clustering
  2. SPARK-55848: Window dedup after SPJ with partial clustering
  3. SPARK-55848: checkpointed scan with partial clustering and dedup

Why are the changes needed?

The fix was merged via #54330, but regression tests for the correctness issue (SPARK-55848 / #54378) were not included. These tests ensure the issue does not regress.

Does this PR introduce any user-facing change?

No. Test-only change.

How was this patch tested?

All 73 tests in KeyGroupedPartitioningSuite pass.

Was this patch authored or co-authored using generative AI tooling?

No.

…dup correctness

Add regression tests verifying that dropDuplicates, Window dedup, and
checkpointed scans produce correct results when SPJ partial clustering
is active. These tests cover the correctness issue fixed by SPARK-55848.

Closes apache#54378
@naveenp2708
Copy link
Author

@peter-toth Per your suggestion — tests-only PR for master. Branch-4.1 PR with fix + tests coming next.

@peter-toth
Copy link
Contributor

Seems like check didn't run on this PR yet, most likely because you haven't enabled Github Actions in your forked repo.
Can you please check the "Pull request" section at https://spark.apache.org/contributing.html? Or if you click on the pending check that also shows you how to enable/retrigger the necessary workflows.

|FROM testcat.ns.$items i
|JOIN testcat.ns.$purchases p ON i.id = p.item_id
|""".stripMargin)
checkAnswer(df, Seq(Row(1), Row(2), Row(3)))
Copy link
Contributor

@peter-toth peter-toth Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please check if we have the expected GroupPartitionExecs in the plan? I believe one should be under the join to align partitions to the other, partially clustered side of the join; and one should be above the join to group partitions for the aggregate.
Also please assert that we don't have any exchanges in the plan.

// Checkpoint the items scan, then join with purchases under partial clustering,
// and finally dedup. The checkpointed scan preserves the KeyedPartitioning
// with non-grouped keys; the dedup must still produce correct distinct ids.
val itemsDf = spark.read.table(s"testcat.ns.$items").checkpoint()
Copy link
Contributor

@peter-toth peter-toth Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean in this test we should checkpoint the partially clustered result of a join and put an aggregate on top of that checkpoint. The aggregate should require a GroupPartitionsExec on master or a shuffle on previous branches (before the refactor).

@naveenp2708 naveenp2708 force-pushed the spark-55848-tests-master branch from fb9cadc to 1b1b1cb Compare March 10, 2026 16:33
@naveenp2708
Copy link
Author

naveenp2708 commented Mar 10, 2026

@peter-toth Added GroupPartitionsExec assertions to all 3 tests:

  • 2 GroupPartitionsExec under the join (one per child to align partitions)
  • 1 GroupPartitionsExec above the join (for aggregate/window grouping)
  • Total of 3 GroupPartitionsExec per test plan
    enabled the workflows on my fork. CI should run now.

@naveenp2708
Copy link
Author

@peter-toth Enabled the workflows on my fork — "Build" and "Report test results" both passed.

@peter-toth
Copy link
Contributor

peter-toth commented Mar 11, 2026

@naveenp2708, can you please take a look at #54751 (comment) and adjust the test case there and here as well? Actually, probably we should agree on the fix in #54751 first and then update this PR.

Copy link
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the tests!

}
}

test("SPARK-55848: Window dedup after SPJ with partial clustering should produce " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: 'should produce correct results' is redundant :)

// SPJ. The WINDOW operator requires ClusteredDistribution on i.id; with partial
// clustering the plan must insert the right exchange/group so that the window
// produces exactly one row per id.
val df = sql(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use 'selectWithMergeJoinHint' for consistency?

@naveenp2708 naveenp2708 force-pushed the spark-55848-tests-master branch from 1b1b1cb to ad6cf20 Compare March 13, 2026 02:28
@naveenp2708
Copy link
Author

naveenp2708 commented Mar 13, 2026

@szehon-ho Fixed both — shortened test names and switched to selectWithMergeJoinHint. Thanks for the review!

@naveenp2708
Copy link
Author

@peter-toth Makes sense — PR #54751 now has approvals from you and @szehon-ho. Once it's merged, I'll update this PR to align the tests. Already addressed szehon-ho's nits (shortened test names + selectWithMergeJoinHint).

@peter-toth
Copy link
Contributor

@naveenp2708 , I merged #54751, can you please update this PR? Also, let's add [TESTS] tag to this one.

@naveenp2708 naveenp2708 force-pushed the spark-55848-tests-master branch from ad6cf20 to 2442dea Compare March 17, 2026 03:52
@naveenp2708
Copy link
Author

@peter-toth Updated title with [TESTS] tag. Tests aligned with merged #54751 — checkpointed test checkpoints join result, AQE disabled, GroupPartitionsExec assertions, shortened names, selectWithMergeJoinHint. Will open backport PRs for branch-4.0 and branch-3.5 next.

@peter-toth
Copy link
Contributor

@naveenp2708 , can you please double check the PR title? It looks incomplete and I don't see [TESTS] there; and also retrigger the checks (the failure seems unrelated)?

@naveenp2708 naveenp2708 changed the title [SPARK-55848][SQL] Add regression tests for SPJ partial clustering de… [SPARK-55848][SQL][TESTS] Add regression tests for SPJ partial clustering dedup Mar 17, 2026
@naveenp2708
Copy link
Author

@peter-toth Fixed the title to include [TESTS] tag. Retriggered the checks.

@peter-toth
Copy link
Contributor

@peter-toth Fixed the title to include [TESTS] tag. Retriggered the checks.

Thank you @naveenp2708. Could you please retrigger the tests again?

@naveenp2708
Copy link
Author

@peter-toth Retriggered. Thanks for the approval!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants