[SPARK-56047][SQL] Propagate distinctCount through Union in CBO statistics estimation by LuciferYang · Pull Request #54883 · apache/spark

LuciferYang · 2026-03-18T12:17:43Z

What changes were proposed in this pull request?

UnionEstimation currently propagates min, max, and nullCount column statistics, but does not propagate distinctCount. This means ColumnStat.hasCountStats (which requires both distinctCount.isDefined and nullCount.isDefined) always returns false for Union output columns, so downstream operators like AggregateEstimation cannot leverage CBO and fall back to SizeInBytesOnlyStatsPlanVisitor.

This PR extends UnionEstimation to also propagate distinctCount, enabling downstream operators (AggregateEstimation, JoinEstimation, etc.) that depend on hasCountStats to produce accurate CBO estimates for queries over UNION ALL, improving physical plan selection (join ordering, join strategy, etc.).

Main changes of (UnionEstimation.scala):

Added computeDistinctCountStats(union, outputRows) method that follows the same structure as the existing computeNullCountStats: it iterates transposed children output attributes, only computes when all children have distinctCount for a given column, takes max(distinctCount) across children, and caps the result by outputRows when available.
Modified estimate() to call computeDistinctCountStats and layer distinctCount on top of the merged min/max + nullCount stats (three-layer merge).

For UNION ALL, the true distinctCount satisfies: max(dc_i) <= true_dc <= min(sum(dc_i), rowCount). We use max as the estimate because UNION ALL branches typically share overlapping values (e.g., web_sales and catalog_sales reference the same date dimension keys), making max a reasonable approximation. Even in the worst case of completely disjoint values, max underestimates by at most N× (N = number of branches, typically 2–3), which is far better than losing CBO entirely.

Why are the changes needed?

For queries over UNION ALL with CBO enabled, the missing distinctCount on Union output causes downstream estimators (e.g., AggregateEstimation, JoinEstimation) that check hasCountStats to fall back to SizeInBytesOnlyStatsPlanVisitor, producing inflated sizeInBytes estimates. This can lead the optimizer to choose suboptimal join ordering and join strategies.

For example, in TPC-DS sf100:

q5/q5a: More accurate cardinality estimates cause the optimizer to reorder the date_dim join and the dimension table join (store/catalog_page/web_site).
q54: More accurate estimates allow the optimizer to recognize that a Union-derived subquery result is small enough to broadcast, converting 2 SortMergeJoins to BroadcastHashJoins and eliminating 4 Sort + 2 Exchange operators (16→13 codegen stages).

Does this PR introduce any user-facing change?

No. This is an internal improvement to the CBO statistics estimation. Query results are unchanged; only physical plan selection (join ordering, join strategy, etc.) may differ due to more accurate cardinality estimates.

How was this patch tested?

1. Updated 4 existing tests in UnionEstimationSuite (3 tests) and BasicStatsEstimationSuite (1 test) to include the now-propagated distinctCount in expected ColumnStat values.
Added 4 new unit tests in UnionEstimationSuite:
- SPARK-56047: distinctCount propagated as max across children: verifies max(100, 200) = 200.
- SPARK-56047: distinctCount omitted when one child lacks it: verifies distinctCount is None when not all children have it.
- SPARK-56047: distinctCount capped by rowCount: verifies min(max(500, 300), 6) = 6.
- SPARK-56047: hasCountStats is true when both distinctCount and nullCount propagated: verifies end-to-end hasCountStats correctness.
Added UnionAggregateEstimationSuite (3 tests) verifying that AggregateEstimation produces accurate CBO estimates (correct rowCount, small sizeInBytes) when operating on Union output, compared against a control group without Union.
Regenerated 3 TPC-DS plan stability golden files (q5.sf100, q54.sf100, q5a.sf100) that reflect improved physical plan selection (join reordering in q5/q5a, SortMergeJoin→BroadcastHashJoin in q54) due to more accurate cardinality estimates after Union.
Pass Github Actions.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6.

LuciferYang added 2 commits March 18, 2026 20:10

init

6486efe

test name

9d2e3f4

LuciferYang marked this pull request as draft March 18, 2026 12:17

fix one test

b10e2df

LuciferYang marked this pull request as ready for review March 18, 2026 18:05

LuciferYang marked this pull request as draft March 18, 2026 18:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-56047][SQL] Propagate distinctCount through Union in CBO statistics estimation#54883

[SPARK-56047][SQL] Propagate distinctCount through Union in CBO statistics estimation#54883
LuciferYang wants to merge 3 commits intoapache:masterfrom
LuciferYang:SPARK-56047

LuciferYang commented Mar 18, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

LuciferYang commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

LuciferYang commented Mar 18, 2026 •

edited

Loading