Skip to content

[SPARK-56047][SQL] Propagate distinctCount through Union in CBO statistics estimation#54883

Draft
LuciferYang wants to merge 3 commits intoapache:masterfrom
LuciferYang:SPARK-56047
Draft

[SPARK-56047][SQL] Propagate distinctCount through Union in CBO statistics estimation#54883
LuciferYang wants to merge 3 commits intoapache:masterfrom
LuciferYang:SPARK-56047

Conversation

@LuciferYang
Copy link
Contributor

@LuciferYang LuciferYang commented Mar 18, 2026

What changes were proposed in this pull request?

UnionEstimation currently propagates min, max, and nullCount column statistics, but does not propagate distinctCount. This means ColumnStat.hasCountStats (which requires both distinctCount.isDefined and nullCount.isDefined) always returns false for Union output columns, so downstream operators like AggregateEstimation cannot leverage CBO and fall back to SizeInBytesOnlyStatsPlanVisitor.

This PR extends UnionEstimation to also propagate distinctCount, enabling downstream operators (AggregateEstimation, JoinEstimation, etc.) that depend on hasCountStats to produce accurate CBO estimates for queries over UNION ALL, improving physical plan selection (join ordering, join strategy, etc.).

Main changes of (UnionEstimation.scala):

  • Added computeDistinctCountStats(union, outputRows) method that follows the same structure as the existing computeNullCountStats: it iterates transposed children output attributes, only computes when all children have distinctCount for a given column, takes max(distinctCount) across children, and caps the result by outputRows when available.
  • Modified estimate() to call computeDistinctCountStats and layer distinctCount on top of the merged min/max + nullCount stats (three-layer merge).

For UNION ALL, the true distinctCount satisfies: max(dc_i) <= true_dc <= min(sum(dc_i), rowCount). We use max as the estimate because UNION ALL branches typically share overlapping values (e.g., web_sales and catalog_sales reference the same date dimension keys), making max a reasonable approximation. Even in the worst case of completely disjoint values, max underestimates by at most N× (N = number of branches, typically 2–3), which is far better than losing CBO entirely.

Why are the changes needed?

For queries over UNION ALL with CBO enabled, the missing distinctCount on Union output causes downstream estimators (e.g., AggregateEstimation, JoinEstimation) that check hasCountStats to fall back to SizeInBytesOnlyStatsPlanVisitor, producing inflated sizeInBytes estimates. This can lead the optimizer to choose suboptimal join ordering and join strategies.

For example, in TPC-DS sf100:

  • q5/q5a: More accurate cardinality estimates cause the optimizer to reorder the date_dim join and the dimension table join (store/catalog_page/web_site).
  • q54: More accurate estimates allow the optimizer to recognize that a Union-derived subquery result is small enough to broadcast, converting 2 SortMergeJoins to BroadcastHashJoins and eliminating 4 Sort + 2 Exchange operators (16→13 codegen stages).

Does this PR introduce any user-facing change?

No. This is an internal improvement to the CBO statistics estimation. Query results are unchanged; only physical plan selection (join ordering, join strategy, etc.) may differ due to more accurate cardinality estimates.

How was this patch tested?

    1. Updated 4 existing tests in UnionEstimationSuite (3 tests) and BasicStatsEstimationSuite (1 test) to include the now-propagated distinctCount in expected ColumnStat values.
  1. Added 4 new unit tests in UnionEstimationSuite:
    • SPARK-56047: distinctCount propagated as max across children: verifies max(100, 200) = 200.
    • SPARK-56047: distinctCount omitted when one child lacks it: verifies distinctCount is None when not all children have it.
    • SPARK-56047: distinctCount capped by rowCount: verifies min(max(500, 300), 6) = 6.
    • SPARK-56047: hasCountStats is true when both distinctCount and nullCount propagated: verifies end-to-end hasCountStats correctness.
  2. Added UnionAggregateEstimationSuite (3 tests) verifying that AggregateEstimation produces accurate CBO estimates (correct rowCount, small sizeInBytes) when operating on Union output, compared against a control group without Union.
  3. Regenerated 3 TPC-DS plan stability golden files (q5.sf100, q54.sf100, q5a.sf100) that reflect improved physical plan selection (join reordering in q5/q5a, SortMergeJoin→BroadcastHashJoin in q54) due to more accurate cardinality estimates after Union.
  4. Pass Github Actions.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6.

@LuciferYang LuciferYang marked this pull request as draft March 18, 2026 12:17
@LuciferYang LuciferYang marked this pull request as ready for review March 18, 2026 18:05
@LuciferYang LuciferYang marked this pull request as draft March 18, 2026 18:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant