[SPARK-56047][SQL] Propagate distinctCount through Union in CBO statistics estimation#54883
Draft
LuciferYang wants to merge 3 commits intoapache:masterfrom
Draft
[SPARK-56047][SQL] Propagate distinctCount through Union in CBO statistics estimation#54883LuciferYang wants to merge 3 commits intoapache:masterfrom
LuciferYang wants to merge 3 commits intoapache:masterfrom
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
UnionEstimationcurrently propagatesmin,max, andnullCountcolumn statistics, but does not propagatedistinctCount. This meansColumnStat.hasCountStats(which requires bothdistinctCount.isDefinedandnullCount.isDefined) always returnsfalsefor Union output columns, so downstream operators likeAggregateEstimationcannot leverage CBO and fall back toSizeInBytesOnlyStatsPlanVisitor.This PR extends
UnionEstimationto also propagatedistinctCount, enabling downstream operators (AggregateEstimation,JoinEstimation, etc.) that depend onhasCountStatsto produce accurate CBO estimates for queries overUNION ALL, improving physical plan selection (join ordering, join strategy, etc.).Main changes of (
UnionEstimation.scala):computeDistinctCountStats(union, outputRows)method that follows the same structure as the existingcomputeNullCountStats: it iterates transposed children output attributes, only computes when all children havedistinctCountfor a given column, takesmax(distinctCount)across children, and caps the result byoutputRowswhen available.estimate()to callcomputeDistinctCountStatsand layer distinctCount on top of the merged min/max + nullCount stats (three-layer merge).For UNION ALL, the true
distinctCountsatisfies:max(dc_i) <= true_dc <= min(sum(dc_i), rowCount). We usemaxas the estimate because UNION ALL branches typically share overlapping values (e.g.,web_salesandcatalog_salesreference the same date dimension keys), makingmaxa reasonable approximation. Even in the worst case of completely disjoint values,maxunderestimates by at most N× (N = number of branches, typically 2–3), which is far better than losing CBO entirely.Why are the changes needed?
For queries over
UNION ALLwith CBO enabled, the missingdistinctCounton Union output causes downstream estimators (e.g.,AggregateEstimation,JoinEstimation) that checkhasCountStatsto fall back toSizeInBytesOnlyStatsPlanVisitor, producing inflatedsizeInBytesestimates. This can lead the optimizer to choose suboptimal join ordering and join strategies.For example, in TPC-DS sf100:
Does this PR introduce any user-facing change?
No. This is an internal improvement to the CBO statistics estimation. Query results are unchanged; only physical plan selection (join ordering, join strategy, etc.) may differ due to more accurate cardinality estimates.
How was this patch tested?
UnionEstimationSuite(3 tests) andBasicStatsEstimationSuite(1 test) to include the now-propagateddistinctCountin expectedColumnStatvalues.UnionEstimationSuite:SPARK-56047: distinctCount propagated as max across children: verifies max(100, 200) = 200.SPARK-56047: distinctCount omitted when one child lacks it: verifies distinctCount is None when not all children have it.SPARK-56047: distinctCount capped by rowCount: verifies min(max(500, 300), 6) = 6.SPARK-56047: hasCountStats is true when both distinctCount and nullCount propagated: verifies end-to-endhasCountStatscorrectness.UnionAggregateEstimationSuite(3 tests) verifying thatAggregateEstimationproduces accurate CBO estimates (correctrowCount, smallsizeInBytes) when operating on Union output, compared against a control group without Union.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Sonnet 4.6.