
feat: support datafusion integration #150

Merged
JingsongLi merged 3 commits into apache:main from luoyuxia:introduce-datafusion-pr
Mar 24, 2026

@luoyuxia luoyuxia commented Mar 22, 2026

Purpose

Linked issue: close #149

Add an initial read-only DataFusion integration for Paimon tables so users can register a Paimon table as a DataFusion TableProvider and query it with SQL/DataFrame APIs.

Brief change log

  • add a new paimon-datafusion integration crate
  • implement a read-only PaimonTableProvider and physical scan plan for DataFusion
  • convert Paimon schema types to Arrow/DataFusion schema types, including time, timestamp, local-zoned timestamp, and decimal precision mapping
  • add DataFusion integration tests for reading a log table and a DV-backed primary key table
  • wire the DataFusion integration test into CI's integration job
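The schema conversion mentioned in the change log can be pictured with a small standalone sketch. Everything in it is an assumption made for illustration: the `PaimonType`, `TimeUnit` and `ArrowType` enums are simplified stand-ins rather than the real `paimon` or `arrow` types, and the precision-to-unit rule (0 to seconds, 1-3 to milliseconds, 4-6 to microseconds, 7-9 to nanoseconds), the `Time32(Millisecond)` choice and the "UTC" zone are guesses at a reasonable mapping, not the rules this PR implements. The only external fact relied on is that Arrow's `Decimal128` holds at most 38 digits of precision.

```rust
// Illustrative stand-ins only: these are NOT the paimon or arrow crate types.
#[derive(Debug, Clone, PartialEq)]
enum PaimonType {
    Time,
    Timestamp { precision: u32 },
    LocalZonedTimestamp { precision: u32 },
    Decimal { precision: u32, scale: u32 },
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

#[derive(Debug, Clone, PartialEq)]
enum ArrowType {
    Time32(TimeUnit),
    Timestamp(TimeUnit, Option<String>),
    Decimal128(u8, i8),
}

// Arrow's Decimal128 stores at most 38 decimal digits.
const MAX_DECIMAL128_PRECISION: u32 = 38;

// Assumed rule: pick the coarsest unit that still holds `precision` fractional digits.
fn unit_for_precision(precision: u32) -> Result<TimeUnit, String> {
    match precision {
        0 => Ok(TimeUnit::Second),
        1..=3 => Ok(TimeUnit::Millisecond),
        4..=6 => Ok(TimeUnit::Microsecond),
        7..=9 => Ok(TimeUnit::Nanosecond),
        p => Err(format!("unsupported timestamp precision: {}", p)),
    }
}

fn to_arrow(t: &PaimonType) -> Result<ArrowType, String> {
    match t {
        // Assumption: time-of-day values are carried at millisecond resolution.
        PaimonType::Time => Ok(ArrowType::Time32(TimeUnit::Millisecond)),
        // A plain timestamp carries no zone.
        PaimonType::Timestamp { precision } => {
            Ok(ArrowType::Timestamp(unit_for_precision(*precision)?, None))
        }
        // Assumption: a local-zoned timestamp is tagged with "UTC".
        PaimonType::LocalZonedTimestamp { precision } => Ok(ArrowType::Timestamp(
            unit_for_precision(*precision)?,
            Some("UTC".to_string()),
        )),
        PaimonType::Decimal { precision, scale } => {
            if *precision == 0 || *precision > MAX_DECIMAL128_PRECISION || scale > precision {
                return Err(format!("unsupported decimal({}, {})", precision, scale));
            }
            // The casts cannot truncate: both values are at most 38 here.
            Ok(ArrowType::Decimal128(*precision as u8, *scale as i8))
        }
    }
}
```

Returning a `Result` rather than silently clamping keeps unsupported precisions visible to the caller, which seems preferable for a schema conversion that runs once per table registration.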

Tests

  • cargo test -p paimon-datafusion --test read_tables --no-run

API and Format

  • adds a new integration crate: crates/integrations/datafusion
  • no storage format changes
  • no changes to existing public APIs in paimon

Documentation

  • no additional documentation changes in this PR

@luoyuxia luoyuxia force-pushed the introduce-datafusion-pr branch 3 times, most recently from e59d251 to 7a4ebf5 Compare March 22, 2026 04:12
@luoyuxia luoyuxia force-pushed the introduce-datafusion-pr branch from 7a4ebf5 to 2138bc7 Compare March 22, 2026 04:18
pub(crate) fn new(schema: ArrowSchemaRef, table: Table) -> Self {
    let plan_properties = PlanProperties::new(
        EquivalenceProperties::new(schema.clone()),
        Partitioning::UnknownPartitioning(1),

Is the single-partition execution intentional for the initial version?

PaimonTableScan hardcodes UnknownPartitioning(1), and execute() plans/reads all Paimon splits inside one execution partition. Since the underlying Paimon scan already produces bin-packed splits, this means we lose DataFusion parallelism completely.

Not necessarily a blocker for the first PR, but I think this limitation should be called out explicitly (e.g. a // TODO comment), or followed up by exposing one execution partition per Paimon split.
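A rough, dependency-free sketch of the follow-up suggested above, assuming the splits are planned before the scan node is built so that the partition count is known up front. `Split` and `ScanSketch` are hypothetical stand-ins, not types from this PR; in a real plan the value returned by `partition_count` would be the `n` passed to `Partitioning::UnknownPartitioning(n)`, and `splits_for` would be consulted inside `execute()` for the requested partition index.

```rust
/// Hypothetical stand-in for a Paimon split (a bin-packed group of data files).
#[derive(Debug, Clone, PartialEq)]
struct Split {
    files: Vec<String>,
}

/// Hypothetical stand-in for the scan node, holding splits planned ahead of time.
struct ScanSketch {
    splits: Vec<Split>,
    one_partition_per_split: bool,
}

impl ScanSketch {
    /// Number of execution partitions the plan would advertise.
    fn partition_count(&self) -> usize {
        if self.one_partition_per_split {
            // Always advertise at least one partition, even for an empty table.
            self.splits.len().max(1)
        } else {
            // Behaviour described in the review: everything in one partition.
            1
        }
    }

    /// Splits that the given execution partition would read.
    fn splits_for(&self, partition: usize) -> Result<&[Split], String> {
        if partition >= self.partition_count() {
            return Err(format!(
                "partition {} out of range (plan has {})",
                partition,
                self.partition_count()
            ));
        }
        if !self.one_partition_per_split || self.splits.is_empty() {
            // Single-partition mode, or nothing to read: hand back all splits.
            return Ok(&self.splits[..]);
        }
        // One split per execution partition.
        Ok(&self.splits[partition..partition + 1])
    }
}
```

With three planned splits, the per-split mode would advertise three partitions that DataFusion can execute in parallel, while the single-partition mode returns all three splits for partition 0.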


@JingsongLi JingsongLi left a comment

+1 Let's move forward

@JingsongLi JingsongLi merged commit 7d0a80a into apache:main Mar 24, 2026
8 checks passed
Development

Successfully merging this pull request may close these issues.

introduce paimon-datafusion

3 participants