feat(table): support column projection in ReadBuilder #146

@QuakeWang

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

Currently, ReadBuilder::new_read() always passes the full table schema fields to TableRead, which means all columns are read from Parquet data files even when users only need a few columns.

For wide tables with dozens or hundreds of columns, reading all columns introduces significant unnecessary I/O overhead, especially on remote storage (S3/OSS). Column projection (a.k.a. column pruning) allows users to specify which columns to read, and skip reading the rest entirely at the Parquet level.

The underlying ArrowReader already supports column clipping via Parquet ProjectionMask, but this capability is not exposed through the public ReadBuilder / TableRead API.

Solution

Add a with_projection method to ReadBuilder that accepts a list of column names. When new_read() is called, filter table.schema().fields() to only include the projected columns, then pass the filtered Vec<DataField> to TableRead and subsequently to ArrowReader.
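A minimal sketch of the proposed filtering, assuming simplified stand-ins for the real types: `DataField`, `Table`, and the `read_fields` helper below are hypothetical and only model what `new_read()` would hand to `TableRead`. Returning an error for unknown column names is also an assumption, not something this issue specifies.

```rust
/// Hypothetical stand-in for the crate's `DataField`; only the name matters here.
#[derive(Debug, Clone, PartialEq)]
pub struct DataField {
    pub id: i32,
    pub name: String,
}

/// Hypothetical stand-in for the table handle that owns the full schema.
pub struct Table {
    fields: Vec<DataField>,
}

impl Table {
    pub fn new(fields: Vec<DataField>) -> Self {
        Self { fields }
    }

    pub fn fields(&self) -> &[DataField] {
        &self.fields
    }
}

pub struct ReadBuilder<'a> {
    table: &'a Table,
    /// `None` means "read every column", which matches the current behaviour.
    projection: Option<Vec<String>>,
}

impl<'a> ReadBuilder<'a> {
    pub fn new(table: &'a Table) -> Self {
        Self { table, projection: None }
    }

    /// Records the column names to read; the schema is only consulted later.
    pub fn with_projection(mut self, columns: &[&str]) -> Self {
        self.projection = Some(columns.iter().map(|c| c.to_string()).collect());
        self
    }

    /// Hypothetical helper: the fields that `new_read()` would pass on.
    /// Filters the schema fields, so the result keeps schema order.
    pub fn read_fields(&self) -> Result<Vec<DataField>, String> {
        let fields = self.table.fields();
        match &self.projection {
            None => Ok(fields.to_vec()),
            Some(names) => {
                // Assumption: an unknown column name is reported as an error.
                if let Some(missing) = names
                    .iter()
                    .find(|n| !fields.iter().any(|f| f.name.as_str() == n.as_str()))
                {
                    return Err(format!("unknown column in projection: {missing}"));
                }
                Ok(fields
                    .iter()
                    .filter(|f| names.contains(&f.name))
                    .cloned()
                    .collect())
            }
        }
    }
}
```

With a schema of `id`, `name`, `score`, calling `with_projection(&["score", "id"])` would yield the `id` and `score` fields in schema order, because the sketch filters the schema rather than reordering by the requested names. Whether the real API should preserve the caller's order instead is a design choice left open here.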

Anything else?

This is a prerequisite for efficient table reads and a building block toward predicate pushdown support.

Willingness to contribute

  • I'm willing to submit a PR!
