Description
Search before asking
- I searched in the issues and found nothing similar.
Motivation
Currently, ReadBuilder::new_read() always passes the full table schema fields to TableRead, which means all columns are read from Parquet data files even when users only need a few columns.
For wide tables with dozens or hundreds of columns, reading every column adds significant unnecessary I/O overhead, especially on remote storage (S3/OSS). Column projection (a.k.a. column pruning) lets users specify which columns to read, so the remaining columns are skipped entirely at the Parquet level.
The underlying ArrowReader already supports column clipping via Parquet ProjectionMask, but this capability is not exposed through the public ReadBuilder / TableRead API.
Solution
Add a with_projection method to ReadBuilder that accepts a list of column names. When new_read() is called, filter table.schema().fields() to only include the projected columns, then pass the filtered Vec<DataField> to TableRead and subsequently to ArrowReader.
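A minimal sketch of the proposed filtering step, to make the intended behavior concrete. The `DataField` and `ReadBuilder` types below are simplified, hypothetical stand-ins rather than the crate's real definitions, and `read_fields()` is an illustrative helper standing in for the part of `new_read()` that selects fields. Returning an error for unknown column names and keeping schema order for the result are assumptions, not decisions stated in this issue.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the crate's DataField; the real type
/// carries more metadata (data type, description, ...).
#[derive(Debug, Clone, PartialEq)]
pub struct DataField {
    pub id: i32,
    pub name: String,
}

/// Hypothetical, simplified ReadBuilder that only models projection.
pub struct ReadBuilder {
    schema_fields: Vec<DataField>,
    projection: Option<Vec<String>>,
}

impl ReadBuilder {
    pub fn new(schema_fields: Vec<DataField>) -> Self {
        Self { schema_fields, projection: None }
    }

    /// Proposed API: remember the column names the caller wants to read.
    pub fn with_projection(mut self, columns: &[&str]) -> Self {
        self.projection = Some(columns.iter().map(|c| c.to_string()).collect());
        self
    }

    /// Illustrative helper: the fields that new_read() would hand to TableRead.
    /// Without a projection, the full schema is returned (current behavior).
    pub fn read_fields(&self) -> Result<Vec<DataField>, String> {
        let Some(projection) = &self.projection else {
            return Ok(self.schema_fields.clone());
        };

        // Assumption: an unknown column name is an error rather than silently ignored.
        for name in projection {
            if !self.schema_fields.iter().any(|f| &f.name == name) {
                return Err(format!("unknown column in projection: {name}"));
            }
        }

        // Filtering the schema keeps the fields in schema order,
        // regardless of the order in which the caller listed them.
        let wanted: HashSet<&str> = projection.iter().map(String::as_str).collect();
        Ok(self
            .schema_fields
            .iter()
            .filter(|f| wanted.contains(f.name.as_str()))
            .cloned()
            .collect())
    }
}
```

With this shape, a call such as `ReadBuilder::new(fields).with_projection(&["value", "id"])` would yield only the `id` and `value` fields, in schema order. Whether the result should instead follow the caller's order is an open design question for the PR.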
Anything else?
This is a prerequisite for efficient table reads and a building block toward predicate pushdown support.
Willingness to contribute
- I'm willing to submit a PR!