feat(core): introduce ColumnarRowRef with shared batch context by xylaaaaa · Pull Request #120 · alibaba/paimon-cpp

xylaaaaa · 2026-02-06T06:19:17Z

Background

KeyValueDataFileRecordReader constructs ColumnarRow objects per row, which repeatedly rebuilds row-level column views and introduces avoidable per-row overhead.

Changes

Added ColumnarBatchContext to hold batch-level shared state (StructArray, ArrayVector, MemoryPool).
Added ColumnarRowRef as a row view implementation backed by shared batch context.
Switched KeyValueDataFileRecordReader from ColumnarRow to ColumnarRowRef for key/value row materialization.
Added key_ctx_ / value_ctx_ lifecycle management and reset handling in reader state.
Added unit test coverage for ColumnarRowRef basic field access and RowKind behavior.
Registered columnar_row_ref.cpp in src/paimon/CMakeLists.txt.

Tests

cmake --build build --target paimon-common-test paimon-core-test -j8
./build/relwithdebinfo/paimon-common-test --gtest_filter="ColumnarRowTest.*:ColumnarRowRefTest.*"
./build/relwithdebinfo/paimon-core-test --gtest_filter="KeyValueDataFileRecordReaderTest.*"
./build/relwithdebinfo/paimon-core-test --gtest_filter="*KeyValueProjectionReaderTest*"

All tests passed locally.

Risk / Compatibility

No external interface changes.
Main risk is row-view lifecycle/thread-safety in projection paths; covered by the related reader/projection tests above.

Copilot

Pull request overview

This PR introduces a performance optimization for columnar data access by replacing per-row ColumnarRow construction with a shared batch context approach. The main goal is to reduce per-row overhead in KeyValueDataFileRecordReader by avoiding repeated construction of row-level column views.

Changes:

Introduced ColumnarBatchContext struct to hold batch-level shared state (struct array, array vector, memory pool)
Added ColumnarRowRef class as a lightweight row view backed by shared batch context
Refactored KeyValueDataFileRecordReader to use ColumnarRowRef instead of ColumnarRow for key/value materialization

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
src/paimon/common/data/columnar/columnar_batch_context.h	New struct to hold batch-level shared state (struct array, array vector holder, and memory pool)
src/paimon/common/data/columnar/columnar_row_ref.h	New lightweight row view class that references shared batch context instead of maintaining per-row copies
src/paimon/common/data/columnar/columnar_row_ref.cpp	Implementation of ColumnarRowRef methods for accessing nested types (decimal, timestamp, row, array, map)
src/paimon/core/io/key_value_data_file_record_reader.h	Added forward declaration and member variables for key_ctx_ and value_ctx_, removed unused columnar_row.h include
src/paimon/core/io/key_value_data_file_record_reader.cpp	Updated to use ColumnarRowRef with shared contexts; added context creation and reset handling
src/paimon/common/data/columnar/columnar_row_test.cpp	Added basic unit test for ColumnarRowRef covering field access and RowKind operations
src/paimon/CMakeLists.txt	Registered columnar_row_ref.cpp in the build configuration

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-06T06:27:11Z

src/paimon/core/io/key_value_data_file_record_reader.cpp

+    // as key is only used in merge sort, do not hold the data in ColumnarRowRef
+    auto key = std::make_unique<ColumnarRowRef>(reader_->key_ctx_, cursor_);


The comment states "do not hold the data in ColumnarRowRef" but this is misleading. Both key and value ColumnarRowRef instances hold references to the data through the shared batch context. The actual difference is that key_ctx_ is created with nullptr for struct_array (line 138) while value_ctx_ holds value_struct_array_ (line 139), which means the value context explicitly keeps the parent struct array alive. The comment should be updated to reflect this distinction more accurately, for example: "as key is only used in merge sort, key context does not hold parent struct array".

Copilot · 2026-02-06T06:27:11Z

src/paimon/core/io/key_value_data_file_record_reader.cpp

    // as value is used in merge sort and projection (maybe async and multi-thread), hold the data
-    // in ColumnarRow
-    auto value = std::make_unique<ColumnarRow>(reader_->value_struct_array_, reader_->value_fields_,
-                                               reader_->pool_, cursor_);
+    // in ColumnarRowRef
+    auto value = std::make_unique<ColumnarRowRef>(reader_->value_ctx_, cursor_);


The comment states "hold the data in ColumnarRowRef" which is technically correct but doesn't clearly explain the distinction from the key case. The key difference is that value_ctx_ is created with value_struct_array_ as the first argument (line 139), which explicitly keeps the parent struct array alive, whereas key_ctx_ uses nullptr (line 138). Consider updating the comment to: "as value is used in merge sort and projection (maybe async and multi-thread), value context holds the parent struct array to ensure data remains valid".

lszskye · 2026-02-06T07:44:02Z

src/paimon/common/data/columnar/columnar_batch_context.h

@@ -0,0 +1,49 @@
+/*
+ * Copyright 2024-present Alibaba Inc.


lxy-9602 · 2026-02-06T08:19:21Z

src/paimon/common/data/columnar/columnar_batch_context.h

+    ColumnarBatchContext(std::shared_ptr<arrow::StructArray> struct_array_in,
+                         const arrow::ArrayVector& array_vec_in,
+                         std::shared_ptr<MemoryPool> pool_in)
+        : struct_array(std::move(struct_array_in)),


Minor suggestion: To keep parameter passing style consistent, could we pass all std::shared_ptr and container arguments (like struct_array_in, array_vec_in, pool_in) as const T&?

lxy-9602 · 2026-02-06T08:28:26Z

src/paimon/common/data/columnar/columnar_row_ref.cpp

+    assert(list_array);
+    int32_t offset = list_array->value_offset(row_id_);
+    int32_t length = list_array->value_length(row_id_);
+    return std::make_shared<ColumnarArray>(list_array->values(), ctx_->pool, offset, length);


If a ColumnarArray contains nested rows (e.g., an array of structs), should it return a ColumnarRowRef for inner row access?

lxy-9602 · 2026-02-06T08:38:28Z

src/paimon/core/io/key_value_data_file_record_reader.h

 class Metrics;
+struct ColumnarBatchContext;

 // Convert the arrow array of data file into a KeyValue object iterator (parsing SEQUENCE_NUMBER and


The KeyValueInMemoryRecordReader is currently used in the PK table write path — it seems this optimization could also be applied there? It might be a good candidate for a follow-up commit.

lxy-9602 · 2026-02-06T08:52:14Z

src/paimon/common/data/columnar/columnar_batch_context.h

+    std::shared_ptr<MemoryPool> pool;
+    arrow::ArrayVector array_vec_holder;
+    std::vector<const arrow::Array*> array_vec;
+};


Previously, the struct_array_ field in ColumnarRow was used solely as a data holder and could not be directly utilized, since it might contain fields beyond those present in array_vec_.

In the new ColumnarBatchContext, however, array_vec_holder appears to serve as the primary source of columnar data. Given that a file may contain thousands of fields, maintaining separate holders like array_vec_holder could lead to unnecessary shared_ptr increments.

Suggestion: Consider using struct_array as the canonical holder and removing array_vec_holder to simplify the design and eliminate redundancy.

lxy-9602 · 2026-02-06T08:53:20Z

src/paimon/common/data/columnar/columnar_row_ref.cpp

+std::shared_ptr<InternalRow> ColumnarRowRef::GetRow(int32_t pos, int32_t num_fields) const {
+    auto struct_array =
+        arrow::internal::checked_pointer_cast<arrow::StructArray>(ctx_->array_vec_holder[pos]);
+    assert(struct_array);


May change array_vec_holderto array_vec for consistent?

Add ColumnarBatchContext and ColumnarRowRef to reduce per-row construction overhead in KeyValueDataFileRecordReader. Switch key/value row construction in KeyValueDataFileRecordReader to ColumnarRowRef and manage batch-level contexts through reader lifecycle. Add unit coverage for ColumnarRowRef basic access and RowKind behavior.

Copilot AI review requested due to automatic review settings February 6, 2026 06:19

Copilot started reviewing on behalf of xylaaaaa February 6, 2026 06:19 View session

Copilot AI reviewed Feb 6, 2026

View reviewed changes

lszskye reviewed Feb 6, 2026

View reviewed changes

lxy-9602 reviewed Feb 6, 2026

View reviewed changes

morningman mentioned this pull request Feb 9, 2026

Support Paimon data write-back: Implement write support for Paimon data. apache/doris#56005

Open

xylaaaaa added 3 commits February 13, 2026 13:16

style: fix clang-format issues in columnar row files

763d1aa

refactor(core): align columnar context naming and row ref usage

50fd013

xylaaaaa force-pushed the feature/columnar-row-ref branch from 48b8cbb to 50fd013 Compare February 13, 2026 06:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(core): introduce ColumnarRowRef with shared batch context#120

feat(core): introduce ColumnarRowRef with shared batch context#120
xylaaaaa wants to merge 3 commits intoalibaba:mainfrom
xylaaaaa:feature/columnar-row-ref

xylaaaaa commented Feb 6, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Feb 6, 2026

Uh oh!

Copilot AI Feb 6, 2026

Uh oh!

lszskye Feb 6, 2026

Uh oh!

lxy-9602 Feb 6, 2026

Uh oh!

lxy-9602 Feb 6, 2026

Uh oh!

lxy-9602 Feb 6, 2026

Uh oh!

lxy-9602 Feb 6, 2026 •

edited

Loading

Uh oh!

lxy-9602 Feb 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		// as key is only used in merge sort, do not hold the data in ColumnarRowRef
		auto key = std::make_unique<ColumnarRowRef>(reader_->key_ctx_, cursor_);

Conversation

xylaaaaa commented Feb 6, 2026

Background

Changes

Tests

Risk / Compatibility

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

Uh oh!

lszskye Feb 6, 2026

Choose a reason for hiding this comment

Uh oh!

lxy-9602 Feb 6, 2026

Choose a reason for hiding this comment

Uh oh!

lxy-9602 Feb 6, 2026

Choose a reason for hiding this comment

Uh oh!

lxy-9602 Feb 6, 2026

Choose a reason for hiding this comment

Uh oh!

lxy-9602 Feb 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lxy-9602 Feb 6, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

lxy-9602 Feb 6, 2026 •

edited

Loading