
refactor(core): apply ColumnarRowRef in KeyValueInMemoryRecordReader #144

Open
xylaaaaa wants to merge 4 commits into alibaba:main from xylaaaaa:followup/in-memory-row-ref

Conversation

@xylaaaaa (Contributor)

Purpose

Linked issue: close #xxx

This PR applies the ColumnarRowRef optimization to KeyValueInMemoryRecordReader (the PK write path) to reduce per-row row-view construction overhead and to align its behavior with KeyValueDataFileRecordReader.

Changes:

  • Replace per-row ColumnarRow construction with ColumnarRowRef for both key and value materialization in KeyValueInMemoryRecordReader::Iterator::Next.
  • Add batch-level contexts key_ctx_ and value_ctx_ to KeyValueInMemoryRecordReader.
  • Initialize the contexts in NextBatch() and reset them in Close().
  • Keep lifecycle semantics unchanged: both the key and value contexts hold value_struct_array_, matching the previous behavior where the key/value rows held the parent struct array. A rough sketch of the resulting flow follows this list.
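A rough sketch of that flow, for orientation only. ColumnarBatchContext, ColumnarRowRef, key_ctx_, value_ctx_, value_struct_array_, NextBatch(), and Close() are names from this PR; key_fields_, value_fields_, pool_, and row_index are illustrative placeholders, not the actual reader members:

// NextBatch(): build the two batch-level contexts once per batch.
key_ctx_ = std::make_shared<ColumnarBatchContext>(value_struct_array_, key_fields_, pool_);
value_ctx_ = std::make_shared<ColumnarBatchContext>(value_struct_array_, value_fields_, pool_);

// Iterator::Next(): per-row work shrinks to constructing lightweight views
// over the shared contexts instead of building ColumnarRow objects each time.
ColumnarRowRef key(key_ctx_, row_index);
ColumnarRowRef value(value_ctx_, row_index);

// Close(): drop the batch references along with the rest of the reader state.
key_ctx_.reset();
value_ctx_.reset();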

Tests

  • cmake --build build --target paimon-core-test -j60
  • ./build/relwithdebinfo/paimon-core-test --gtest_filter="KeyValueInMemoryRecordReaderTest.*"

All listed tests passed locally.

API and Format

No API change in include/.
No storage format or protocol change.

Documentation

No documentation update is required, since this is an internal refactor/performance optimization and introduces no new user-facing feature.

Commits

  • Add ColumnarBatchContext and ColumnarRowRef to reduce per-row construction overhead in KeyValueDataFileRecordReader.
  • Switch key/value row construction in KeyValueDataFileRecordReader to ColumnarRowRef and manage batch-level contexts through reader lifecycle.
  • Add unit coverage for ColumnarRowRef basic access and RowKind behavior.
Copilot AI review requested due to automatic review settings on February 13, 2026 at 06:28

Copilot AI left a comment

Pull request overview

This PR refactors the core KeyValue record readers to use a shared, batch-level ColumnarBatchContext with ColumnarRowRef row views, reducing per-row construction overhead in both in-memory (PK write path) and data-file readers.

Changes:

  • Replace per-row ColumnarRow creation with ColumnarRowRef in KeyValueInMemoryRecordReader and KeyValueDataFileRecordReader.
  • Introduce ColumnarBatchContext to cache field arrays and their raw pointers, and manage the per-batch lifecycle via reader-owned contexts (a short illustration follows this list).
  • Add initial unit coverage for ColumnarRowRef and update ColumnarArray::GetRow to return ColumnarRowRef.
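For orientation, a minimal usage illustration assembled only from the constructor shapes that appear later in this review; the variable names (batch, pool, row index) are illustrative, not from the codebase:

// Batch level: one shared context per Arrow StructArray batch.
auto ctx = std::make_shared<ColumnarBatchContext>(batch, batch->fields(), pool);

// Row level: a lightweight view over that context; the context keeps the
// field arrays (and their cached raw pointers) alive for the whole batch.
ColumnarRowRef row(ctx, /*row_index=*/0);
auto kind = row.GetRowKind();  // metadata accessors mirror ColumnarRow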

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Summary per file:

  • src/paimon/core/io/key_value_in_memory_record_reader.h: Add batch-level key/value ColumnarBatchContext members.
  • src/paimon/core/io/key_value_in_memory_record_reader.cpp: Switch key/value materialization to ColumnarRowRef; init/reset contexts in batch lifecycle.
  • src/paimon/core/io/key_value_data_file_record_reader.h: Add batch-level key/value ColumnarBatchContext members; remove ColumnarRow include.
  • src/paimon/core/io/key_value_data_file_record_reader.cpp: Switch key/value materialization to ColumnarRowRef; init/reset contexts per batch.
  • src/paimon/common/data/columnar/columnar_row_ref.h: Add new row-view type backed by shared batch context.
  • src/paimon/common/data/columnar/columnar_row_ref.cpp: Implement ColumnarRowRef accessors (decimal/timestamp/nested/list/map).
  • src/paimon/common/data/columnar/columnar_batch_context.h: Add shared context object that owns field arrays and cached raw pointers.
  • src/paimon/common/data/columnar/columnar_array.cpp: Return ColumnarRowRef for nested struct row access.
  • src/paimon/common/data/columnar/columnar_row_test.cpp: Add basic unit test for ColumnarRowRef.
  • src/paimon/CMakeLists.txt: Build new columnar_row_ref.cpp.


    ASSERT_TRUE(updated_kind.ok());
    ASSERT_EQ(updated_kind.value(), RowKind::Delete());
}

Copilot AI Feb 13, 2026

The new ColumnarRowRefTest only validates simple int/string access and RowKind. Since ColumnarRowRef is now used by multiple readers and has separate implementations for nested row/list/map, decimal, and timestamp access, it would be good to extend test coverage to include at least one nested struct/list/map case (and ideally decimal/timestamp) to ensure parity with ColumnarRow behavior.

Suggested change
TEST(ColumnarRowRefTest, TestComplexAndNestedType) {
    auto pool = GetDefaultPool();
    // Schema with decimal, timestamp, nested struct, list, and map fields.
    std::shared_ptr<arrow::DataType> target_type = arrow::struct_({
        arrow::field("dec", arrow::decimal128(10, 3)),
        arrow::field("ts", arrow::timestamp(arrow::TimeUnit::NANO)),
        arrow::field("nested", arrow::struct_({arrow::field("inner", arrow::int64())})),
        arrow::field("lst", arrow::list(arrow::int32())),
        arrow::field("mp", arrow::map(arrow::int32(), arrow::int64())),
    });
    auto f_dec = arrow::ipc::internal::json::ArrayFromJSON(
                     arrow::decimal128(10, 3), R"(["123.456", "789.000"])")
                     .ValueOrDie();
    auto f_ts = arrow::ipc::internal::json::ArrayFromJSON(
                    arrow::timestamp(arrow::TimeUnit::NANO), R"([0, 1000])")
                    .ValueOrDie();
    auto nested_type = arrow::struct_({arrow::field("inner", arrow::int64())});
    auto f_nested = arrow::ipc::internal::json::ArrayFromJSON(
                        nested_type, R"([{"inner": 10}, {"inner": 20}])")
                        .ValueOrDie();
    auto f_list = arrow::ipc::internal::json::ArrayFromJSON(
                      arrow::list(arrow::int32()), R"([[1, 2], [3, 4]])")
                      .ValueOrDie();
    auto f_map = arrow::ipc::internal::json::ArrayFromJSON(
                     arrow::map(arrow::int32(), arrow::int64()),
                     R"([[{"key": 1, "value": 10}], [{"key": 2, "value": 20}]])")
                     .ValueOrDie();
    auto data =
        arrow::StructArray::Make({f_dec, f_ts, f_nested, f_list, f_map}, target_type->fields())
            .ValueOrDie();
    auto ctx = std::make_shared<ColumnarBatchContext>(data, data->fields(), pool);
    // Use the second row (index 1) to exercise non-zero offsets into nested values.
    ColumnarRowRef row(ctx, 1);
    ASSERT_EQ(row.GetFieldCount(), 5);
    auto row_kind2 = row.GetRowKind();
    ASSERT_TRUE(row_kind2.ok());
    ASSERT_EQ(row_kind2.value(), RowKind::Insert());
    // Change the RowKind to verify metadata handling on complex schemas.
    row.SetRowKind(RowKind::UpdateAfter());
    auto updated_kind2 = row.GetRowKind();
    ASSERT_TRUE(updated_kind2.ok());
    ASSERT_EQ(updated_kind2.value(), RowKind::UpdateAfter());
}

Comment on lines +37 to +38
    return Decimal(precision, scale,
                   static_cast<Decimal::int128_t>(decimal.high_bits()) << 64 | decimal.low_bits());

Copilot AI Feb 13, 2026

ColumnarRowRef::GetDecimal reconstructs the 128-bit value via static_cast<Decimal::int128_t>(decimal.high_bits()) << 64 | decimal.low_bits(). If high_bits() is negative (e.g., for negative decimals), left-shifting a negative signed value is undefined behavior in C++. Consider assembling using an unsigned intermediate (Decimal::uint128_t) and then casting to Decimal::int128_t to avoid UB while preserving the bit pattern.

Suggested change
-    return Decimal(precision, scale,
-                   static_cast<Decimal::int128_t>(decimal.high_bits()) << 64 | decimal.low_bits());
+    Decimal::uint128_t value =
+        (static_cast<Decimal::uint128_t>(decimal.high_bits()) << 64) |
+        static_cast<Decimal::uint128_t>(decimal.low_bits());
+    return Decimal(precision, scale, static_cast<Decimal::int128_t>(value));
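As a standalone illustration of why the unsigned intermediate avoids the undefined behavior; this is generic C++ using the GCC/Clang __int128 extension, not code from this repository:

#include <cstdint>

// Reassemble a signed 128-bit value from its high/low 64-bit halves.
// All shifting happens on unsigned types, so no negative signed value is
// ever shifted; the final cast converts back to the signed representation
// without any shift UB.
__int128 Assemble(int64_t high, uint64_t low) {
    unsigned __int128 bits =
        (static_cast<unsigned __int128>(static_cast<uint64_t>(high)) << 64) |
        static_cast<unsigned __int128>(low);
    return static_cast<__int128>(bits);
}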

Comment on lines +31 to +45
struct ColumnarBatchContext {
    ColumnarBatchContext(const std::shared_ptr<arrow::StructArray>& struct_array_in,
                         const arrow::ArrayVector& field_arrays_in,
                         const std::shared_ptr<MemoryPool>& pool_in)
        : struct_array(struct_array_in), pool(pool_in), field_arrays(field_arrays_in) {
        array_ptrs.reserve(field_arrays.size());
        for (const auto& array : field_arrays) {
            array_ptrs.push_back(array.get());
        }
    }

    std::shared_ptr<arrow::StructArray> struct_array;
    std::shared_ptr<MemoryPool> pool;
    arrow::ArrayVector field_arrays;
    std::vector<const arrow::Array*> array_ptrs;

Copilot AI Feb 13, 2026

ColumnarBatchContext exposes field_arrays and array_ptrs as public mutable members, but array_ptrs is derived from field_arrays only once in the constructor. If any caller mutates field_arrays later (push/erase/reassign), array_ptrs can become out of sync and lead to invalid pointer dereferences. Consider making these members private/const (or storing only field_arrays and computing array_ptrs on demand) to enforce the invariant.

Suggested change
-struct ColumnarBatchContext {
-    ColumnarBatchContext(const std::shared_ptr<arrow::StructArray>& struct_array_in,
-                         const arrow::ArrayVector& field_arrays_in,
-                         const std::shared_ptr<MemoryPool>& pool_in)
-        : struct_array(struct_array_in), pool(pool_in), field_arrays(field_arrays_in) {
-        array_ptrs.reserve(field_arrays.size());
-        for (const auto& array : field_arrays) {
-            array_ptrs.push_back(array.get());
-        }
-    }
-    std::shared_ptr<arrow::StructArray> struct_array;
-    std::shared_ptr<MemoryPool> pool;
-    arrow::ArrayVector field_arrays;
-    std::vector<const arrow::Array*> array_ptrs;
+class ColumnarBatchContext {
+public:
+    ColumnarBatchContext(const std::shared_ptr<arrow::StructArray>& struct_array_in,
+                         const arrow::ArrayVector& field_arrays_in,
+                         const std::shared_ptr<MemoryPool>& pool_in)
+        : struct_array(struct_array_in), pool(pool_in), field_arrays_(field_arrays_in) {
+        array_ptrs_.reserve(field_arrays_.size());
+        for (const auto& array : field_arrays_) {
+            array_ptrs_.push_back(array.get());
+        }
+    }
+    std::shared_ptr<arrow::StructArray> struct_array;
+    std::shared_ptr<MemoryPool> pool;
+    const arrow::ArrayVector& field_arrays() const { return field_arrays_; }
+    const std::vector<const arrow::Array*>& array_ptrs() const { return array_ptrs_; }
+private:
+    arrow::ArrayVector field_arrays_;
+    std::vector<const arrow::Array*> array_ptrs_;

