
refactor(core): apply ColumnarRowRef in KeyValueInMemoryRecordReader #144

Open
xylaaaaa wants to merge 4 commits into alibaba:main from xylaaaaa:followup/in-memory-row-ref

Conversation

@xylaaaaa (Contributor)

Purpose

Linked issue: close #xxx

This PR applies the ColumnarRowRef optimization to KeyValueInMemoryRecordReader (the PK write path) to reduce per-row row-view construction overhead and to align its behavior with KeyValueDataFileRecordReader.

Changes:

  • Replace per-row ColumnarRow construction with ColumnarRowRef for both key and value materialization in KeyValueInMemoryRecordReader::Iterator::Next.
  • Add batch-level contexts key_ctx_ and value_ctx_ to KeyValueInMemoryRecordReader.
  • Initialize the contexts in NextBatch() and reset them in Close().
  • Keep lifecycle semantics unchanged: both the key and value contexts hold value_struct_array_, matching the previous behavior where the key/value rows held the parent struct array. A rough sketch of the resulting flow follows this list.
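A rough sketch of that flow, for orientation only. ColumnarBatchContext, ColumnarRowRef, key_ctx_, value_ctx_, value_struct_array_, NextBatch(), and Close() are names from this PR; key_fields_, value_fields_, pool_, and row_index are illustrative placeholders, not the actual reader members:

// NextBatch(): build the two batch-level contexts once per batch.
key_ctx_ = std::make_shared<ColumnarBatchContext>(value_struct_array_, key_fields_, pool_);
value_ctx_ = std::make_shared<ColumnarBatchContext>(value_struct_array_, value_fields_, pool_);

// Iterator::Next(): per-row work shrinks to constructing lightweight views
// over the shared contexts instead of building ColumnarRow objects each time.
ColumnarRowRef key(key_ctx_, row_index);
ColumnarRowRef value(value_ctx_, row_index);

// Close(): drop the batch references along with the rest of the reader state.
key_ctx_.reset();
value_ctx_.reset();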

Tests

  • cmake --build build --target paimon-core-test -j60
  • ./build/relwithdebinfo/paimon-core-test --gtest_filter="KeyValueInMemoryRecordReaderTest.*"

All listed tests passed locally.

API and Format

No API change in include/.
No storage format or protocol change.

Documentation

No documentation update is required, since this is an internal refactor/performance optimization and introduces no new user-facing feature.

Commits

  • Add ColumnarBatchContext and ColumnarRowRef to reduce per-row construction overhead in KeyValueDataFileRecordReader.
  • Switch key/value row construction in KeyValueDataFileRecordReader to ColumnarRowRef and manage batch-level contexts through reader lifecycle.
  • Add unit coverage for ColumnarRowRef basic access and RowKind behavior.
Copilot AI review requested due to automatic review settings on February 13, 2026 at 06:28

Copilot AI left a comment

Pull request overview

This PR refactors the core KeyValue record readers to use a shared, batch-level ColumnarBatchContext with ColumnarRowRef row views, reducing per-row construction overhead in both in-memory (PK write path) and data-file readers.

Changes:

  • Replace per-row ColumnarRow creation with ColumnarRowRef in KeyValueInMemoryRecordReader and KeyValueDataFileRecordReader.
  • Introduce ColumnarBatchContext to cache field arrays and their raw pointers, and manage the per-batch lifecycle via reader-owned contexts (a short illustration follows this list).
  • Add initial unit coverage for ColumnarRowRef and update ColumnarArray::GetRow to return ColumnarRowRef.
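For orientation, a minimal usage illustration assembled only from the constructor shapes that appear later in this review; the variable names (batch, pool, row index) are illustrative, not from the codebase:

// Batch level: one shared context per Arrow StructArray batch.
auto ctx = std::make_shared<ColumnarBatchContext>(batch, batch->fields(), pool);

// Row level: a lightweight view over that context; the context keeps the
// field arrays (and their cached raw pointers) alive for the whole batch.
ColumnarRowRef row(ctx, /*row_index=*/0);
auto kind = row.GetRowKind();  // metadata accessors mirror ColumnarRow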

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Summary per file:

  • src/paimon/core/io/key_value_in_memory_record_reader.h: Add batch-level key/value ColumnarBatchContext members.
  • src/paimon/core/io/key_value_in_memory_record_reader.cpp: Switch key/value materialization to ColumnarRowRef; init/reset contexts in batch lifecycle.
  • src/paimon/core/io/key_value_data_file_record_reader.h: Add batch-level key/value ColumnarBatchContext members; remove ColumnarRow include.
  • src/paimon/core/io/key_value_data_file_record_reader.cpp: Switch key/value materialization to ColumnarRowRef; init/reset contexts per batch.
  • src/paimon/common/data/columnar/columnar_row_ref.h: Add new row-view type backed by shared batch context.
  • src/paimon/common/data/columnar/columnar_row_ref.cpp: Implement ColumnarRowRef accessors (decimal/timestamp/nested/list/map).
  • src/paimon/common/data/columnar/columnar_batch_context.h: Add shared context object that owns field arrays and cached raw pointers.
  • src/paimon/common/data/columnar/columnar_array.cpp: Return ColumnarRowRef for nested struct row access.
  • src/paimon/common/data/columnar/columnar_row_test.cpp: Add basic unit test for ColumnarRowRef.
  • src/paimon/CMakeLists.txt: Build new columnar_row_ref.cpp.


    ASSERT_TRUE(updated_kind.ok());
    ASSERT_EQ(updated_kind.value(), RowKind::Delete());
}

Copilot AI Feb 13, 2026

The new ColumnarRowRefTest only validates simple int/string access and RowKind. Since ColumnarRowRef is now used by multiple readers and has separate implementations for nested row/list/map, decimal, and timestamp access, it would be good to extend test coverage to include at least one nested struct/list/map case (and ideally decimal/timestamp) to ensure parity with ColumnarRow behavior.

Suggested change
TEST(ColumnarRowRefTest, TestComplexAndNestedType) {
    auto pool = GetDefaultPool();
    // Schema with decimal, timestamp, nested struct, list, and map fields.
    std::shared_ptr<arrow::DataType> target_type = arrow::struct_({
        arrow::field("dec", arrow::decimal128(10, 3)),
        arrow::field("ts", arrow::timestamp(arrow::TimeUnit::NANO)),
        arrow::field("nested", arrow::struct_({arrow::field("inner", arrow::int64())})),
        arrow::field("lst", arrow::list(arrow::int32())),
        arrow::field("mp", arrow::map(arrow::int32(), arrow::int64())),
    });
    auto f_dec = arrow::ipc::internal::json::ArrayFromJSON(
                     arrow::decimal128(10, 3), R"(["123.456", "789.000"])")
                     .ValueOrDie();
    auto f_ts = arrow::ipc::internal::json::ArrayFromJSON(
                    arrow::timestamp(arrow::TimeUnit::NANO), R"([0, 1000])")
                    .ValueOrDie();
    auto nested_type = arrow::struct_({arrow::field("inner", arrow::int64())});
    auto f_nested = arrow::ipc::internal::json::ArrayFromJSON(
                        nested_type, R"([{"inner": 10}, {"inner": 20}])")
                        .ValueOrDie();
    auto f_list = arrow::ipc::internal::json::ArrayFromJSON(
                      arrow::list(arrow::int32()), R"([[1, 2], [3, 4]])")
                      .ValueOrDie();
    auto f_map = arrow::ipc::internal::json::ArrayFromJSON(
                     arrow::map(arrow::int32(), arrow::int64()),
                     R"([[{"key": 1, "value": 10}], [{"key": 2, "value": 20}]])")
                     .ValueOrDie();
    auto data =
        arrow::StructArray::Make({f_dec, f_ts, f_nested, f_list, f_map}, target_type->fields())
            .ValueOrDie();
    auto ctx = std::make_shared<ColumnarBatchContext>(data, data->fields(), pool);
    // Use the second row (index 1) to exercise non-zero offsets into nested values.
    ColumnarRowRef row(ctx, 1);
    ASSERT_EQ(row.GetFieldCount(), 5);
    auto row_kind2 = row.GetRowKind();
    ASSERT_TRUE(row_kind2.ok());
    ASSERT_EQ(row_kind2.value(), RowKind::Insert());
    // Change the RowKind to verify metadata handling on complex schemas.
    row.SetRowKind(RowKind::UpdateAfter());
    auto updated_kind2 = row.GetRowKind();
    ASSERT_TRUE(updated_kind2.ok());
    ASSERT_EQ(updated_kind2.value(), RowKind::UpdateAfter());
}

Comment on lines +37 to +38
    return Decimal(precision, scale,
                   static_cast<Decimal::int128_t>(decimal.high_bits()) << 64 | decimal.low_bits());

Copilot AI Feb 13, 2026

ColumnarRowRef::GetDecimal reconstructs the 128-bit value via static_cast<Decimal::int128_t>(decimal.high_bits()) << 64 | decimal.low_bits(). If high_bits() is negative (e.g., for negative decimals), left-shifting a negative signed value is undefined behavior in C++. Consider assembling using an unsigned intermediate (Decimal::uint128_t) and then casting to Decimal::int128_t to avoid UB while preserving the bit pattern.

Suggested change
-    return Decimal(precision, scale,
-                   static_cast<Decimal::int128_t>(decimal.high_bits()) << 64 | decimal.low_bits());
+    Decimal::uint128_t value =
+        (static_cast<Decimal::uint128_t>(decimal.high_bits()) << 64) |
+        static_cast<Decimal::uint128_t>(decimal.low_bits());
+    return Decimal(precision, scale, static_cast<Decimal::int128_t>(value));
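As a standalone illustration of why the unsigned intermediate avoids the undefined behavior; this is generic C++ using the GCC/Clang __int128 extension, not code from this repository:

#include <cstdint>

// Reassemble a signed 128-bit value from its high/low 64-bit halves.
// All shifting happens on unsigned types, so no negative signed value is
// ever shifted; the final cast converts back to the signed representation
// without any shift UB.
__int128 Assemble(int64_t high, uint64_t low) {
    unsigned __int128 bits =
        (static_cast<unsigned __int128>(static_cast<uint64_t>(high)) << 64) |
        static_cast<unsigned __int128>(low);
    return static_cast<__int128>(bits);
}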

Comment on lines +31 to +45
struct ColumnarBatchContext {
    ColumnarBatchContext(const std::shared_ptr<arrow::StructArray>& struct_array_in,
                         const arrow::ArrayVector& field_arrays_in,
                         const std::shared_ptr<MemoryPool>& pool_in)
        : struct_array(struct_array_in), pool(pool_in), field_arrays(field_arrays_in) {
        array_ptrs.reserve(field_arrays.size());
        for (const auto& array : field_arrays) {
            array_ptrs.push_back(array.get());
        }
    }

    std::shared_ptr<arrow::StructArray> struct_array;
    std::shared_ptr<MemoryPool> pool;
    arrow::ArrayVector field_arrays;
    std::vector<const arrow::Array*> array_ptrs;

Copilot AI Feb 13, 2026

ColumnarBatchContext exposes field_arrays and array_ptrs as public mutable members, but array_ptrs is derived from field_arrays only once in the constructor. If any caller mutates field_arrays later (push/erase/reassign), array_ptrs can become out of sync and lead to invalid pointer dereferences. Consider making these members private/const (or storing only field_arrays and computing array_ptrs on demand) to enforce the invariant.

Suggested change
-struct ColumnarBatchContext {
-    ColumnarBatchContext(const std::shared_ptr<arrow::StructArray>& struct_array_in,
-                         const arrow::ArrayVector& field_arrays_in,
-                         const std::shared_ptr<MemoryPool>& pool_in)
-        : struct_array(struct_array_in), pool(pool_in), field_arrays(field_arrays_in) {
-        array_ptrs.reserve(field_arrays.size());
-        for (const auto& array : field_arrays) {
-            array_ptrs.push_back(array.get());
-        }
-    }
-    std::shared_ptr<arrow::StructArray> struct_array;
-    std::shared_ptr<MemoryPool> pool;
-    arrow::ArrayVector field_arrays;
-    std::vector<const arrow::Array*> array_ptrs;
+class ColumnarBatchContext {
+public:
+    ColumnarBatchContext(const std::shared_ptr<arrow::StructArray>& struct_array_in,
+                         const arrow::ArrayVector& field_arrays_in,
+                         const std::shared_ptr<MemoryPool>& pool_in)
+        : struct_array(struct_array_in), pool(pool_in), field_arrays_(field_arrays_in) {
+        array_ptrs_.reserve(field_arrays_.size());
+        for (const auto& array : field_arrays_) {
+            array_ptrs_.push_back(array.get());
+        }
+    }
+    std::shared_ptr<arrow::StructArray> struct_array;
+    std::shared_ptr<MemoryPool> pool;
+    const arrow::ArrayVector& field_arrays() const { return field_arrays_; }
+    const std::vector<const arrow::Array*>& array_ptrs() const { return array_ptrs_; }
+private:
+    arrow::ArrayVector field_arrays_;
+    std::vector<const arrow::Array*> array_ptrs_;

