feat(core): introduce ColumnarRowRef with shared batch context#120
feat(core): introduce ColumnarRowRef with shared batch context#120xylaaaaa wants to merge 3 commits intoalibaba:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces a performance optimization for columnar data access by replacing per-row ColumnarRow construction with a shared batch context approach. The main goal is to reduce per-row overhead in KeyValueDataFileRecordReader by avoiding repeated construction of row-level column views.
Changes:
- Introduced
ColumnarBatchContextstruct to hold batch-level shared state (struct array, array vector, memory pool) - Added
ColumnarRowRefclass as a lightweight row view backed by shared batch context - Refactored
KeyValueDataFileRecordReaderto useColumnarRowRefinstead ofColumnarRowfor key/value materialization
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/paimon/common/data/columnar/columnar_batch_context.h | New struct to hold batch-level shared state (struct array, array vector holder, and memory pool) |
| src/paimon/common/data/columnar/columnar_row_ref.h | New lightweight row view class that references shared batch context instead of maintaining per-row copies |
| src/paimon/common/data/columnar/columnar_row_ref.cpp | Implementation of ColumnarRowRef methods for accessing nested types (decimal, timestamp, row, array, map) |
| src/paimon/core/io/key_value_data_file_record_reader.h | Added forward declaration and member variables for key_ctx_ and value_ctx_, removed unused columnar_row.h include |
| src/paimon/core/io/key_value_data_file_record_reader.cpp | Updated to use ColumnarRowRef with shared contexts; added context creation and reset handling |
| src/paimon/common/data/columnar/columnar_row_test.cpp | Added basic unit test for ColumnarRowRef covering field access and RowKind operations |
| src/paimon/CMakeLists.txt | Registered columnar_row_ref.cpp in the build configuration |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // as key is only used in merge sort, do not hold the data in ColumnarRowRef | ||
| auto key = std::make_unique<ColumnarRowRef>(reader_->key_ctx_, cursor_); |
There was a problem hiding this comment.
The comment states "do not hold the data in ColumnarRowRef" but this is misleading. Both key and value ColumnarRowRef instances hold references to the data through the shared batch context. The actual difference is that key_ctx_ is created with nullptr for struct_array (line 138) while value_ctx_ holds value_struct_array_ (line 139), which means the value context explicitly keeps the parent struct array alive. The comment should be updated to reflect this distinction more accurately, for example: "as key is only used in merge sort, key context does not hold parent struct array".
| // as value is used in merge sort and projection (maybe async and multi-thread), hold the data | ||
| // in ColumnarRow | ||
| auto value = std::make_unique<ColumnarRow>(reader_->value_struct_array_, reader_->value_fields_, | ||
| reader_->pool_, cursor_); | ||
| // in ColumnarRowRef | ||
| auto value = std::make_unique<ColumnarRowRef>(reader_->value_ctx_, cursor_); |
There was a problem hiding this comment.
The comment states "hold the data in ColumnarRowRef" which is technically correct but doesn't clearly explain the distinction from the key case. The key difference is that value_ctx_ is created with value_struct_array_ as the first argument (line 139), which explicitly keeps the parent struct array alive, whereas key_ctx_ uses nullptr (line 138). Consider updating the comment to: "as value is used in merge sort and projection (maybe async and multi-thread), value context holds the parent struct array to ensure data remains valid".
| @@ -0,0 +1,49 @@ | |||
| /* | |||
| * Copyright 2024-present Alibaba Inc. | |||
There was a problem hiding this comment.
Copyright 2026-present Alibaba Inc.
| ColumnarBatchContext(std::shared_ptr<arrow::StructArray> struct_array_in, | ||
| const arrow::ArrayVector& array_vec_in, | ||
| std::shared_ptr<MemoryPool> pool_in) | ||
| : struct_array(std::move(struct_array_in)), |
There was a problem hiding this comment.
Minor suggestion: To keep parameter passing style consistent, could we pass all std::shared_ptr and container arguments (like struct_array_in, array_vec_in, pool_in) as const T&?
| assert(list_array); | ||
| int32_t offset = list_array->value_offset(row_id_); | ||
| int32_t length = list_array->value_length(row_id_); | ||
| return std::make_shared<ColumnarArray>(list_array->values(), ctx_->pool, offset, length); |
There was a problem hiding this comment.
If a ColumnarArray contains nested rows (e.g., an array of structs), should it return a ColumnarRowRef for inner row access?
| class Metrics; | ||
| struct ColumnarBatchContext; | ||
|
|
||
| // Convert the arrow array of data file into a KeyValue object iterator (parsing SEQUENCE_NUMBER and |
There was a problem hiding this comment.
The KeyValueInMemoryRecordReader is currently used in the PK table write path — it seems this optimization could also be applied there? It might be a good candidate for a follow-up commit.
| std::shared_ptr<MemoryPool> pool; | ||
| arrow::ArrayVector array_vec_holder; | ||
| std::vector<const arrow::Array*> array_vec; | ||
| }; |
There was a problem hiding this comment.
Previously, the struct_array_ field in ColumnarRow was used solely as a data holder and could not be directly utilized, since it might contain fields beyond those present in array_vec_.
In the new ColumnarBatchContext, however, array_vec_holder appears to serve as the primary source of columnar data. Given that a file may contain thousands of fields, maintaining separate holders like array_vec_holder could lead to unnecessary shared_ptr increments.
Suggestion: Consider using struct_array as the canonical holder and removing array_vec_holder to simplify the design and eliminate redundancy.
| std::shared_ptr<InternalRow> ColumnarRowRef::GetRow(int32_t pos, int32_t num_fields) const { | ||
| auto struct_array = | ||
| arrow::internal::checked_pointer_cast<arrow::StructArray>(ctx_->array_vec_holder[pos]); | ||
| assert(struct_array); |
There was a problem hiding this comment.
May change array_vec_holderto array_vec for consistent?
Add ColumnarBatchContext and ColumnarRowRef to reduce per-row construction overhead in KeyValueDataFileRecordReader. Switch key/value row construction in KeyValueDataFileRecordReader to ColumnarRowRef and manage batch-level contexts through reader lifecycle. Add unit coverage for ColumnarRowRef basic access and RowKind behavior.
48b8cbb to
50fd013
Compare
Background
KeyValueDataFileRecordReaderconstructsColumnarRowobjects per row, which repeatedly rebuilds row-level column views and introduces avoidable per-row overhead.Changes
ColumnarBatchContextto hold batch-level shared state (StructArray,ArrayVector,MemoryPool).ColumnarRowRefas a row view implementation backed by shared batch context.KeyValueDataFileRecordReaderfromColumnarRowtoColumnarRowReffor key/value row materialization.key_ctx_/value_ctx_lifecycle management and reset handling in reader state.ColumnarRowRefbasic field access andRowKindbehavior.columnar_row_ref.cppinsrc/paimon/CMakeLists.txt.Tests
cmake --build build --target paimon-common-test paimon-core-test -j8./build/relwithdebinfo/paimon-common-test --gtest_filter="ColumnarRowTest.*:ColumnarRowRefTest.*"./build/relwithdebinfo/paimon-core-test --gtest_filter="KeyValueDataFileRecordReaderTest.*"./build/relwithdebinfo/paimon-core-test --gtest_filter="*KeyValueProjectionReaderTest*"All tests passed locally.
Risk / Compatibility