refactor(memory): Refactor memory pool adaptor#119
refactor(memory): Refactor memory pool adaptor#119Eyizoha wants to merge 12 commits intoalibaba:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR refactors the memory pool adaptor pattern to address a critical lifecycle coupling issue between BatchReader and ReadBatch. Previously, adaptors were created and held by readers, requiring readers to outlive the batches they returned to prevent dangling pointer access. The refactoring moves adaptor ownership into the MemoryPool itself, ensuring consistent lifecycles and enabling batches to safely outlive readers.
Changes:
- Introduces
AsArrowMemoryPool()andAsOrcMemoryPool()methods to MemoryPool for lazy, thread-safe adaptor creation - Implements factory pattern for adaptor creation with support for pluggable format libraries
- Refactors all Arrow and ORC format code to use new memory pool API instead of creating separate adaptors
- Removes approximately 100+ adaptor-related member variables and includes across the codebase
Reviewed changes
Copilot reviewed 83 out of 83 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| include/paimon/memory/memory_pool.h | Adds AsArrowMemoryPool() and AsOrcMemoryPool() methods with thread-safe lazy initialization |
| src/paimon/common/memory/memory_pool.cpp | Implements adaptor holder and lazy creation logic using std::call_once |
| src/paimon/common/memory/memory_pool_adaptor_*.{h,cpp} | New infrastructure for factory pattern and adaptor management |
| src/paimon/common/utils/arrow/arrow_memory_pool_adaptor*.{h,cpp} | Simplifies Arrow adaptor to take reference instead of shared_ptr |
| src/paimon/format/orc/orc_memory_pool_adaptor*.{h,cpp} | Simplifies ORC adaptor to take reference instead of shared_ptr |
| src/paimon/format/parquet/*.cpp | Updates all Parquet code to use pool->AsArrowMemoryPool() |
| src/paimon/format/orc/*.cpp | Updates all ORC code to use pool->AsOrcMemoryPool() |
| src/paimon/format/avro/*.cpp | Updates Avro code to use new memory pool API |
| src/paimon/format/blob/*.cpp | Updates Blob format code to use new memory pool API |
| src/paimon/core/**/*.cpp | Updates core components to use new memory pool API |
| src/paimon/common/reader/*.cpp | Updates reader infrastructure to use new memory pool API |
| examples/read_write_demo.cpp | Demonstrates new decoupling by explicitly closing reader before using batches |
Comments suppressed due to low confidence (1)
src/paimon/common/utils/arrow/arrow_memory_pool_adaptor_factor.h:2
- The copyright year in the header is listed as 2026, but the current date is February 2026. This should likely be "2024-present" or "2025-present" to match the existing codebase pattern, unless this file was genuinely created in 2026. Please verify and update the copyright year to be consistent with the actual file creation date and codebase conventions.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -0,0 +1,30 @@ | |||
| /* | |||
| * Copyright 2026-present Alibaba Inc. | |||
There was a problem hiding this comment.
The copyright year in the header is listed as 2026, but the current date is February 2026. This should likely be "2024-present" or "2025-present" to match the existing codebase pattern, unless this file was genuinely created in 2026. Please verify and update the copyright year to be consistent with the actual file creation date and codebase conventions.
| @@ -0,0 +1,34 @@ | |||
| /* | |||
| * Copyright 2026-present Alibaba Inc. | |||
There was a problem hiding this comment.
The copyright year in the header is listed as 2026, but the current date is February 2026. This should likely be "2024-present" or "2025-present" to match the existing codebase pattern, unless this file was genuinely created in 2026. Please verify and update the copyright year to be consistent with the actual file creation date and codebase conventions.
| @@ -0,0 +1,34 @@ | |||
| /* | |||
| * Copyright 2026-present Alibaba Inc. | |||
There was a problem hiding this comment.
The copyright year in the header is listed as 2026, but the current date is February 2026. This should likely be "2024-present" or "2025-present" to match the existing codebase pattern, unless this file was genuinely created in 2026. Please verify and update the copyright year to be consistent with the actual file creation date and codebase conventions.
| * Copyright 2026-present Alibaba Inc. | |
| * Copyright 2024-present Alibaba Inc. |
| @@ -0,0 +1,43 @@ | |||
| /* | |||
| * Copyright 2026-present Alibaba Inc. | |||
There was a problem hiding this comment.
The copyright year in the header is listed as 2026, but the current date is February 2026. This should likely be "2024-present" or "2025-present" to match the existing codebase pattern, unless this file was genuinely created in 2026. Please verify and update the copyright year to be consistent with the actual file creation date and codebase conventions.
| * Copyright 2026-present Alibaba Inc. | |
| * Copyright 2024-present Alibaba Inc. |
| @@ -0,0 +1,35 @@ | |||
| /* | |||
| * Copyright 2026-present Alibaba Inc. | |||
There was a problem hiding this comment.
The copyright year in the header is listed as 2026, but the current date is February 2026. This should likely be "2024-present" or "2025-present" to match the existing codebase pattern, unless this file was genuinely created in 2026. Please verify and update the copyright year to be consistent with the actual file creation date and codebase conventions.
| * Copyright 2026-present Alibaba Inc. | |
| * Copyright 2024-present Alibaba Inc. |
| auto orc_pool = std::make_shared<OrcMemoryPool>(pool); | ||
| std::shared_ptr<arrow::MemoryPool> arrow_pool = GetArrowPool(pool); | ||
| reader_options.setMemoryPool(*orc_pool); | ||
| reader_options.setMemoryPool(*pool->AsOrcMemoryPool()); |
There was a problem hiding this comment.
The code dereferences the result of AsOrcMemoryPool() without checking for nullptr. According to the documentation in memory_pool.h, AsOrcMemoryPool() can return nullptr if the ORC format library is not linked. This dereferencing will cause undefined behavior (likely a crash) if the ORC adaptor factory is not registered. Consider adding a nullptr check and returning an appropriate error if the ORC memory pool is unavailable.
| @@ -0,0 +1,43 @@ | |||
| /* | |||
| * Copyright 2026-present Alibaba Inc. | |||
There was a problem hiding this comment.
The copyright year in the header is listed as 2026, but the current date is February 2026. This should likely be "2024-present" or "2025-present" to match the existing codebase pattern, unless this file was genuinely created in 2026. Please verify and update the copyright year to be consistent with the actual file creation date and codebase conventions.
| * Copyright 2026-present Alibaba Inc. | |
| * Copyright 2025-present Alibaba Inc. |
| @@ -0,0 +1,43 @@ | |||
| /* | |||
| * Copyright 2026-present Alibaba Inc. | |||
There was a problem hiding this comment.
The copyright year in the header is listed as 2026, but the current date is February 2026. This should likely be "2024-present" or "2025-present" to match the existing codebase pattern, unless this file was genuinely created in 2026. Please verify and update the copyright year to be consistent with the actual file creation date and codebase conventions.
| * Copyright 2026-present Alibaba Inc. | |
| * Copyright 2024-present Alibaba Inc. |
| ::orc::ReaderOptions reader_options; | ||
| auto orc_pool = std::make_shared<OrcMemoryPool>(pool); | ||
| reader_options.setMemoryPool(*orc_pool); | ||
| reader_options.setMemoryPool(*pool->AsOrcMemoryPool()); |
There was a problem hiding this comment.
The code dereferences the result of AsOrcMemoryPool() without checking for nullptr. According to the documentation in memory_pool.h, AsOrcMemoryPool() can return nullptr if the ORC format library is not linked. This dereferencing will cause undefined behavior (likely a crash) if the ORC adaptor factory is not registered. Consider adding a nullptr check and returning an appropriate error if the ORC memory pool is unavailable.
| if (pool) { | ||
| orc_memory_pool = std::make_shared<OrcMemoryPool>(pool); | ||
| writer_options.setMemoryPool(orc_memory_pool.get()); | ||
| writer_options.setMemoryPool(pool->AsOrcMemoryPool()); |
There was a problem hiding this comment.
The code dereferences the result of AsOrcMemoryPool() without checking for nullptr. According to the documentation in memory_pool.h, AsOrcMemoryPool() can return nullptr if the ORC format library is not linked. This dereferencing will cause undefined behavior (likely a crash) if the ORC adaptor factory is not registered. Consider adding a nullptr check and returning an appropriate error if the ORC memory pool is unavailable.
| /// If the ORC format library is not linked, this method returns nullptr. | ||
| /// @note Subclasses can override this method to provide custom ORC integration. | ||
| virtual ::orc::MemoryPool* AsOrcMemoryPool(); | ||
|
|
There was a problem hiding this comment.
The memory pool should not depend on plugin-specific types. Currently, it requires knowledge of concrete reader/format classes, which breaks modularity. A user's custom plugin shouldn't force changes in core memory headers. May need to refactor for better decoupling between memory management and data readers.
There was a problem hiding this comment.
Good point! You're absolutely right that we shouldn't add plugin-specific memory pool interfaces to the core headers. So I refactored it into a more generic template method AsSpecifiedMemoryPool, while keeping the original factory registration mechanism intact. Individual plugins can then wrap this method with convenience functions to simplify usage, like this example for ORC:
inline ::orc::MemoryPool* AsOrcMemoryPool(paimon::MemoryPool& pool) {
return pool.AsSpecifiedMemoryPool<::orc::MemoryPool>();
}
include/paimon/memory/memory_pool.h
Outdated
| /// | ||
| /// @return Pointer to the Arrow memory pool adapter. | ||
| /// @note Subclasses can override this method to provide custom Arrow integration. | ||
| virtual arrow::MemoryPool* AsArrowMemoryPool(); |
There was a problem hiding this comment.
paimon interface only has c arrow api, arrow::MemoryPool is C++ symbol which should not be in here
Purpose
Linked issue: close #126
Background and Motivation:
Currently, due to the Memory Pool Adaptor implementation, the lifecycle of BatchReader must outlive the ReadBatch it returns. Otherwise, batch destruction will cause dangling pointer access. Specifically, to adapt the Paimon memory pool to Arrow or ORC memory pool interfaces, the BatchReader internally wraps the incoming Paimon memory pool using an adaptor and passes it to Arrow or ORC APIs. As a result, the memory pool object referenced by PoolBuffer inside the batch is not the original memory pool object directly passed to the Paimon API, but rather the adaptor object held internally by the BatchReader.
This requires users to keep the BatchReader alive until all returned batches (specifically, the internal PoolBuffers) are released. This constraint couples Reader and Batch in a way that is unintuitive and error-prone for users, and may also limit certain resource optimization techniques.
Solution:
To address this issue, this PR refactors the memory pool adaptor pattern by placing the adaptor directly into the Paimon memory pool, ensuring they share consistent lifecycles. This allows readers to use the adaptors directly without creating and holding them independently. Adaptor creation is lazy/one-time.
This refactoring doesn't prevent existing plugins from continuing to use the original pattern—customizing their own adapters and having readers hold them directly. The downside is that this reintroduces coupling between the plugin's reader and batch components. So if you don't mind the coupling, existing plugins should work as-is without any changes.
Tests
Existing tests cover this change. Additionally, manual testing has verified the decoupling of reader and batch lifecycles.
API and Format
This change introduce a new public methods to the memory pool:
AsSpecifiedMemoryPool, returns a pointer to the specified adaptor.Additionally, introduces
MemoryPoolAdaptor<T>as a CRTP base class for implementing custom memory pool adaptors. Subclasses must:MemoryPoolAdaptor<Subclass>Identifier()methodMemoryPool&The change maintains forward compatibility.
Documentation
No new documentation required.