Skip to content

refactor(memory): Refactor memory pool adaptor#119

Open
Eyizoha wants to merge 12 commits intoalibaba:mainfrom
Eyizoha:mempool_adaptor
Open

refactor(memory): Refactor memory pool adaptor#119
Eyizoha wants to merge 12 commits intoalibaba:mainfrom
Eyizoha:mempool_adaptor

Conversation

@Eyizoha
Copy link
Contributor

@Eyizoha Eyizoha commented Feb 5, 2026

Purpose

Linked issue: close #126

Background and Motivation:

Currently, due to the Memory Pool Adaptor implementation, the lifecycle of BatchReader must outlive the ReadBatch it returns. Otherwise, batch destruction will cause dangling pointer access. Specifically, to adapt the Paimon memory pool to Arrow or ORC memory pool interfaces, the BatchReader internally wraps the incoming Paimon memory pool using an adaptor and passes it to Arrow or ORC APIs. As a result, the memory pool object referenced by PoolBuffer inside the batch is not the original memory pool object directly passed to the Paimon API, but rather the adaptor object held internally by the BatchReader.

This requires users to keep the BatchReader alive until all returned batches (specifically, the internal PoolBuffers) are released. This constraint couples Reader and Batch in a way that is unintuitive and error-prone for users, and may also limit certain resource optimization techniques.

Solution:

To address this issue, this PR refactors the memory pool adaptor pattern by placing the adaptor directly into the Paimon memory pool, ensuring they share consistent lifecycles. This allows readers to use the adaptors directly without creating and holding them independently. Adaptor creation is lazy/one-time.

This refactoring doesn't prevent existing plugins from continuing to use the original pattern—customizing their own adapters and having readers hold them directly. The downside is that this reintroduces coupling between the plugin's reader and batch components. So if you don't mind the coupling, existing plugins should work as-is without any changes.

Tests

Existing tests cover this change. Additionally, manual testing has verified the decoupling of reader and batch lifecycles.

API and Format

This change introduce a new public methods to the memory pool: AsSpecifiedMemoryPool, returns a pointer to the specified adaptor.
Additionally, introduces MemoryPoolAdaptor<T> as a CRTP base class for implementing custom memory pool adaptors. Subclasses must:

  • Inherit from MemoryPoolAdaptor<Subclass>
  • Implement a static Identifier() method
  • Provide a constructor accepting MemoryPool&

The change maintains forward compatibility.

Documentation

No new documentation required.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the memory pool adaptor pattern to address a critical lifecycle coupling issue between BatchReader and ReadBatch. Previously, adaptors were created and held by readers, requiring readers to outlive the batches they returned to prevent dangling pointer access. The refactoring moves adaptor ownership into the MemoryPool itself, ensuring consistent lifecycles and enabling batches to safely outlive readers.

Changes:

  • Introduces AsArrowMemoryPool() and AsOrcMemoryPool() methods to MemoryPool for lazy, thread-safe adaptor creation
  • Implements factory pattern for adaptor creation with support for pluggable format libraries
  • Refactors all Arrow and ORC format code to use new memory pool API instead of creating separate adaptors
  • Removes approximately 100+ adaptor-related member variables and includes across the codebase

Reviewed changes

Copilot reviewed 83 out of 83 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
include/paimon/memory/memory_pool.h Adds AsArrowMemoryPool() and AsOrcMemoryPool() methods with thread-safe lazy initialization
src/paimon/common/memory/memory_pool.cpp Implements adaptor holder and lazy creation logic using std::call_once
src/paimon/common/memory/memory_pool_adaptor_*.{h,cpp} New infrastructure for factory pattern and adaptor management
src/paimon/common/utils/arrow/arrow_memory_pool_adaptor*.{h,cpp} Simplifies Arrow adaptor to take reference instead of shared_ptr
src/paimon/format/orc/orc_memory_pool_adaptor*.{h,cpp} Simplifies ORC adaptor to take reference instead of shared_ptr
src/paimon/format/parquet/*.cpp Updates all Parquet code to use pool->AsArrowMemoryPool()
src/paimon/format/orc/*.cpp Updates all ORC code to use pool->AsOrcMemoryPool()
src/paimon/format/avro/*.cpp Updates Avro code to use new memory pool API
src/paimon/format/blob/*.cpp Updates Blob format code to use new memory pool API
src/paimon/core/**/*.cpp Updates core components to use new memory pool API
src/paimon/common/reader/*.cpp Updates reader infrastructure to use new memory pool API
examples/read_write_demo.cpp Demonstrates new decoupling by explicitly closing reader before using batches
Comments suppressed due to low confidence (1)

src/paimon/common/utils/arrow/arrow_memory_pool_adaptor_factor.h:2

  • The copyright year in the header is listed as 2026, but the current date is February 2026. This should likely be "2024-present" or "2025-present" to match the existing codebase pattern, unless this file was genuinely created in 2026. Please verify and update the copyright year to be consistent with the actual file creation date and codebase conventions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@@ -0,0 +1,30 @@
/*
* Copyright 2026-present Alibaba Inc.
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copyright year in the header is listed as 2026, but the current date is February 2026. This should likely be "2024-present" or "2025-present" to match the existing codebase pattern, unless this file was genuinely created in 2026. Please verify and update the copyright year to be consistent with the actual file creation date and codebase conventions.

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,34 @@
/*
* Copyright 2026-present Alibaba Inc.
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copyright year in the header is listed as 2026, but the current date is February 2026. This should likely be "2024-present" or "2025-present" to match the existing codebase pattern, unless this file was genuinely created in 2026. Please verify and update the copyright year to be consistent with the actual file creation date and codebase conventions.

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,34 @@
/*
* Copyright 2026-present Alibaba Inc.
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copyright year in the header is listed as 2026, but the current date is February 2026. This should likely be "2024-present" or "2025-present" to match the existing codebase pattern, unless this file was genuinely created in 2026. Please verify and update the copyright year to be consistent with the actual file creation date and codebase conventions.

Suggested change
* Copyright 2026-present Alibaba Inc.
* Copyright 2024-present Alibaba Inc.

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,43 @@
/*
* Copyright 2026-present Alibaba Inc.
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copyright year in the header is listed as 2026, but the current date is February 2026. This should likely be "2024-present" or "2025-present" to match the existing codebase pattern, unless this file was genuinely created in 2026. Please verify and update the copyright year to be consistent with the actual file creation date and codebase conventions.

Suggested change
* Copyright 2026-present Alibaba Inc.
* Copyright 2024-present Alibaba Inc.

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,35 @@
/*
* Copyright 2026-present Alibaba Inc.
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copyright year in the header is listed as 2026, but the current date is February 2026. This should likely be "2024-present" or "2025-present" to match the existing codebase pattern, unless this file was genuinely created in 2026. Please verify and update the copyright year to be consistent with the actual file creation date and codebase conventions.

Suggested change
* Copyright 2026-present Alibaba Inc.
* Copyright 2024-present Alibaba Inc.

Copilot uses AI. Check for mistakes.
auto orc_pool = std::make_shared<OrcMemoryPool>(pool);
std::shared_ptr<arrow::MemoryPool> arrow_pool = GetArrowPool(pool);
reader_options.setMemoryPool(*orc_pool);
reader_options.setMemoryPool(*pool->AsOrcMemoryPool());
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code dereferences the result of AsOrcMemoryPool() without checking for nullptr. According to the documentation in memory_pool.h, AsOrcMemoryPool() can return nullptr if the ORC format library is not linked. This dereferencing will cause undefined behavior (likely a crash) if the ORC adaptor factory is not registered. Consider adding a nullptr check and returning an appropriate error if the ORC memory pool is unavailable.

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,43 @@
/*
* Copyright 2026-present Alibaba Inc.
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copyright year in the header is listed as 2026, but the current date is February 2026. This should likely be "2024-present" or "2025-present" to match the existing codebase pattern, unless this file was genuinely created in 2026. Please verify and update the copyright year to be consistent with the actual file creation date and codebase conventions.

Suggested change
* Copyright 2026-present Alibaba Inc.
* Copyright 2025-present Alibaba Inc.

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,43 @@
/*
* Copyright 2026-present Alibaba Inc.
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copyright year in the header is listed as 2026, but the current date is February 2026. This should likely be "2024-present" or "2025-present" to match the existing codebase pattern, unless this file was genuinely created in 2026. Please verify and update the copyright year to be consistent with the actual file creation date and codebase conventions.

Suggested change
* Copyright 2026-present Alibaba Inc.
* Copyright 2024-present Alibaba Inc.

Copilot uses AI. Check for mistakes.
::orc::ReaderOptions reader_options;
auto orc_pool = std::make_shared<OrcMemoryPool>(pool);
reader_options.setMemoryPool(*orc_pool);
reader_options.setMemoryPool(*pool->AsOrcMemoryPool());
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code dereferences the result of AsOrcMemoryPool() without checking for nullptr. According to the documentation in memory_pool.h, AsOrcMemoryPool() can return nullptr if the ORC format library is not linked. This dereferencing will cause undefined behavior (likely a crash) if the ORC adaptor factory is not registered. Consider adding a nullptr check and returning an appropriate error if the ORC memory pool is unavailable.

Copilot uses AI. Check for mistakes.
if (pool) {
orc_memory_pool = std::make_shared<OrcMemoryPool>(pool);
writer_options.setMemoryPool(orc_memory_pool.get());
writer_options.setMemoryPool(pool->AsOrcMemoryPool());
Copy link

Copilot AI Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code dereferences the result of AsOrcMemoryPool() without checking for nullptr. According to the documentation in memory_pool.h, AsOrcMemoryPool() can return nullptr if the ORC format library is not linked. This dereferencing will cause undefined behavior (likely a crash) if the ORC adaptor factory is not registered. Consider adding a nullptr check and returning an appropriate error if the ORC memory pool is unavailable.

Copilot uses AI. Check for mistakes.
/// If the ORC format library is not linked, this method returns nullptr.
/// @note Subclasses can override this method to provide custom ORC integration.
virtual ::orc::MemoryPool* AsOrcMemoryPool();

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The memory pool should not depend on plugin-specific types. Currently, it requires knowledge of concrete reader/format classes, which breaks modularity. A user's custom plugin shouldn't force changes in core memory headers. May need to refactor for better decoupling between memory management and data readers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! You're absolutely right that we shouldn't add plugin-specific memory pool interfaces to the core headers. So I refactored it into a more generic template method AsSpecifiedMemoryPool, while keeping the original factory registration mechanism intact. Individual plugins can then wrap this method with convenience functions to simplify usage, like this example for ORC:

inline ::orc::MemoryPool* AsOrcMemoryPool(paimon::MemoryPool& pool) {
    return pool.AsSpecifiedMemoryPool<::orc::MemoryPool>();
}

@lxy-9602 lxy-9602 requested a review from lucasfang February 6, 2026 09:08
@lxy-9602 lxy-9602 requested a review from lszskye February 6, 2026 09:08
///
/// @return Pointer to the Arrow memory pool adapter.
/// @note Subclasses can override this method to provide custom Arrow integration.
virtual arrow::MemoryPool* AsArrowMemoryPool();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

paimon interface only has c arrow api, arrow::MemoryPool is C++ symbol which should not be in here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Decoupling BatchReader from returned batch lifecycles

3 participants