[WIP] CLI program to write ALP encode example parquet files #49154

Draft
alamb wants to merge 54 commits into apache:main from alamb:alamb/example_encoding_writer
alamb commented Feb 5, 2026

This builds on the following PR from @prtkgaur

It contains a binary that creates files using the new ALP encoding here:

I don't intend to merge this PR. Rather, I plan to use it to create test Parquet files, and I am posting it in case anyone else is interested.

To build

  cd arrow/cpp
  cmake -S . -B build -DARROW_PARQUET=ON -DPARQUET_BUILD_EXAMPLES=ON \
    -DCMAKE_POLICY_VERSION_MINIMUM=3.5 \
    -DARROW_MIMALLOC=OFF -DARROW_SIMD_LEVEL=NONE -DARROW_RUNTIME_SIMD_LEVEL=NONE
  MAKEFLAGS=-j8 cmake --build build --target parquet-write-parquet

To run

  cd arrow/cpp
  ./build/release/parquet-write-parquet --encoding ALP /tmp

This writes a file like this to /tmp: single_f64_ALP.zip

TODO: make sure the following patterns, from the spec, are covered:

  1. pages with no exceptions
  2. encoding with exceptions and NaN, Inf, etc.
  3. multiple ALP vector sizes (1 -> 15 == 65k)
  4. Both f32 and f64 variants
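
As a rough illustration of items 1, 2 and 4, input data for such files could be generated along the following lines. This is only a sketch: `MakeDecimalValues` and `MakeValuesWithSpecials` are hypothetical helpers, not part of this PR or of Arrow, and the assumption that short decimal values need few or no ALP exceptions should be checked against the spec.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not in the PR): values with two decimal digits,
// i.e. data that is assumed to encode with few or no exceptions.
template <typename T>
std::vector<T> MakeDecimalValues(std::size_t n) {
  std::vector<T> out;
  out.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    out.push_back(static_cast<T>(i) / static_cast<T>(100));
  }
  return out;
}

// Hypothetical helper: the same data, with NaN, +Inf and -Inf written
// into every `stride`-th slot so that exceptions are guaranteed to occur.
template <typename T>
std::vector<T> MakeValuesWithSpecials(std::size_t n, std::size_t stride) {
  std::vector<T> out = MakeDecimalValues<T>(n);
  if (stride == 0) return out;  // avoid an infinite loop on bad input
  const T specials[] = {std::numeric_limits<T>::quiet_NaN(),
                        std::numeric_limits<T>::infinity(),
                        -std::numeric_limits<T>::infinity()};
  std::size_t k = 0;
  for (std::size_t i = 0; i < n; i += stride) {
    out[i] = specials[k % 3];  // cycle NaN, +Inf, -Inf
    ++k;
  }
  return out;
}
```

Instantiating both helpers with `float` and `double` covers the f32 and f64 variants. Varying the ALP vector size (item 3) is a writer setting and is not modeled here.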

sfc-gh-pgaur and others added 23 commits January 12, 2026 15:59
Set up base to allow metadata to be before data
This will allow for easy future extensibility
This commit changes the ALP page layout from grouped metadata to an
offset-based interleaved layout:

OLD (grouped metadata):
  [Header][AlpInfos...][ForInfos...][Data...]

NEW (offset-based):
  [Header][Offsets...][Vector0][Vector1]...
  where each Vector = [AlpInfo|ForInfo|Data]

Benefits:
- O(1) random access to any vector (previously O(n) to compute offsets)
- Better locality for single-vector decompression
- Enables parallel decompression without coordination
- Storage overhead: 4 bytes per vector (~0.4% for typical pages)

Changes:
- alp_constants.h: Add OffsetType (uint32_t) for vector offsets
- alp_wrapper.cc: EncodeAlp writes offsets + interleaved vectors
- alp_wrapper.cc: DecodeAlp reads offsets and jumps directly to vectors
- alp_wrapper.cc: GetMaxCompressedSize accounts for offset storage
- alp_test.cc: Fix duplicate test name (EmptyInput -> EmptyInputViaCompression)
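
The offset-based layout described above can be sketched with a toy page format. This is not the code from alp_wrapper.cc: the real AlpHeader fields are not shown in this conversation, so a single u32 vector count stands in for the header, offsets are assumed to be measured from the start of the page, and each vector is an opaque byte string. `BuildPage` and `ReadVector` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

using OffsetType = uint32_t;  // 4-byte offset, as named in the commit message

// Toy page: [num_vectors: u32][offsets: u32 * n][Vector0][Vector1]...
// Assumption: the u32 count is a placeholder for the real header.
std::vector<uint8_t> BuildPage(const std::vector<std::string>& vectors) {
  const uint32_t n = static_cast<uint32_t>(vectors.size());
  const std::size_t header_size = sizeof(uint32_t);
  const std::size_t offsets_size = n * sizeof(OffsetType);
  std::size_t total = header_size + offsets_size;
  for (const auto& v : vectors) total += v.size();

  std::vector<uint8_t> page(total);
  std::memcpy(page.data(), &n, sizeof(n));
  std::size_t cursor = header_size + offsets_size;
  for (uint32_t i = 0; i < n; ++i) {
    // Record where vector i starts, then copy its bytes there.
    const OffsetType off = static_cast<OffsetType>(cursor);
    std::memcpy(page.data() + header_size + i * sizeof(OffsetType), &off,
                sizeof(off));
    std::memcpy(page.data() + cursor, vectors[i].data(), vectors[i].size());
    cursor += vectors[i].size();
  }
  return page;
}

// O(1) lookup: read the start offset and the next offset (or the page
// end for the last vector), then slice. No scan over earlier vectors.
std::string ReadVector(const std::vector<uint8_t>& page, uint32_t index) {
  const std::size_t header_size = sizeof(uint32_t);
  if (page.size() < header_size) throw std::out_of_range("page too small");
  uint32_t n = 0;
  std::memcpy(&n, page.data(), sizeof(n));
  if (index >= n) throw std::out_of_range("vector index out of range");

  OffsetType begin = 0;
  std::memcpy(&begin, page.data() + header_size + index * sizeof(OffsetType),
              sizeof(begin));
  OffsetType end = static_cast<OffsetType>(page.size());
  if (index + 1 < n) {
    std::memcpy(&end,
                page.data() + header_size + (index + 1) * sizeof(OffsetType),
                sizeof(end));
  }
  return std::string(reinterpret_cast<const char*>(page.data()) + begin,
                     end - begin);
}
```

Because each lookup touches only the offset table and one vector, independent readers can decode different vectors of the same page without sharing state, which is the property the commit message points to for parallel decompression.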
Update specification and code comments to document the new page layout:
- alp.h: Updated AlpEncodedVector class documentation
- ALP_Encoding_Specification.md: Updated page layout diagrams and descriptions
Add the ability to pre-compute sampling presets and reuse them for
encoding. This is useful for:
- Pre-computing presets outside benchmark loops (removes sampling overhead)
- Reusing presets across multiple batches with similar data characteristics

New public methods in AlpWrapper:
- CreateSamplingPreset(): Samples input data and returns a preset
- EncodeWithPreset(): Encodes using a pre-computed preset

The existing Encode() method now delegates to these new methods internally.

Also adds tests:
- EncodeWithPreset: Verifies preset-based encoding produces identical output
- PresetReuseAcrossBatches: Verifies preset can be reused for multiple encodes
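
The sample-once, encode-many pattern from this commit can be illustrated with a toy model. The method names below mirror the commit message, but the signatures, `ToyPreset`, `ToyEncoded` and the single-exponent scheme are all assumptions made for illustration. The real AlpWrapper API and the real ALP parameter search are not shown in this conversation.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-ins (hypothetical): a preset is just one decimal exponent.
struct ToyPreset {
  int exponent;
};

struct ToyEncoded {
  std::vector<int64_t> digits;          // round(value * 10^exponent)
  std::vector<std::size_t> exceptions;  // positions that did not round-trip
  bool operator==(const ToyEncoded& o) const {
    return digits == o.digits && exceptions == o.exceptions;
  }
};

class ToyAlpWrapper {
 public:
  // Sample every 8th value and keep the smallest exponent in [0, 10]
  // that yields the fewest exceptions on the sample.
  static ToyPreset CreateSamplingPreset(const std::vector<double>& values) {
    ToyPreset best{0};
    std::size_t best_exceptions = values.size() + 1;
    for (int e = 0; e <= 10; ++e) {
      std::size_t exceptions = 0;
      for (std::size_t i = 0; i < values.size(); i += 8) {
        if (!RoundTrips(values[i], e)) ++exceptions;
      }
      if (exceptions < best_exceptions) {
        best_exceptions = exceptions;
        best = ToyPreset{e};
      }
    }
    return best;
  }

  // Encode without sampling: the caller supplies the preset.
  static ToyEncoded EncodeWithPreset(const std::vector<double>& values,
                                     const ToyPreset& preset) {
    ToyEncoded out;
    const double scale = Pow10(preset.exponent);
    for (std::size_t i = 0; i < values.size(); ++i) {
      if (RoundTrips(values[i], preset.exponent)) {
        out.digits.push_back(std::llround(values[i] * scale));
      } else {
        out.digits.push_back(0);  // placeholder; value is an exception
        out.exceptions.push_back(i);
      }
    }
    return out;
  }

  // Convenience path: sample, then delegate to EncodeWithPreset.
  static ToyEncoded Encode(const std::vector<double>& values) {
    return EncodeWithPreset(values, CreateSamplingPreset(values));
  }

 private:
  static double Pow10(int e) {
    double scale = 1.0;
    for (int i = 0; i < e; ++i) scale *= 10.0;
    return scale;
  }

  static bool RoundTrips(double v, int exponent) {
    if (!std::isfinite(v)) return false;  // NaN and Inf are exceptions
    const double scale = Pow10(exponent);
    const double scaled = v * scale;
    if (std::fabs(scaled) > 9.0e15) return false;  // keep llround in range
    return static_cast<double>(std::llround(scaled)) / scale == v;
  }
};
```

In this toy, a preset created from one batch can be passed to `EncodeWithPreset` for later batches, and `Encode` on a batch gives the same result as creating a preset from that batch and encoding with it, which is the kind of property the two new tests are described as checking.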
- alp.h: Update serialization flow comment, AlpEncodedVectorView comments
- alp.h: Mark AlpMetadataCache as LEGACY (not used with offset layout)
- alp_wrapper.cc: Update AlpHeader page layout comment
- ALP_Encoding_Specification.md: Fix vector size to be configurable
- ALP_Encoding_Specification_terse.md: Complete rewrite for offset layout

With the offset-based layout, AlpMetadataCache is no longer needed since
the offset array provides O(1) random access directly.

github-actions bot commented Feb 5, 2026

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

alamb changed the title from "Alamb/example encoding writer" to "[WIP] CLI program to write ALP encode example parquet files" on Feb 5, 2026