Skip to content

Comments

Consolidate load_bin somewhat#792

Open
hildebrandmw wants to merge 7 commits intomainfrom
mhildebr/load-bin
Open

Consolidate load_bin somewhat#792
hildebrandmw wants to merge 7 commits intomainfrom
mhildebr/load-bin

Conversation

@hildebrandmw
Copy link
Contributor

@hildebrandmw hildebrandmw commented Feb 22, 2026

I wanted a load_bin that:

  1. Returned a Matrix<T>
  2. Did not pull in diskann-providers.

In the process, I discovered we had two separate load_bin implementations (storage_utils.rs and file_util.rs).
The version in file_utils had several bugs:

  1. It double-allocated (creating a byte buffer, reading into the byte-buffer, and then copying out the results into a Vec<T>).
  2. The allocation of said byte-buffer is not guaranteed to be aligned to T, so the bytemuck cast could fail.

This PR replaces all of that with a single implementation in diskann-utils::io.

A tour

New in diskann-utils:

  • io module:

    • A Metadata struct with read/write inherent methods (replacing read_metadata and write_metadata).
      This struct has a checked constructor to validate the conversion to 32-bit integers.

    • read_bin and write_bin for the actual IO.
      These methods return and accept Matrix and MatrixView respectively.
      Furthermore, they take simple Read + Seek or Write types and begin reading/writing at the current position of the respective stream.

  • MatrixBase::column_vector: constructor mirroring the existing row_vector.

  • MatrixBase::map: element-wise transform producing a new Matrix (replaces convert_types).

  • MatrixBase::try_get: non-panicking tuple indexing.

  • Little-endian compile guard: Our handling of endianness throughout our reading and writing
    infrastructure is super inconsistent. My vote is to simply bail if we ever compile for a
    big-endian system and call it a day. I doubt we'll ever compile on PowerPC or MIPS.

Removed from diskann-providers:

  • file_util::load_bin (the double-allocating version)
  • storage_utils::Metadata / read_metadata / write_metadata (replaced by diskann_utils::io::Metadata)
  • save_bin! macro and generated save_bin_f32 / save_bin_u32 / save_bin_u64
  • convert_types helper (replaced by Matrix::map and slice iterators)

Thin wrappers retained in diskann-providers::utils:

  • read_bin_from(reader, offset) — seek + read_bin
  • write_bin_from(data, writer, offset) — seek + write_bin

These exist because some callers (PQ pivot files) need offset-based access.

pq_storage.rs rework (deserves a closer look)

The PQ pivot file has a multi-section layout: an offset table at position 0, then 3 contiguous data sections starting at METADATA_SIZE.
The old code used save_bin_f32 / save_bin_u32 / save_bin_u64 with explicit offsets for every section, even though the writes are sequential.

The new code makes the data flow explicit:

  • Write path: seeks once past the offset table gap, writes 3 sections sequentially with write_bin, then seeks back to offset 0 to write the now-known offset table with write_bin_from.
  • Read path: this is functionally unchanged. The offsets are still used to seek in the reader to the designated locations. I am not 100% sure there aren't files existing that have offsets such that the centroids are not packed, so it seems prudent to keep this consistent.

The same pattern is applied to compress_existing_data's read path. Type conversions (usizeu32/u64 for the offset table and chunk offsets) now use idiomatic iter().map().collect() instead of convert_types.

@hildebrandmw hildebrandmw marked this pull request as ready for review February 22, 2026 00:23
@hildebrandmw hildebrandmw requested review from a team and Copilot February 22, 2026 00:23
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request consolidates two separate load_bin implementations into a single, improved implementation in diskann-utils::io. The new implementation fixes critical bugs in the old version (double allocation and potential alignment issues) and returns a Matrix<T> instead of a tuple, providing a more ergonomic API.

Changes:

  • Adds new io module to diskann-utils with Metadata, read_bin, and write_bin functions
  • Adds utility methods to Matrix/MatrixView: column_vector, map, and try_get
  • Removes buggy load_bin implementation from diskann-providers and replaces all call sites with the new API
  • Reworks pq_storage.rs to use sequential writes/reads instead of multiple offset-based operations

Reviewed changes

Copilot reviewed 40 out of 43 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
diskann-utils/src/lib.rs Adds compile-time endianness check and exports new io module
diskann-utils/src/io.rs New module with Metadata struct, read_bin/write_bin functions, and comprehensive error types
diskann-utils/src/views.rs Adds column_vector constructor, map method, and try_get for non-panicking indexing
diskann-utils/Cargo.toml Adds bytemuck dependency with must_cast feature
diskann/src/error/ann_error.rs Adds From implementations for new error types and corresponding tests
diskann-providers/src/utils/storage_utils.rs Removes old Metadata/read_metadata/write_metadata and save_bin_* macros; keeps thin wrappers read_bin_from/write_bin_from
diskann-providers/src/utils/file_util.rs Removes double-allocating load_bin implementation
diskann-providers/src/utils/mod.rs Updates exports to reflect removed functions
diskann-providers/src/storage/pq_storage.rs Reworks write/read paths for clearer sequential data flow
diskann-providers/src/model/pq/mod.rs Removes convert_types helper
diskann-tools/* Updates all call sites to use new API
diskann-disk/* Updates all call sites to use new API
diskann-benchmark/* Updates all call sites to use new API
Cargo.lock Updates bytemuck dependency for diskann-utils

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

if row >= self.nrows() || col >= self.ncols() {
None
} else {
// SAFETY: We just verified that `row` and `col~ are in-bounds.
Copy link

Copilot AI Feb 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in comment: "col~" should be "col"

Copilot uses AI. Check for mistakes.
if chunk_offsets_m.nrows() != *num_pq_chunks + 1 || chunk_offsets_m.ncols() != 1 {
return Err(ANNError::log_pq_error(format_args!(
"Error reading pq_pivots file at chunk offsets; \
file has nr={}, nc={} but expecting nr={} and nc=2.",
Copy link

Copilot AI Feb 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message says "expecting nr={} and nc=2" but should say "nc=1" since the check on line 240 verifies that chunk_offsets_m.ncols() != 1.

Suggested change
file has nr={}, nc={} but expecting nr={} and nc=2.",
file has nr={}, nc={} but expecting nr={} and nc=1.",

Copilot uses AI. Check for mistakes.
@codecov-commenter
Copy link

codecov-commenter commented Feb 22, 2026

Codecov Report

❌ Patch coverage is 86.71756% with 87 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.13%. Comparing base (6f8e8eb) to head (b20f8ee).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
diskann-providers/src/storage/pq_storage.rs 76.31% 18 Missing ⚠️
...ann-providers/src/model/pq/fixed_chunk_pq_table.rs 64.51% 11 Missing ⚠️
diskann-tools/src/utils/cmd_tool_error.rs 0.00% 11 Missing ⚠️
diskann-tools/src/bin/generate_minmax.rs 0.00% 10 Missing ⚠️
diskann-tools/src/utils/search_disk_index.rs 0.00% 7 Missing ⚠️
diskann-tools/src/utils/search_index_utils.rs 0.00% 6 Missing ⚠️
diskann-utils/src/io.rs 96.25% 6 Missing ⚠️
diskann-tools/src/utils/ground_truth.rs 58.33% 5 Missing ⚠️
diskann-tools/src/utils/build_disk_index.rs 0.00% 4 Missing ⚠️
diskann-providers/src/storage/bin.rs 50.00% 2 Missing ⚠️
... and 5 more
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #792      +/-   ##
==========================================
- Coverage   89.16%   89.13%   -0.04%     
==========================================
  Files         431      431              
  Lines       78895    78961      +66     
==========================================
+ Hits        70343    70378      +35     
- Misses       8552     8583      +31     
Flag Coverage Δ
miri 89.13% <86.71%> (-0.04%) ⬇️
unittests 89.13% <86.71%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
diskann-benchmark/src/inputs/async_.rs 37.72% <100.00%> (ø)
diskann-benchmark/src/utils/datafiles.rs 83.58% <100.00%> (ø)
diskann-disk/src/build/builder/build.rs 94.16% <100.00%> (ø)
diskann-disk/src/search/provider/disk_provider.rs 91.01% <100.00%> (-0.01%) ⬇️
...n-disk/src/search/provider/disk_vertex_provider.rs 87.30% <100.00%> (ø)
diskann-disk/src/storage/quant/generator.rs 92.72% <100.00%> (ø)
diskann-disk/src/storage/quant/pq/pq_generation.rs 93.33% <ø> (ø)
diskann-providers/src/index/diskann_async.rs 96.35% <100.00%> (-0.01%) ⬇️
diskann-providers/src/model/pq/pq_construction.rs 92.14% <100.00%> (-0.38%) ⬇️
diskann-providers/src/storage/index_storage.rs 99.71% <100.00%> (ø)
... and 25 more

... and 5 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants