Conversation
There was a problem hiding this comment.
Pull request overview
This pull request consolidates two separate load_bin implementations into a single, improved implementation in diskann-utils::io. The new implementation fixes critical bugs in the old version (double allocation and potential alignment issues) and returns a Matrix<T> instead of a tuple, providing a more ergonomic API.
Changes:
- Adds new
iomodule todiskann-utilswithMetadata,read_bin, andwrite_binfunctions - Adds utility methods to
Matrix/MatrixView:column_vector,map, andtry_get - Removes buggy
load_binimplementation fromdiskann-providersand replaces all call sites with the new API - Reworks
pq_storage.rsto use sequential writes/reads instead of multiple offset-based operations
Reviewed changes
Copilot reviewed 40 out of 43 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| diskann-utils/src/lib.rs | Adds compile-time endianness check and exports new io module |
| diskann-utils/src/io.rs | New module with Metadata struct, read_bin/write_bin functions, and comprehensive error types |
| diskann-utils/src/views.rs | Adds column_vector constructor, map method, and try_get for non-panicking indexing |
| diskann-utils/Cargo.toml | Adds bytemuck dependency with must_cast feature |
| diskann/src/error/ann_error.rs | Adds From implementations for new error types and corresponding tests |
| diskann-providers/src/utils/storage_utils.rs | Removes old Metadata/read_metadata/write_metadata and save_bin_* macros; keeps thin wrappers read_bin_from/write_bin_from |
| diskann-providers/src/utils/file_util.rs | Removes double-allocating load_bin implementation |
| diskann-providers/src/utils/mod.rs | Updates exports to reflect removed functions |
| diskann-providers/src/storage/pq_storage.rs | Reworks write/read paths for clearer sequential data flow |
| diskann-providers/src/model/pq/mod.rs | Removes convert_types helper |
| diskann-tools/* | Updates all call sites to use new API |
| diskann-disk/* | Updates all call sites to use new API |
| diskann-benchmark/* | Updates all call sites to use new API |
| Cargo.lock | Updates bytemuck dependency for diskann-utils |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
diskann-utils/src/views.rs
Outdated
| if row >= self.nrows() || col >= self.ncols() { | ||
| None | ||
| } else { | ||
| // SAFETY: We just verified that `row` and `col~ are in-bounds. |
There was a problem hiding this comment.
Typo in comment: "col~" should be "col"
| if chunk_offsets_m.nrows() != *num_pq_chunks + 1 || chunk_offsets_m.ncols() != 1 { | ||
| return Err(ANNError::log_pq_error(format_args!( | ||
| "Error reading pq_pivots file at chunk offsets; \ | ||
| file has nr={}, nc={} but expecting nr={} and nc=2.", |
There was a problem hiding this comment.
The error message says "expecting nr={} and nc=2" but should say "nc=1" since the check on line 240 verifies that chunk_offsets_m.ncols() != 1.
| file has nr={}, nc={} but expecting nr={} and nc=2.", | |
| file has nr={}, nc={} but expecting nr={} and nc=1.", |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #792 +/- ##
==========================================
- Coverage 89.16% 89.13% -0.04%
==========================================
Files 431 431
Lines 78895 78961 +66
==========================================
+ Hits 70343 70378 +35
- Misses 8552 8583 +31
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
I wanted a
load_binthat:Matrix<T>diskann-providers.In the process, I discovered we had two separate
load_binimplementations (storage_utils.rsandfile_util.rs).The version in
file_utilshad several bugs:Vec<T>).T, so the bytemuck cast could fail.This PR replaces all of that with a single implementation in
diskann-utils::io.A tour
New in
diskann-utils:iomodule:A
Metadatastruct withread/writeinherent methods (replacingread_metadataandwrite_metadata).This struct has a checked constructor to validate the conversion to 32-bit integers.
read_binandwrite_binfor the actual IO.These methods return and accept
MatrixandMatrixViewrespectively.Furthermore, they take simple
Read + SeekorWritetypes and begin reading/writing at the current position of the respective stream.MatrixBase::column_vector: constructor mirroring the existingrow_vector.MatrixBase::map: element-wise transform producing a new Matrix (replacesconvert_types).MatrixBase::try_get: non-panicking tuple indexing.Little-endian compile guard: Our handling of endianness throughout our reading and writing
infrastructure is super inconsistent. My vote is to simply bail if we ever compile for a
big-endian system and call it a day. I doubt we'll ever compile on PowerPC or MIPS.
Removed from
diskann-providers:file_util::load_bin(the double-allocating version)storage_utils::Metadata/read_metadata/write_metadata(replaced bydiskann_utils::io::Metadata)save_bin!macro and generatedsave_bin_f32/save_bin_u32/save_bin_u64convert_typeshelper (replaced byMatrix::mapand slice iterators)Thin wrappers retained in
diskann-providers::utils:read_bin_from(reader, offset)— seek + read_binwrite_bin_from(data, writer, offset)— seek + write_binThese exist because some callers (PQ pivot files) need offset-based access.
pq_storage.rs rework (deserves a closer look)
The PQ pivot file has a multi-section layout: an offset table at position 0, then 3 contiguous data sections starting at
METADATA_SIZE.The old code used
save_bin_f32/save_bin_u32/save_bin_u64with explicit offsets for every section, even though the writes are sequential.The new code makes the data flow explicit:
write_bin, then seeks back to offset 0 to write the now-known offset table withwrite_bin_from.The same pattern is applied to
compress_existing_data's read path. Type conversions (usize→u32/u64for the offset table and chunk offsets) now use idiomaticiter().map().collect()instead ofconvert_types.