Add `row_schema` field to `SpaceInfo` by a-rodin · Pull Request #18 · pathscale/DataBucket

a-rodin · 2024-12-05T12:02:22Z

This PR is a part of work on pathscale/WorkTable#24.

Handy-caT · 2024-12-24T19:04:58Z

src/page/util.rs

+    let mut result: Vec<IndexValue<T>> = vec![];
+    for interval in intervals.iter() {
+        for index in interval.0..=interval.1 {
+            let mut index_records = parse_index_page::<T, PAGE_SIZE>(file, index as u32)?;


Hm, it looks like for every Link we currently seek to the page start, re-read the header, and then seek to the data start. I don't think that's the best way to do it.

Handy-caT · 2024-12-24T19:06:14Z

src/page/util.rs

+    Ok(result)
+}
+
+fn read_links<DataType, const PAGE_SIZE: usize>(


This part also looks suspicious. I don't think it will work for big tables.

Handy-caT · 2024-12-24T19:22:26Z

@a-rodin I think we can use an iterator-based approach. You'll need to implement custom iterators for this task. I think there will be two: an index iterator and a data iterator.

The index iterator will load one index page and yield its links. Once a page has been fully read, the next page is loaded into memory and parsed. So one page load will provide data for many subsequent calls.
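A minimal sketch of the page-buffered index iterator described above. The `Link` struct here is a simplified, hypothetical stand-in for the crate's type, and the `load_page` closure is an assumed placeholder for whatever actually parses an index page from the file.

```rust
/// Hypothetical, simplified stand-in for the crate's `Link`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Link {
    pub page_id: u32,
    pub offset: u32,
    pub length: u32,
}

/// Yields links one at a time, loading a whole index page only when the
/// in-memory buffer has been exhausted.
pub struct BufferedLinkIterator<F>
where
    F: FnMut(u32) -> std::io::Result<Vec<Link>>,
{
    load_page: F,                     // assumed page loader (not the real API)
    pages: std::vec::IntoIter<u32>,   // index page ids still to visit
    buffer: std::vec::IntoIter<Link>, // links parsed from the current page
}

impl<F> BufferedLinkIterator<F>
where
    F: FnMut(u32) -> std::io::Result<Vec<Link>>,
{
    pub fn new(pages: Vec<u32>, load_page: F) -> Self {
        Self {
            load_page,
            pages: pages.into_iter(),
            buffer: Vec::new().into_iter(),
        }
    }
}

impl<F> Iterator for BufferedLinkIterator<F>
where
    F: FnMut(u32) -> std::io::Result<Vec<Link>>,
{
    type Item = std::io::Result<Link>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            // Serve from memory while the current page still has links.
            if let Some(link) = self.buffer.next() {
                return Some(Ok(link));
            }
            // Buffer is empty: load the next page, or stop if none remain.
            let page_id = self.pages.next()?;
            match (self.load_page)(page_id) {
                Ok(links) => self.buffer = links.into_iter(),
                Err(e) => return Some(Err(e)),
            }
        }
    }
}
```

With this shape, the loader runs once per page regardless of how many links the page holds.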

For the data iterator, you need to add a fn that reads multiple data records from a single page. It will take a vec of Links as an argument instead of a single Link. This array will already be sorted, so the fn must be used only with an array of sorted Links, or panic at runtime. In this case we will parse the header only once and then read the data records one by one, seeking only forward.
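The function described above could look roughly like the following sketch. The page layout is an assumption made for illustration: a fixed `HEADER_LEN`-byte header followed by data, with `Link::offset` measured from the data start. `Link` and `read_records_from_page` are hypothetical names, not the crate's API.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical, simplified stand-in for the crate's `Link`.
#[derive(Debug, Clone, Copy)]
pub struct Link {
    pub page_id: u32,
    pub offset: u32, // assumed to be relative to the page's data start
    pub length: u32,
}

/// Assumed header size, for illustration only.
pub const HEADER_LEN: usize = 16;

/// Reads all records addressed by `links` from one page.
/// Panics if the links span several pages or are not sorted by offset.
pub fn read_records_from_page<R: Read + Seek>(
    file: &mut R,
    page_size: u64,
    links: &[Link],
) -> io::Result<Vec<Vec<u8>>> {
    let Some(first) = links.first() else {
        return Ok(Vec::new());
    };
    assert!(
        links.iter().all(|l| l.page_id == first.page_id),
        "links must belong to a single page"
    );
    assert!(
        links.windows(2).all(|w| w[0].offset + w[0].length <= w[1].offset),
        "links must be sorted by offset and must not overlap"
    );

    // One absolute seek to the page start, then the header is read once.
    file.seek(SeekFrom::Start(u64::from(first.page_id) * page_size))?;
    let mut header = [0u8; HEADER_LEN];
    file.read_exact(&mut header)?;

    // The cursor now sits at the data start; from here on it only moves forward.
    let mut position = 0u64;
    let mut records = Vec::with_capacity(links.len());
    for link in links {
        let gap = u64::from(link.offset) - position;
        if gap > 0 {
            file.seek(SeekFrom::Current(gap as i64))?;
        }
        let mut record = vec![0u8; link.length as usize];
        file.read_exact(&mut record)?;
        position = u64::from(link.offset) + u64::from(link.length);
        records.push(record);
    }
    Ok(records)
}
```

A real implementation would parse the header bytes rather than discard them; the sketch only shows that the header is touched once per page.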

The data iterator is harder, I think. The issue we have now is frequent re-reading of data and too many seeks. I think the data iterator can use the following logic. It will use the index iterator to get a vec of Links from a full page. Then we need to sort the Links by their offset from the start of the file. In this case our seeks always move forward, but we will get unsorted data. To solve this we can use the same approach as for the index: load a full page of data into memory and yield from it until we need to load the next one. So the resulting flow is:

1. load the Links of an index page
2. allocate a vector for the data
3. sort the Links vector while saving the original order
4. divide the Links by pages
5. use the fn from the previous part to read data records from each page
6. collect the data records into a buffer array
7. yield data from the buffer until it ends
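The flow above can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `Link` is a simplified stand-in, and `read_page` is a hypothetical callback playing the role of the per-page reader, expected to return one record per link it is given.

```rust
/// Hypothetical, simplified stand-in for the crate's `Link`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Link {
    pub page_id: u32,
    pub offset: u32,
    pub length: u32,
}

/// Reads the records for `links` (given in index order) and returns them in
/// that same order, while visiting pages and offsets in ascending order.
pub fn read_in_index_order<F>(links: &[Link], mut read_page: F) -> Vec<Vec<u8>>
where
    F: FnMut(u32, &[Link]) -> Vec<Vec<u8>>,
{
    // Remember each link's original position, then sort by file position.
    let mut order: Vec<(usize, Link)> = links.iter().copied().enumerate().collect();
    order.sort_by_key(|(_, l)| (l.page_id, l.offset));

    // Buffer with one slot per link, filled out of order.
    let mut buffer: Vec<Option<Vec<u8>>> = vec![None; links.len()];

    let mut start = 0;
    while start < order.len() {
        // Find the run of links that live on the same page.
        let page_id = order[start].1.page_id;
        let end = start
            + order[start..]
                .iter()
                .take_while(|(_, l)| l.page_id == page_id)
                .count();
        let page_links: Vec<Link> = order[start..end].iter().map(|(_, l)| *l).collect();

        // One call per page; records come back in offset order.
        let records = read_page(page_id, &page_links);
        for ((original, _), record) in order[start..end].iter().zip(records) {
            buffer[*original] = Some(record);
        }
        start = end;
    }

    // Restore index order.
    buffer
        .into_iter()
        .map(|r| r.expect("read_page must return one record per link"))
        .collect()
}
```

Storing the original position next to each link is what lets the reads happen in file order while the results are still yielded in index order.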

a-rodin · 2024-12-25T15:19:28Z

@Handy-caT I have added three iterators, PageIterator, LinkIterator, and DataIterator.

the2pizza

Looks good to me with small notes

the2pizza · 2025-01-17T21:28:31Z

src/page/util.rs

-use super::{Interval, SpaceInfo};
-
 pub fn map_index_pages_to_general<T>(pages: Vec<IndexData<T>>) -> Vec<General<IndexData<T>>> {
    let mut header = &mut GeneralHeader::new(0.into(), PageType::Index, 0.into());


According to my compiler warnings, it shouldn't be mut.
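For context, a small sketch of why the compiler complains here: a binding that holds a `&mut` reference only needs to be `mut` itself if the binding is reassigned. `Header` below is a hypothetical stand-in for `GeneralHeader`.

```rust
/// Hypothetical stand-in for `GeneralHeader`.
#[derive(Debug, Default)]
pub struct Header {
    pub page_id: u32,
}

/// Mutates the header through the reference and returns the new id.
pub fn next_page_id(header: &mut Header) -> u32 {
    header.page_id += 1;
    header.page_id
}

pub fn demo() -> u32 {
    // Writing `let mut header = &mut ...` here would trigger the `unused_mut`
    // lint: the reference is mutable, but the binding is never reassigned.
    let header = &mut Header::default();
    next_page_id(header);
    next_page_id(header)
}
```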

the2pizza · 2025-01-17T22:01:35Z

src/page/util.rs

-        for index in interval.0..interval.1 {
-            let index_page = parse_index_page::<T, PAGE_SIZE>(file, index as u32)?;
-            result.push(index_page.inner);
+        for index in interval.0..=interval.1 {


I'm a bit doubtful when I see nested loops. To improve performance, I suppose we can merge intervals to reduce the number of iterations.
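A minimal sketch of what merging could look like, assuming closed intervals of page indices as in the `interval.0..=interval.1` loop. The `Interval` tuple struct here is a hypothetical stand-in and `merge_intervals` is not a function from the PR.

```rust
/// Hypothetical stand-in for the crate's `Interval`: a closed range of pages.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Interval(pub usize, pub usize);

/// Merges overlapping or adjacent closed intervals into the smallest
/// equivalent set, sorted by start.
pub fn merge_intervals(mut intervals: Vec<Interval>) -> Vec<Interval> {
    intervals.sort_by_key(|i| i.0);
    let mut merged: Vec<Interval> = Vec::with_capacity(intervals.len());
    for current in intervals {
        if let Some(last) = merged.last_mut() {
            // Closed intervals: [0, 2] and [3, 4] are adjacent, so they merge.
            if current.0 <= last.1.saturating_add(1) {
                last.1 = last.1.max(current.1);
                continue;
            }
        }
        merged.push(current);
    }
    merged
}
```

Merging only saves work when intervals overlap or touch; for disjoint intervals the number of pages visited stays the same.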

a-rodin and others added 30 commits December 5, 2024 15:00

Add row_schema field to SpaceInfo

38eba34

WIP reading of fields

927c95f

Store column data types as strings

b8c9124

Rename secondary_index_map to secondary_index_types

582a2ac

Merge branch 'main' into row_schema

ddd7de2

Add a function that reads arbitrary archived structs

858aaa7

Support more types and correct padding in parse_archived_row function

cff4ebf

Make code more DRY

ed0f458

Support f64 and f32 data types

b17598e

Make usage of commas in the match arms more consistent

7d839e3

Rename the test for parse_archived_row

276b5bc

Merge branch 'main' into parse_rkyv_data

ff90def

Add primary_key_type field

5e606b8

Change primary_key_type to primary_key_fields in SpaceInfo

fc147ac

Fix test_as_bytes test

3ef34f0

Storea and read vectors of index records in index pages

6ca08cc

WIP reading of data pages

cc1e313

Merge branch 'parse_rkyv_data' into row_schema

6899c9d

An implementation of reading rows from the database

e719f0b

corrections

6f62940

Implement DataType for numerical types

7cfd277

Support all data types

aaf7d45

Run cargo fmt

ea2529d

Remove unused imports

0ba0db1

Start implementing a test for reading row data

ac882f9

Create a mock database for test_read_table_data test

cedd6b0

Read vectors of IndexValue instead of vectors of IndexData

e4b3386

Make test_read_table_data test pass

878f061

Merge branch 'main' into row_schema

de423bb

Run cargo fmt

cf90eda

a-rodin added 6 commits December 23, 2024 17:48

Add parse_data_page function

bdf019f

Support more primary key data types

e9ffb39

Make intervals closed

6cc2530

Fix a broken test

4b9c15c

Add read_rows_schema function

9d22e37

Add Display trait to DataValueType

6dd3b95

Handy-caT reviewed Dec 24, 2024


a-rodin added 5 commits December 25, 2024 10:23

Add PageIterator

94f25fd

Add LinkIterator

0fc7a8a

Use absolute seek instead of relative seeks

3de33c1

Add DataIterator and a test for it

5db7cb2

Remove an unused import

a8a2412

a-rodin added 5 commits December 27, 2024 13:42

Infer the type of the primary key from SpaceInfo

b46ff54

Make the code more DRY

1812810

Remove unused imports and unneeded mut

7a49d0d

Merge branch 'main' into row_schema

27ff08d

Merge branch main

c52f5b8

the2pizza approved these changes Jan 17, 2025

a-rodin wants to merge 46 commits into main from row_schema


3 participants
