Postpone index build for empty table to first insert by therealdarkknight · Pull Request #209 · lanterndata/lantern

therealdarkknight · 2023-10-19T04:50:05Z

Fixes #198

In the event of an index being built on an empty table where no dimension is specified in the options (using CREATE INDEX... WITH (dim=k)), we postpone the creation of this index to the first insertion, where we can get the dimension of the inserted vector and then build the index. This improves user UX as outlined in the issue.

I updated tests, and expanded on the hnsw_create test to test the above functionality.

Notes/Observations:
0. One thing to consider is what might happen when we have concurrent inserts/calls to aminsert when the index build is postponed. We don't want to trigger the index build more than once accidentally. Perhaps we should pass a Boolean state that marks the progress of a postponed index build and check against it? I have yet to look into Postgres' concurrency model. Feel free to share ideas.

The BuildCallback immediately returns after the first tuple is read (explained in the comment there). This was done for simplicity's sake. But this also means that postgres is pointlessly calling this callback for every other tuple in the heap that exists at that time, since each callback after the first one will immediately return. So a future optimization may be, on postponed builds, to only read the first row and write it to the index without performing the full heap scan with the callback. However, note that this is only relevant in a specific scenario: when we construct an index on an empty table, specify no dimension, and do a bulk insert of some sort as part of the first insert (like \COPY on a csv file), which is when the heap would have more than one tuple at the time of building the postponed index. Since this scenario is pretty rare, we already have a simple implementation, and we are not guaranteed significant performance boosts if we opt for the alternative, I think that the current approach is well worth it.
I noticed that the original codebase, when inferring a dimension (via the GetArrayLengthFromHeap method), we ditch inference entirely upon encountering a NULL vector. This means that we're not able to inference when the first encountered tuple is NULL but there are other non-NULL tuples. Wanted to point this out-- I don't think it would be a hard change: we can just keep going until we find a non-NULL vector or reach the end.
I believe some of the code in GetHnswIndexDimensions can be refactored. It basically tries to do the same thing as InferDimension, and both use GetArrayLength. Idea is to have GetHnswIndexDimensions just call InferDimension and then refactor the InitBuildState as well, since it is currently calling both of these methods under certain circumstances which leads to the same operation effectively being done twice. But the way that GetHnswIndexDimensions and InferDimension use GetArraylength is also slightly different. Is the relevant part of GetHnswIndexDimensions and InferDimension trying to accomplish the same thing? Curious to hear thoughts on this.

var77 · 2023-10-27T12:43:03Z

Thank you @therealdarkknight !
I was curious, what if we build the index on empty table with let's say 1 dimension and then on the first insert if table is empty update the index header (with exclusive lock) by inferring the dimensions from the current tuple being inserted?

Will there be any problems with that approach or does postponing the index build have more advantages, what do you think?

cc: @Ngalstyan4

Ngalstyan4 · 2023-10-31T05:02:00Z

I agree with @var77's suggestion. That makes things cleaner.

Let's postpone this for now tho and concentrate on other things first.

Ngalstyan4 · 2024-01-13T23:59:52Z

Superseded by #251

therealdarkknight added 5 commits October 18, 2023 22:34

fixed all bugs and modified test outputs to a point wherethey're working

535e81d

cleaned up and formatted code

17dc848

expanded on hnsw_create test to test postponed index build functionality

c304a8a

postponing index build only when dimension is not specified now

788d19b

minor formatting

a8d7396

therealdarkknight mentioned this pull request Dec 19, 2023

Pick up index dimension on first insert for an empty table #251

Open

Ngalstyan4 closed this Jan 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Postpone index build for empty table to first insert#209

Postpone index build for empty table to first insert#209
therealdarkknight wants to merge 5 commits intolanterndata:mainfrom
therealdarkknight:build_index_on_first_insert_for_empty_table

therealdarkknight commented Oct 19, 2023 •

edited by Ngalstyan4

Loading

Uh oh!

var77 commented Oct 27, 2023

Uh oh!

Ngalstyan4 commented Oct 31, 2023

Uh oh!

Ngalstyan4 commented Jan 13, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

therealdarkknight commented Oct 19, 2023 • edited by Ngalstyan4 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

var77 commented Oct 27, 2023

Uh oh!

Ngalstyan4 commented Oct 31, 2023

Uh oh!

Ngalstyan4 commented Jan 13, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

therealdarkknight commented Oct 19, 2023 •

edited by Ngalstyan4

Loading