Postpone index build for empty table to first insert#209
Closed
therealdarkknight wants to merge 5 commits intolanterndata:mainfrom
Closed
Postpone index build for empty table to first insert#209therealdarkknight wants to merge 5 commits intolanterndata:mainfrom
therealdarkknight wants to merge 5 commits intolanterndata:mainfrom
Conversation
Collaborator
|
Thank you @therealdarkknight ! Will there be any problems with that approach or does postponing the index build have more advantages, what do you think? cc: @Ngalstyan4 |
Contributor
|
I agree with @var77's suggestion. That makes things cleaner. Let's postpone this for now tho and concentrate on other things first. |
Contributor
|
Superseded by #251 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #198
In the event of an index being built on an empty table where no dimension is specified in the options (using
CREATE INDEX... WITH (dim=k)), we postpone the creation of this index to the first insertion, where we can get the dimension of the inserted vector and then build the index. This improves user UX as outlined in the issue.I updated tests, and expanded on the
hnsw_createtest to test the above functionality.Notes/Observations:
0. One thing to consider is what might happen when we have concurrent inserts/calls to aminsert when the index build is postponed. We don't want to trigger the index build more than once accidentally. Perhaps we should pass a Boolean state that marks the progress of a postponed index build and check against it? I have yet to look into Postgres' concurrency model. Feel free to share ideas.
The
BuildCallbackimmediately returns after the first tuple is read (explained in the comment there). This was done for simplicity's sake. But this also means that postgres is pointlessly calling this callback for every other tuple in the heap that exists at that time, since each callback after the first one will immediately return. So a future optimization may be, on postponed builds, to only read the first row and write it to the index without performing the full heap scan with the callback. However, note that this is only relevant in a specific scenario: when we construct an index on an empty table, specify no dimension, and do a bulk insert of some sort as part of the first insert (like\COPYon acsvfile), which is when the heap would have more than one tuple at the time of building the postponed index. Since this scenario is pretty rare, we already have a simple implementation, and we are not guaranteed significant performance boosts if we opt for the alternative, I think that the current approach is well worth it.I noticed that the original codebase, when inferring a dimension (via the
GetArrayLengthFromHeapmethod), we ditch inference entirely upon encountering aNULLvector. This means that we're not able to inference when the first encountered tuple isNULLbut there are other non-NULL tuples. Wanted to point this out-- I don't think it would be a hard change: we can just keep going until we find a non-NULL vector or reach the end.I believe some of the code in
GetHnswIndexDimensionscan be refactored. It basically tries to do the same thing asInferDimension, and both useGetArrayLength. Idea is to haveGetHnswIndexDimensionsjust callInferDimensionand then refactor theInitBuildStateas well, since it is currently calling both of these methods under certain circumstances which leads to the same operation effectively being done twice. But the way thatGetHnswIndexDimensionsandInferDimensionuseGetArraylengthis also slightly different. Is the relevant part ofGetHnswIndexDimensionsandInferDimensiontrying to accomplish the same thing? Curious to hear thoughts on this.