Conversation
I've been working on testing this with @dskvr, have some feedback:
My relay has ~20m events, so this is a good test of the indexing functionality. We ran into some troubles with indexing (it stalled out after ~8m events), so @dskvr added some additional improvements in
I started indexing the db with this config:
Indexing started running great, I came back this morning and my logs are getting spammed with:
Where the counter never increments. It just keeps sending this same log over and over.
I tried
It's possible this is a red herring log, where because of my
My
On the bright side, query performance is great. Querying
I think this PR is on the right track here, just needs some tweaking on rebuilding the index on large datasets. Just my 2 sats.
This is very impressive, thank you! Sounds like more testing is necessary, but yes this looks broadly like it's on the right track.
@leesalminen Are you still running this branch? If so, any issues other than the debugging output bug?
@hoytech The only issue I am aware of is that when an indexing operation is interrupted, the debugging output is not correct on resume. Also, there needs to be a method to destroy the index.
@leesalminen Bugs you reported in
Overview
This PR implements NIP-50 (Search Capability) for strfry, enabling full-text search across Nostr events using BM25 ranking. The implementation includes:
Architecture
Core Components
- Search Provider Interface (`src/search/SearchProvider.h`)
- LMDB Search Backend (`src/search/LmdbSearchProvider.h`)
- Background Indexer (in `LmdbSearchProvider::runCatchupIndexer()`)
  - `SearchState.lastIndexedLevId`
- Search Runner (`src/search/SearchRunner.h`)
Database Schema
New LMDB tables (defined in `golpe.yaml`):
Configuration
Key settings in `strfry.conf` (relay.search):
Supported `candidateRanking` orders (desc for each component):
- `terms-tf-recency` (default)
- `terms-recency-tf`
- `tf-terms-recency`
- `tf-recency-terms`
- `recency-terms-tf`
- `recency-tf-terms`
Configuration Parameters
- `enabled`: Master switch for search functionality
- `backend`: Search provider implementation ("lmdb" or "noop")
- `indexedKinds`: Pattern of kinds to index (numbers/ranges/*/exclusions)
- `maxQueryTerms`: Maximum query terms parsed
- `maxPostingsPerToken`: Max postings per token key (upper bound during fetch; pruning TBD)
- `maxCandidateDocs`: Maximum candidates for scoring
- `overfetchFactor`: Candidate over-fetch before post-filtering
- `recencyBoostPercent`: Recency tie-breaker percent (0–100; 1 = 1%)
- `candidateRankMode`: `order` or `weighted`
- `candidateRanking`: Order used when mode=`order` (list above)
- `rankWeightTerms` / `rankWeightTf` / `rankWeightRecency`: Weights for mode=`weighted`
Usage
Enabling Search
Build strfry: `make -j$(nproc)`
Update `strfry.conf`:
Start strfry:
Indexing behavior:
Search Queries
Clients can issue NIP-50 search queries using the `search` filter field:
`{ "kinds": [1], "search": "bitcoin lightning network", "limit": 100 }`
Search features:
Monitoring
Background indexer logs:
Query metrics include search-specific timings when `relay.logging.dbScanPerf = true` (scan=Search).
Performance Characteristics
Indexing Performance
Query Performance
- `maxCandidateDocs` and result set size
Tuning guidelines:
- Lower `maxCandidateDocs` for faster queries with slightly lower recall
- Raise `overfetchFactor` to improve recall for multi-token queries
Benchmark Suite
Put something together for benchmarks, but didn't finish. Will likely remove it before marking ready for review
A comprehensive benchmark suite is included under `bench/`:
Running Benchmarks
Prepare a test database:
This generates cryptographically valid Nostr events using `nak` and ingests them into a fresh database.
Run the benchmark: `bench/scripts/run.sh -s scenarios/small.yml --out bench/results/raw/small-$(date +%s)`
Generate reports:
Benchmark Metrics
Testing
Manual Testing
Index a test database:
Issue search queries via WebSocket:
Verify results are returned in relevance order
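For the manual test above, it can help to see what a search query looks like on the wire. The sketch below builds a NIP-01 `REQ` message carrying the NIP-50 `search` filter from the example earlier in this description. The helper names `build_search_req` and `build_close` and the subscription id `search-1` are illustrative assumptions, not part of this PR, and the code only produces the JSON text; sending it over a WebSocket is left to whatever client you use.

```python
import json
import uuid


def build_search_req(query, kinds=None, limit=100, sub_id=None):
    """Return the JSON text of a NIP-01 REQ message with a NIP-50 `search` filter.

    Hypothetical helper for manual testing; not part of strfry.
    """
    if not query.strip():
        raise ValueError("search query must not be empty")

    # The filter is an ordinary NIP-01 filter with the extra `search` field.
    filt = {"search": query, "limit": limit}
    if kinds is not None:
        filt["kinds"] = list(kinds)

    # Any client-chosen string works as a subscription id.
    sub_id = sub_id or uuid.uuid4().hex[:16]
    return json.dumps(["REQ", sub_id, filt])


def build_close(sub_id):
    """Return the JSON text of a NIP-01 CLOSE message for the same subscription."""
    return json.dumps(["CLOSE", sub_id])


if __name__ == "__main__":
    print(build_search_req("bitcoin lightning network", kinds=[1], sub_id="search-1"))
    print(build_close("search-1"))
```

Per NIP-01 the relay answers with `EVENT` messages followed by `EOSE`. Since this PR treats search filters as one-shot queries, sending `CLOSE` after `EOSE` is a reasonable way to end the test.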
Integration Points
- `DBQuery.h`: Search queries execute alongside traditional index scans
- `ActiveMonitors.h`: Search filters excluded from live subscription indexes (one-shot queries)
- `QueryScheduler.h`: Search provider injected into query execution path
- `cmd_relay.cpp`: Background indexer lifecycle management
Migration Notes
Existing Databases
For existing strfry installations:
`cd golpe && ./build.sh && cd .. && make`
The indexer will automatically catch up on all existing events. Monitor logs for progress.
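One way to picture the catch-up behaviour is a checkpointed loop: index everything above a stored high-water mark, and advance the mark after each batch so an interrupted run can resume. The sketch below is an assumption-laden illustration in Python, not the C++ code in `LmdbSearchProvider::runCatchupIndexer()`. The names `catch_up`, `state`, `index_event`, and `batch_size` are hypothetical; only the idea of a last-indexed id mirrors `SearchState.lastIndexedLevId`.

```python
def catch_up(events, state, index_event, batch_size=2):
    """Index every event whose id is above the checkpoint, oldest first.

    events:      iterable of (lev_id, content) pairs (hypothetical shape)
    state:       dict holding the checkpoint under "last_indexed_lev_id"
    index_event: callback that adds one event to the search index
    Returns the number of events indexed during this run.
    """
    checkpoint = state["last_indexed_lev_id"]
    pending = sorted((lev_id, content) for lev_id, content in events if lev_id > checkpoint)

    indexed = 0
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        for lev_id, content in batch:
            index_event(lev_id, content)
        # Persist progress once per batch; a crash mid-batch re-indexes
        # at most `batch_size` events on the next run.
        state["last_indexed_lev_id"] = batch[-1][0]
        indexed += len(batch)
    return indexed


if __name__ == "__main__":
    events = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
    state = {"last_indexed_lev_id": 2}  # pretend ids 1 and 2 were indexed earlier
    seen = []
    count = catch_up(events, state, lambda lev_id, content: seen.append(lev_id))
    print(count, seen, state)
```

In a scheme like this, any progress counter shown in the logs has to be computed relative to the stored checkpoint, otherwise a resumed run reports misleading numbers.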
Rollback
To disable search without data loss:
Set `relay.search.enabled = false` in config
The search tables remain in the database but are not used. They can be manually removed using the `mdb` command-line tools if desired.
Known Limitations
- Indexes only the `content` field of events (does not index tags or metadata)
- `maxCandidateDocs` for optimal performance
Future Enhancements
Potential improvements for future iterations:
Related Issues