Sourcecode of the paper Bespoke OLAP: Synthesizing Workload-Specific One-size-fits-one Database Engines
The generated Cpp Artifacts of Bespoke-TPCH and Bespoke-CEB can be found here https://github.com/DataManagementLab/BespokeOLAP_Artifacts.
An LLM agent that automatically generates and optimizes custom C++ OLAP query engines for user specified workloads. The agent generates C++ code, compiles it, and iteratively improves performance through sophisticaed optimization loops. Results are tracked in Weights & Biases (wandb).
Bespoke OLAP: Synthesizing Workload-Specific Database Engines
Performance improvements across workloads
Statistics of the generated engines and optimization runs can be found in this public wandb project.
- Storage plan generation — the agent designs a custom data layout for the target workload
- Base implementation — the agent generates a complete C++ engine (loader, builder, query executors)
- Optimization loop — the agent iteratively improves performance, guided by speedup metrics and automatic validation against DuckDB reference results
The generated engine uses a hot-reload architecture: loader, builder, and query executors are compiled as shared libraries and reloaded without restarting the host process.
- Linux (x86-64)
- C++ toolchain (
gcc/clang) - Python 3.10+
uvpackage manager- Apache Arrow and Parquet development libraries
curl -LsSf https://astral.sh/uv/install.sh | shOptional: Setup Weights & Biases (wandb) account and API key for experiment tracking. You can sign up for free at https://wandb.ai/.
wget https://packages.apache.org/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
sudo apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
sudo apt update
sudo apt install -y libarrow-dev libparquet-dev parquet-toolsuv syncCreate a .env file with your API keys:
OPENAI_API_KEY=...
WANDB_ENTITY=... # Optional, e.g. "my-team"
WANDB_PROJECT=... # Optional, e.g. "bespoke-olap"Place TPC-H or CEB Parquet files in your artifacts directory (default: /mnt/labstore/bespoke_olap/). The path can be overridden with --base_parquet_dir.
source .venv/bin/activateThe conversation name resembles: storageplan{q_id}-{q_id}v{version}. For example, storageplan1-22v1 is a storage plan generated for TPC-H queries 1 and 22, version 1.
# TPC-H
python run_gen_storage_plan.py \
--conv storageplan1-22v1 \
--benchmark tpch \
--auto_u --auto_finish
# CEB
python run_gen_storage_plan.py \
--conv storageplan1a-11bv1 \
--benchmark ceb \
--auto_u --auto_finish(Optional) --auto_u and --auto_finish flags can be used to automatically approve prompts and finish the conversation when no more prompts remain, enabling a fully automated run. Use with caution, as it will skip all user interactions.
# TPC-H
python run_gen_base_impl.py \
--conv basef1-22v1 \
--benchmark tpch \
--auto_u --auto_finish
# CEB
python run_gen_base_impl.py \
--conv basef1a-11bv1 \
--benchmark ceb \
--auto_u --auto_finishConv name represents: basef{q_id}-{q_id}v{version}. For example, basef1-22v1 is a base implementation generated for TPC-H queries 1 and 22, version 1.
To run the optimizaiton loop, please specify the wandb run-id of the run producing the base implementation (see 3.).
The script will automatically look up the final snapshot created at the end of that conversation and load this git snapshot automatically.
I.e. any past run can be loaded as a starting point for the optimization loop, as long as the final snapshot of that run is available in the git cache. This allows you to easily continue and optimize from any past run, or even share runs across machines by sharing the git snapshot cache (see "Remote snapshot cache" below).
Store the wandb run-id in the run_optim_loop.py header.
# TPC-H
python run_optim_loop.py \
--conv runoptim1-22v1 \
--bespoke_storage \
--benchmark tpch \
--auto_u --auto_finish
# CEB
python run_optim_loop.py \
--conv runoptim1a-11bv1 \
--bespoke_storage \
--benchmark ceb \
--auto_u --auto_finishConversation names are used to organize and track runs. They first create separate log-files but also identify traces, snapshots, and metrics in wandb. Further they reference the queries for which an engine is generated and optimized, as well as the version number for the generated engine. Hence they have to be unique - this is also enforced by the system. Usually naming convensions (conversation name prefixes) are enforced by the scripts.
python main.py manual \
--conv_name <name> \
--query_list <q_ids> \
--benchmark tpchpython -m benchmark --systems bespoke,duckdb --snapshots <hash1,hash2,...> --scale_factors 1,5,20 --benchmark tpchSee Benchmarking guide for details and additional examples.
Common arguments shared across entry points: (however the recommend to use the prepared scripts/steps listed above, which have the appropriate arguments pre-configured)
| Argument | Default | Description |
|---|---|---|
--benchmark |
tpch |
Benchmark to use (tpch or ceb). |
--conv / --conv_name |
(required) | Conversation name (auto-prefixed with benchmark name). |
--model |
gpt-5.2-codex |
LLM model ID to use. |
--artifacts_dir |
/mnt/labstore/... |
Directory for caches, logs, conversations, and Parquet data. |
--base_parquet_dir |
(artifacts_dir) | Base directory for Parquet files. |
--replay |
False |
Replay a prior run strictly from cache (fails on cache miss). |
--replay_cache |
False |
Reuse cached prompts; auto-advance until the first uncached LLM call. |
--auto_u |
False |
Auto-approve all prompts without user interaction. Use with caution. |
--auto_finish |
False |
Automatically finish when no more prompts remain in the conversation. Otherwise the user can continue prompting manually. |
--notify |
False |
Send a notification when the conversation requires user input. |
--disable_wandb |
False |
Skip wandb logging. |
--disable_repo_sync |
False |
Skip pushing snapshots to the remote git cache. |
--disable_valtool |
False |
Disable automatic validation after each compile+run. |
--start_snapshot |
None |
Git snapshot hash to start from. |
--storage_plan_snapshot |
None |
Git snapshot hash to load a storage plan from. |
--continue_run |
False |
Continue from the current working-dir state instead of a clean snapshot. |
The core orchestrator. It drives a conversation using four tools:
ApplyPatchTool— edits files in./output/ShellTool— runs shell commands in./output/compile_tool— compiles the C++ engine in./output/build/run_tool— compiles, runs queries, and validates results against DuckDB
ScriptedConversation— plays through a JSON list of pre-written prompts; the user can interject or replace prompts interactivelyOptimizationConversation— self-steering loop that reads speedup metrics and decides when to continue, revert, or stop
Accepted prompts are persisted to a JSON file in artifacts_dir/conversations/ so runs can be replayed exactly.
The template and host process for the generated OLAP engine. The agent generates and modifies:
loader_impl.{cpp,hpp}— loads Parquet data intoParquetTablesbuilder_impl.{cpp,hpp}— transformsParquetTablesinto a customDatabaselayoutquery_impl.{cpp,hpp}— executes individual queries againstDatabasedb.cpp— host process; detects.sochanges and hot-reloads without restarting
Multi-layer caching for reproducibility:
- LLM cache — hashes requests and stores/replays responses from disk
- Shell cache — caches shell command outputs keyed by working-dir snapshot and command
- Compaction cache — caches context compaction results
GitSnapshotter— manages a git repo inside./output/to snapshot and restore the working directory between agent turns. Named snapshots are stored as git refs (refs/snapshots/*).
To clear caches, delete the relevant subdirectories under artifacts_dir/cache/ and ./output/.git.
QueryValidator runs the generated engine at multiple scale factors and compares results against DuckDB. Invoked automatically by run_tool after each compile+run.
Separate from the agent loop. See benchmark/README.md.
watch -n1 -d ./misc/get_db_procs.shduckdb :memory:.timer on
PRAGMA threads=1;
CREATE TABLE orders AS SELECT * FROM read_parquet('orders.parquet');
CREATE TABLE lineitem AS SELECT * FROM read_parquet('lineitem.parquet');
-- ... add other tables as neededTo share snapshots across machines, set up a bare git repository and start a git daemon:
git init --bare bespoke_cache.git
touch bespoke_cache.git/git-daemon-export-ok
git daemon \
--base-path=./ \
--export-all \
--enable=receive-pack \
--reuseaddr \
--verboseThe cache URL is git://<hostname>/bespoke_cache.git. Pass it via the appropriate config option, or leave it unset to use only the local snapshot cache (with --disable_repo_sync).