81 changes: 79 additions & 2 deletions docs/geneva/jobs/performance.mdx
@@ -1,12 +1,47 @@
---
title: Distributed Job Performance and Cluster Sizing
sidebarTitle: Performance and Sizing
description: Learn how to tune Geneva distributed job performance by scaling compute resources and balancing write bandwidth.
icon: chart-line
---

When Geneva runs in distributed mode, jobs are deployed to a Ray cluster that KubeRay dynamically provisions on Kubernetes. Job execution time depends on having sufficient CPU/GPU resources for *computation* and sufficient *write bandwidth* to store the output values. Tuning a job's performance boils down to configuring the table or cluster resources.

## Geneva defaults

Geneva sets the following defaults when creating a KubeRay cluster via `GenevaClusterBuilder`. These apply as both Kubernetes resource requests and limits.

### Head Node
| Resource | Default |
| --- | --- |
| CPU | 4 |
| Memory | 8 GiB |
| GPU | 0 |
| Node Selector | `geneva.lancedb.com/ray-head: true` |
| Service Account | `geneva-service-account` |

The head node runs the Ray GCS (Global Control Store), the driver task, and the dashboard. It does not run UDF applier actors.

### CPU Workers
| Resource | Default |
| --- | --- |
| CPU | 4 |
| Memory | 8 GiB |
| GPU | 0 |
| Node Selector | `geneva.lancedb.com/ray-worker-cpu: true` |
| Replicas | 1 (min: 0, max: 100) |
| Idle Timeout | 60 seconds |

### GPU Workers
| Resource | Default |
| --- | --- |
| CPU | 8 |
| Memory | 16 GiB |
| GPU | 1 |
| Node Selector | `geneva.lancedb.com/ray-worker-gpu: true` |
| Replicas | 1 (min: 0, max: 100) |
| Idle Timeout | 60 seconds |

## Scaling computation resources

Geneva jobs split computational work into smaller batches that are assigned to *tasks* and distributed across the cluster. As each task completes, it writes its output to a checkpoint file. If a job is interrupted or re-run, Geneva checks whether a checkpoint for each computation already exists and only schedules the computations that are missing.
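
The checkpoint-and-skip behavior described above can be pictured with a small sketch (purely illustrative; this is not Geneva's internal code, and the file-per-batch layout is an assumption):

```python
# Illustrative checkpoint-skip logic for resumable batch processing.
from pathlib import Path


def process_batches(batches, checkpoint_dir: Path, compute):
    """Run `compute` over (batch_id, batch) pairs, skipping checkpointed work."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for batch_id, batch in batches:
        ckpt = checkpoint_dir / f"{batch_id}.done"
        if ckpt.exists():
            # A checkpoint already exists: reuse it instead of recomputing.
            results[batch_id] = ckpt.read_text()
            continue
        out = compute(batch)        # run the UDF for this batch
        ckpt.write_text(str(out))   # checkpoint the result for future runs
        results[batch_id] = str(out)
    return results
```

Re-running `process_batches` with the same checkpoint directory performs no recomputation, which is the property that makes interrupted jobs cheap to resume.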
@@ -49,6 +84,48 @@ Typical per-row sizes:
- Videos: ~10MB–200MB
- Embeddings: `dimension * data_type_size` bytes (e.g. float32 embeddings use 4 bytes per value, so a 1536-dim embedding is `1536 * 4 = 6144` bytes)
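
The per-row sizes above make it easy to estimate a job's total write volume. A minimal sketch (the helper names and the 10M-row figure are illustrative, assuming float32 embeddings):

```python
# Estimate the output volume of an embedding backfill; illustrative only.

def embedding_row_bytes(dimension: int, dtype_size: int = 4) -> int:
    """Bytes written per row for a dense embedding column (float32 = 4 bytes)."""
    return dimension * dtype_size


def total_output_gib(num_rows: int, row_bytes: int) -> float:
    """Total output volume in GiB for the whole job."""
    return num_rows * row_bytes / 2**30


row = embedding_row_bytes(1536)  # 1536 * 4 = 6144 bytes per row
print(round(total_output_gib(10_000_000, row), 2))  # ~57.22 GiB for 10M rows
```

Dividing the total volume by your storage backend's sustained write bandwidth gives a lower bound on job duration, independent of compute.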

#### Internal Actor Resource Overhead
In addition to UDF applier actors, each backfill job creates internal actors that consume resources:
|Component |CPU |Memory |Count|
|-------------|----|-------|-----|
|Driver task |0.1 |— |1 |
|JobTracker |0.1 |128 MiB|1 |
|Writer actors|0.1 |1 GiB |1 per applier|
|Queue actors |0 |— |1 per applier|
|Applier actors|UDF `num_cpus * intra_applier_concurrency`|UDF memory |`concurrency`|

#### Example
For a job with `concurrency=4`, `intra_applier_concurrency=2`, with a UDF that requests 1 CPU and 1GiB memory:

|Component |CPU |Memory |Count|
|--------------|-----|-------|-----|
|Driver task |0.1 |— |1 |
|JobTracker |0.1 |128 MiB|1 |
|Writer actors |0.1 |1 GiB |4 |
|Queue actors |0 |— |4 |
|Applier actors|1*2=2|1 GiB |4 |
|Total |0.1+0.1+4*0.1+4*2 = **8.6**|128 MiB + 4 * 1GiB + 4 * 1GiB = **8.125 GiB**||
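
The arithmetic above can be wrapped in a small helper for sizing your own jobs. A sketch (the overhead constants mirror the internal-actor table; the function itself is illustrative, not a Geneva API):

```python
# Estimate total cluster CPU and memory demand for a backfill job.
# Overhead values mirror the internal-actor table above; illustrative only.

GIB = 2**30
MIB = 2**20


def job_resources(concurrency: int, intra_applier_concurrency: int,
                  udf_cpus: float, udf_memory_bytes: int):
    """Return (total_cpus, total_memory_bytes) a job will request."""
    cpus = (
        0.1                      # driver task
        + 0.1                    # JobTracker
        + concurrency * 0.1      # writer actors (queue actors request 0 CPU)
        + concurrency * udf_cpus * intra_applier_concurrency  # applier actors
    )
    memory = (
        128 * MIB                # JobTracker
        + concurrency * 1 * GIB  # writer actors
        + concurrency * udf_memory_bytes  # applier actors
    )
    return cpus, memory


cpus, mem = job_resources(4, 2, udf_cpus=1, udf_memory_bytes=1 * GIB)
print(round(cpus, 1))  # 8.6
print(mem / GIB)       # 8.125
```

This reproduces the totals in the example table and makes it easy to see, for instance, that doubling `concurrency` roughly doubles both the CPU and memory footprint.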

### Overall Cluster Sizing

The right cluster size varies dramatically by workload. If you don't yet know how to estimate yours, the following cluster sizes are a reasonable starting point:

| Cluster Size | Head Node | Workers |
| --- | --- | --- |
| **Small** (Dev / CI) | 1 CPU, 8 GB | 4 x 2 CPU, 8 GB |
| **Medium** (Staging) | 2 CPU, 8 GB | 4 x 4 CPU, 8 GB |
| **Large** (Production) | 4 CPU, 8 GB | 8+ x 8 CPU, 16 GB |

### Validation Thresholds

The cluster builder validates memory configuration and warns or errors on suspicious values:

| Threshold | Value | Behavior |
| --- | --- | --- |
| GPU worker minimum memory | `< 4 GiB` | **Error** — build fails if below this |
| Large memory warning | `> 100 GB` | **Warning** — may exceed K8s node capacity |
| Memory-per-CPU ratio warning | `> 16 GiB per CPU` | **Warning** — unusual ratio, likely misconfigured |
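
The thresholds above can be expressed as a few lines of validation logic. A sketch (the threshold values come from the table; the function name and return shape are illustrative, not Geneva's actual API):

```python
# Illustrative memory validation mirroring the thresholds in the table above.
GIB = 2**30   # binary gibibyte, used for the GiB thresholds
GB = 10**9    # decimal gigabyte, used for the 100 GB warning


def validate_memory(memory_bytes: int, num_cpus: int, is_gpu_worker: bool):
    """Return (errors, warnings) for a worker memory configuration."""
    errors, warnings = [], []
    if is_gpu_worker and memory_bytes < 4 * GIB:
        errors.append("GPU workers need at least 4 GiB of memory")
    if memory_bytes > 100 * GB:
        warnings.append("memory request may exceed K8s node capacity")
    if memory_bytes / num_cpus > 16 * GIB:
        warnings.append("memory-per-CPU ratio above 16 GiB looks misconfigured")
    return errors, warnings


errs, warns = validate_memory(2 * GIB, num_cpus=1, is_gpu_worker=True)
print(errs)  # a GPU worker below the 4 GiB minimum fails the build
```

Errors abort the cluster build; warnings are advisory and worth checking against the capacity of your actual Kubernetes nodes.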

### GKE node pools

GKE with KubeRay can autoscale the number of VM nodes on demand. Limits on the resources provisioned are configured via [node pools](https://cloud.google.com/kubernetes-engine/docs/how-to/node-pools#scale-node-pool). Node pools can be managed to scale vertically (machine type) or horizontally (number of nodes).
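
For example, a node pool matching the CPU-worker node selector from the defaults table might be created with autoscaling bounds like this (the cluster name, pool name, and machine type are illustrative choices, not requirements):

```shell
# Create an autoscaling GKE node pool for Geneva CPU workers (values illustrative).
gcloud container node-pools create geneva-cpu-workers \
  --cluster=my-cluster \
  --machine-type=n2-standard-8 \
  --node-labels=geneva.lancedb.com/ray-worker-cpu=true \
  --enable-autoscaling --min-nodes=0 --max-nodes=10
```

The `--node-labels` flag makes the pool's nodes match the worker node selector, and the min/max bounds cap how far KubeRay-triggered autoscaling can grow the pool.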
10 changes: 10 additions & 0 deletions docs/geneva/jobs/troubleshooting.mdx
@@ -120,6 +120,16 @@ The client could not connect to the Ray cluster (e.g. after starting a KubeRay o

If you see errors about the client already being connected (e.g. when re-running a notebook or script), ensure you're not holding an old Ray client connection. Restart the kernel or process so Ray is re-initialized cleanly. Geneva disconnects the client when the context exits; leaving a context open or reusing a stale connection can cause this.

### Head node out of memory

If the head node is under-provisioned (e.g. 1 CPU / 2 GB), it can OOM when:

- Many workers connect and register with GCS
- The Ray dashboard accumulates metrics
- Object store spillover occurs

**Recommendation**: Use at least 4 CPU / 8 GiB for the head node in any non-trivial deployment. This is the current default.
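
To check whether the head pod is the culprit, the standard KubeRay node-type label can be used (assumes `metrics-server` is installed in the cluster for `kubectl top`):

```shell
# Inspect head-pod memory usage; KubeRay labels Ray pods by node type.
kubectl top pod -l ray.io/node-type=head

# Check whether a previous head container was OOMKilled.
kubectl get pods -l ray.io/node-type=head \
  -o jsonpath='{.items[*].status.containerStatuses[*].lastState.terminated.reason}'
```

If the second command prints `OOMKilled`, raise the head node's memory before re-running the job.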

---
## Serialization library or `attrs` version mismatch

55 changes: 55 additions & 0 deletions tests/py/test_geneva_defaults.py
@@ -0,0 +1,55 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors

# These tests assert the default resource values for Geneva KubeRay cluster nodes.
# If any of these values change, update the tables in:
# docs/geneva/jobs/performance.mdx (the "Geneva defaults" section)

import pytest


# TODO: remove skip once geneva 0.12.0 is on PyPI (head node defaults changed to 4 CPU / 8Gi)
@pytest.mark.skip(reason="requires geneva>=0.12.0")
def test_head_node_defaults():
# If these change, update the Head Node table in performance.mdx
from geneva.cluster.builder import KubeRayClusterBuilder

builder = KubeRayClusterBuilder()
assert builder._head_cpus == 4
assert builder._head_memory == "8Gi"
assert builder._head_gpus == 0
assert builder._head_node_selector == {"geneva.lancedb.com/ray-head": "true"}
assert builder._head_service_account == "geneva-service-account"


def test_cpu_worker_defaults():
# If these change, update the CPU Workers table in performance.mdx
from geneva.cluster.builder import CpuWorkerBuilder

worker = CpuWorkerBuilder()
assert worker._num_cpus == 4
assert worker._memory == "8Gi"
assert worker._node_selector == {"geneva.lancedb.com/ray-worker-cpu": "true"}
assert worker._replicas == 1
assert worker._min_replicas == 0
assert worker._max_replicas == 100
assert worker._idle_timeout_seconds == 60

# Confirm build produces 0 GPUs
config = worker.build()
assert config.num_gpus == 0


def test_gpu_worker_defaults():
# If these change, update the GPU Workers table in performance.mdx
from geneva.cluster.builder import GpuWorkerBuilder

worker = GpuWorkerBuilder()
assert worker._num_cpus == 8
assert worker._memory == "16Gi"
assert worker._num_gpus == 1
assert worker._node_selector == {"geneva.lancedb.com/ray-worker-gpu": "true"}
assert worker._replicas == 1
assert worker._min_replicas == 0
assert worker._max_replicas == 100
assert worker._idle_timeout_seconds == 60