Summary
When a shard is removed from a CHI (e.g. reducing `shardsCount` from 2 to 1), the operator does not clean up ZooKeeper entries for the removed shard's replicated tables. After the reconcile completes, paths like `/clickhouse/{cluster}/tables/{removed_shard_index}/...` remain in ZooKeeper.
This is confirmed by the `test_010014_0` "Remove shard" step, which has been XFAIL'd:

```python
with Then(f"Shard is removed from {self.context.keeper_type}", flags=XFAIL):
    out = clickhouse.query_with_error(
        chi_name,
        f"SELECT count() FROM system.zookeeper WHERE path ='/clickhouse/{cluster}/tables/1/default'",
    )
    assert "DB::Exception: No node" in out or out == "0"
```

The assertion always fails: `count()` returns 1 (or more) instead of 0 / "No node".
Root Cause
Issue 1: dropZKReplicas shard callback is a no-op
In `pkg/controller/chi/worker-deleter.go`, line 64:

```go
func (w *worker) dropZKReplicas(ctx context.Context, cr *api.ClickHouseInstallation) {
	cr.EnsureRuntime().ActionPlan.WalkRemoved(
		func(cluster api.ICluster) {},
		func(shard api.IShard) {}, // ← no-op: shard removal is silently ignored
		func(host *api.Host) {
			_ = w.dropZKReplica(ctx, host, NewDropReplicaOptions().SetRegularDrop())
		},
	)
}
```

When `shardsCount` decreases, `messagediff.DeepDiff` detects the removal at the shard level and adds the `*ChiShard` to `specDiff.Removed`. Since `*ChiShard` implements `IShard`, `shardFunc` is called, which is a no-op. The `hostFunc` callback for individual hosts is never reached.
Issue 2: dropZKReplica uses hosts from the removed shard itself
Even if hosts were enumerated individually, dropZKReplica resolves the "host to run on" as:
```go
if shard := hostToDrop.GetShard(); shard != nil {
	hostToRunOn = shard.FirstHost() // first host of the REMOVED shard
}
```

When an entire shard is removed, `FirstHost()` of the removed shard is itself being deleted. By the time `dropZKReplicas` runs (after `clean()`), those pods are already gone, so the SQL connection fails silently.
Issue 3: SYSTEM DROP REPLICA is wrong tool for whole-shard removal
Even if `hostToRunOn` were a surviving host from another shard, `SYSTEM DROP REPLICA 'host-1-x'` run on shard 0 would not clean shard 1's ZK paths: shard 0's ClickHouse does not have shard 1's tables in its `system.replicas`.
Correct Fix
The proper approach is to run `DROP TABLE IF EXISTS ... SYNC` on the removed shard's hosts before `clean()` deletes their pods. This is the same mechanism used during full CHI deletion (`deleteTables` / `HostDropTables`), and it correctly:
- Removes the replica's ZK entries
- If it's the last replica, cleans the entire table ZK subtree
Concretely, in `worker-reconciler-chi.go`, before `w.clean(ctx, new)`:

```go
w.dropTablesFromRemovedHosts(ctx, new) // NEW: drop tables on removed hosts while pods still running
w.clean(ctx, new)
// ...
w.dropZKReplicas(ctx, new)
```

Here `dropTablesFromRemovedHosts` walks `ActionPlan.WalkRemoved` and calls `deleteTables()` on each host in removed shards (respecting the existing `HostCanDeleteAllPVCs` guard for `reclaimPolicy: Retain`).
Additionally, `dropZKReplicas` should implement the `shardFunc` callback to handle removed shards explicitly.
Impact
- ZooKeeper accumulates stale table paths every time a shard is scaled down
- If the same shard index is later re-added, it may see stale ZK metadata
- `SYSTEM DROP REPLICA` is never logged for shard removal (confirmed by the query-log check in `test_010014_0`)
Related
- `test_010014_0` "Remove shard" step XFAILs:
  - "Shard is removed from zookeeper"
  - "system.query_log should contain SYSTEM DROP REPLICA statements"
- Full CHI deletion (`deleteCHIProtocol`) does NOT have this issue: it explicitly calls `deleteTables` on every host before deleting K8s resources
- Single-replica removal (scaling replicas 2→1 within a shard) also has a related issue: `dropZKReplica` correctly gets `shard.FirstHost()` as the surviving replica, but that only works when at least one replica remains in the shard