ZooKeeper entries not cleaned up when a shard is removed #1927

@sunsingerus

Description

Summary

When a shard is removed from a CHI (e.g. reducing shardsCount from 2 to 1), the operator does not clean up ZooKeeper entries for the removed shard's replicated tables. After the reconcile completes, paths like /clickhouse/{cluster}/tables/{removed_shard_index}/... remain in ZooKeeper.

This is confirmed by the test_010014_0 "Remove shard" step, which has been marked XFAIL:

with Then(f"Shard is removed from {self.context.keeper_type}", flags=XFAIL):
    out = clickhouse.query_with_error(
        chi_name,
        f"SELECT count() FROM system.zookeeper WHERE path ='/clickhouse/{cluster}/tables/1/default'",
    )
    assert "DB::Exception: No node" in out or out == "0"

The assertion always fails: count() = 1 (or more) instead of 0 / "No node".

Root Cause

Issue 1: dropZKReplicas shard callback is a no-op

pkg/controller/chi/worker-deleter.go line 64:

func (w *worker) dropZKReplicas(ctx context.Context, cr *api.ClickHouseInstallation) {
    cr.EnsureRuntime().ActionPlan.WalkRemoved(
        func(cluster api.ICluster) {},
        func(shard api.IShard) {},       // ← no-op: shard removal is silently ignored
        func(host *api.Host) {
            _ = w.dropZKReplica(ctx, host, NewDropReplicaOptions().SetRegularDrop())
        },
    )
}

When shardsCount decreases, messagediff.DeepDiff detects the removal at the shard level and adds the *ChiShard to specDiff.Removed. Since *ChiShard implements IShard, shardFunc is called — which is a no-op. The hostFunc callback for individual hosts is never reached.

Issue 2: dropZKReplica uses hosts from the removed shard itself

Even if hosts were enumerated individually, dropZKReplica resolves the "host to run on" as:

if shard := hostToDrop.GetShard(); shard != nil {
    hostToRunOn = shard.FirstHost()   // first host of the REMOVED shard
}

When an entire shard is removed, FirstHost() of the removed shard is itself among the hosts being deleted. By the time dropZKReplicas runs (after clean()), those pods are already gone, so the SQL connection fails silently.

Issue 3: SYSTEM DROP REPLICA is the wrong tool for whole-shard removal

Even if hostToRunOn were a surviving host from another shard, SYSTEM DROP REPLICA 'host-1-x' run on shard 0 would not clean shard 1's ZK paths, because shard 0's ClickHouse server does not list shard 1's tables in system.replicas.

Correct Fix

The proper approach is to run DROP TABLE IF EXISTS ... SYNC on the removed shard's hosts before clean() deletes their pods. This is the same mechanism used during full CHI deletion (deleteTables / HostDropTables), and it correctly:

  1. Removes the replica's ZK entries
  2. If it's the last replica, cleans the entire table ZK subtree

Concretely, in worker-reconciler-chi.go, before w.clean(ctx, new):

w.dropTablesFromRemovedHosts(ctx, new)   // NEW: drop tables on removed hosts while pods still running
w.clean(ctx, new)
...
w.dropZKReplicas(ctx, new)

Where dropTablesFromRemovedHosts walks ActionPlan.WalkRemoved and calls deleteTables() on each host in removed shards (respecting the existing HostCanDeleteAllPVCs guard for reclaimPolicy: Retain).

Additionally, dropZKReplicas should implement the shardFunc callback to handle removed shards explicitly.

Impact

  • ZooKeeper accumulates stale table paths every time a shard is scaled down
  • If the same shard index is later re-added, it may see stale ZK metadata
  • SYSTEM DROP REPLICA is never logged for shard removal (confirmed by query log check in test_010014_0)

Related

  • test_010014_0 "Remove shard" step XFAILs:
    • "Shard is removed from zookeeper"
    • "system.query_log should contain SYSTEM DROP REPLICA statements"
  • Full CHI deletion (deleteCHIProtocol) does NOT have this issue — it explicitly calls deleteTables on every host before deleting K8s resources
  • Single-replica removal (scaling replicas 2→1 within a shard) also has a related issue: dropZKReplica correctly gets shard.FirstHost() as the surviving replica, but that only works when at least one replica remains in the shard
