feat: reimplement dropped features (reload, health, tenant guard) #81
Conversation
Analyzed existing codebase against all 4 plans:
- TenantGuard: ~50% exists (context, service, config, decorators)
- Dynamic Reload: ~25% exists (observer, field tracking, config providers)
- Aggregate Health: ~15% exists (reverseproxy health checker)
- BDD/Contract Testing: ~65% exists (121 tests, contract CLI, CI)

Revised checklists to only cover remaining work.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… emission

Implements the HealthProvider interface, HealthReport/AggregatedHealth types, provider adapters (simple, static, composite), and AggregateHealthService with fan-out evaluation, panic recovery, cache TTL, temporary error detection, and CloudEvent emission on status changes. 17 tests, including concurrency tests, verified under the race detector.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Introduces TenantGuard interface and StandardTenantGuard implementation with strict/lenient/disabled modes, whitelist support, ring buffer violation tracking, and CloudEvents emission on violations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…and rollback

Add Reloadable interface to module.go for modules that support runtime config reloading. Implement ReloadOrchestrator with single-flight execution, exponential backoff circuit breaker, reverse-order rollback on partial failure, and CloudEvents emission for the reload lifecycle.

New files:
- reload.go: ChangeType, ConfigChange, FieldChange, ConfigDiff, ReloadTrigger types
- reload_orchestrator.go: ReloadOrchestrator with queue, circuit breaker, rollback
- reload_test.go: comprehensive tests (ConfigDiff, orchestrator, rollback, circuit breaker, concurrency)

Also fix pre-existing WithLogger name collision in health_service.go -> WithHealthLogger.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
📋 API Contract Changes Summary

✅ No breaking changes detected - only additions and non-breaking modifications

Changed Components (contract diff saved to artifacts/diffs/<name>.json for each):
- Core Framework (core.json)
- Module: auth
- Module: cache
- Module: chimux
- Module: database
- Module: eventbus
- Module: eventlogger
- Module: httpclient
- Module: httpserver
- Module: jsonschema
- Module: letsencrypt
- Module: logmasker
- Module: reverseproxy
- Module: scheduler

Artifacts: 📁 Full contract diffs and JSON artifacts are available in the workflow artifacts.
Pull request overview
Reintroduces previously dropped framework capabilities by adding new core primitives for dynamic configuration reloads, aggregate health reporting, and tenant isolation enforcement, plus associated tests and plan updates.
Changes:
- Add `Reloadable` + reload diff/change types and a `ReloadOrchestrator` with queueing, single-flight, rollback, circuit-breaker, and reload lifecycle events.
- Add aggregate health types (`HealthProvider`, `HealthStatus`, reports) and `AggregateHealthService` with concurrent fan-out, caching, panic recovery, and health events.
- Add `TenantGuard` types + `StandardTenantGuard` with modes, whitelist, ring-buffer violation history, tenant violation events, and builder options.
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| tenant_guard.go | Adds TenantGuard interfaces/types and default implementation with violation tracking + event emission. |
| tenant_guard_test.go | Adds unit + concurrency tests for tenant guard modes, whitelist, ring buffer, deep copy. |
| builder.go | Adds tenant guard builder options and attempts to construct a guard during build. |
| errors.go | Adds tenant guard sentinel errors. |
| observer.go | Adds new CloudEvent type constants for tenant violations, config reload lifecycle, and health. |
| module.go | Adds the Reloadable optional module interface. |
| reload.go | Adds reload diff/change models (ConfigDiff, FieldChange, triggers) and helper methods. |
| reload_orchestrator.go | Implements reload orchestrator (queue, CAS single-flight, rollback, circuit breaker, events). |
| reload_test.go | Adds tests for diff helpers and orchestrator behavior (success, rollback, circuit breaker, concurrency, noop). |
| health.go | Adds health models (HealthStatus, HealthProvider, reports) and provider adapters. |
| health_service.go | Implements concurrent aggregate health checks with caching and health event emission. |
| health_test.go | Adds tests for providers, aggregation semantics, caching, panic recovery, events, concurrency. |
| docs/plans/dynamic-reload.md | Updates plan doc to reflect current state and remaining items post-reset. |
| docs/plans/aggregate-health.md | Updates plan doc to reflect current state and remaining items post-reset. |
| docs/plans/tenant-guard.md | Updates plan doc to reflect current state and remaining items post-reset. |
| docs/plans/bdd-contract-testing.md | Updates plan doc to reflect existing infra and remaining contract work. |
- Replace dynamic errors.New() with sentinel errors (err113)
- Add StatusUnknown case to exhaustive switch (exhaustive)
- Derive request context from parent context (contextcheck)
- Wrap interface method error return (wrapcheck)
- Fix gofmt alignment in builder.go, contract_verifier.go, observer.go
- Update letsencrypt go.sum for httpserver@v1.12.0 checksum

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reload orchestrator:
- Sort targets by module name for deterministic reload/rollback order
- Make Stop() idempotent with sync.Once, reject requests after stop
- Move noop check before started event to avoid misleading event stream
- Apply default 30s timeout when ReloadTimeout() returns <= 0
- Add doc comment noting application integration is a follow-up

Health service:
- Switch from *log.Logger to modular.Logger for consistent structured logging
- Return deep copies from Check() to prevent cache mutation by callers
- Use NewCloudEvent helper for proper event ID/specversion
- Prefer StatusUnhealthy over StatusUnknown on tie-breaks

Tenant guard:
- Switch from *log.Logger to modular.Logger for consistent structured logging
- Deep-copy whitelist in constructor, convert to set for O(1) lookups
- Use NewCloudEvent helper for proper event ID/specversion
- Use structured logging (Warn with key-value pairs) instead of Printf
- Wire guard into application service registry via builder

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add ErrReloadStopped for stopped orchestrator (distinct from channel full)
- Guard against nil logger with nopLogger fallback in NewReloadOrchestrator
- Fix BenchmarkReload to call processReload directly instead of queue+drain
- Update rollback test to assert deterministic sorted order

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report: ❌ Patch coverage is …
- reload_orchestrator: add recover guard in RequestReload to prevent panic on send-to-closed-channel race with Stop()
- builder: use app.RegisterService() instead of direct SvcRegistry map mutation for tenant guard registration
- reload_contract_bdd_test: make rollback assertion deterministic now that targets are sorted by name

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```go
// worstStatus returns the worse of two health statuses.
// StatusUnknown is treated as StatusUnhealthy for aggregation purposes.
// When both normalize to the same severity, StatusUnhealthy is preferred
// over StatusUnknown.
func worstStatus(a, b HealthStatus) HealthStatus {
	ar := normalizeForAggregation(a)
	br := normalizeForAggregation(b)
	if ar > br {
		return a
	}
	if br > ar {
		return b
	}
	// Tie-break: prefer StatusUnhealthy over StatusUnknown
	if a == StatusUnknown && b == StatusUnhealthy {
		return b
	}
	if b == StatusUnknown && a == StatusUnhealthy {
		return a
	}
	return a
}
```
worstStatus is documented to treat StatusUnknown as StatusUnhealthy for aggregation, but it returns the original input status when it “wins” the comparison. For example, worstStatus(StatusHealthy, StatusUnknown) returns StatusUnknown, which allows aggregated readiness/health to remain "unknown" instead of "unhealthy". To match the doc, return a normalized status (or explicitly map Unknown→Unhealthy before returning).
Fixed in d29ee4a. worstStatus now maps StatusUnknown → StatusUnhealthy in the return value, so aggregated health never reports "unknown" — consistent with the documented "treated as Unhealthy" semantics.
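The fixed semantics can be sketched in isolation. This is a minimal sketch, assuming an enum ordered by increasing severity and the helper name `normalizeForAggregation` from the quoted code; it is not the PR's exact implementation:

```go
package main

import "fmt"

// HealthStatus values, ordered by increasing severity (assumed ordering).
type HealthStatus int

const (
	StatusUnknown HealthStatus = iota
	StatusHealthy
	StatusDegraded
	StatusUnhealthy
)

// normalizeForAggregation treats Unknown as Unhealthy so an unreported
// provider can never make the aggregate look better than it is.
func normalizeForAggregation(s HealthStatus) HealthStatus {
	if s == StatusUnknown {
		return StatusUnhealthy
	}
	return s
}

// worstStatus returns the worse of two statuses, already normalized,
// so aggregated health never reports "unknown".
func worstStatus(a, b HealthStatus) HealthStatus {
	ar, br := normalizeForAggregation(a), normalizeForAggregation(b)
	if br > ar {
		return br
	}
	return ar
}

func main() {
	// Unknown now surfaces as Unhealthy in the aggregate.
	fmt.Println(worstStatus(StatusHealthy, StatusUnknown) == StatusUnhealthy) // true
}
```

Returning the normalized value (rather than the original input) is what closes the gap between the doc comment and the behavior.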
reload_test.go (outdated):

```go
// Wait for processing.
time.Sleep(100 * time.Millisecond)
```
These reload orchestrator tests rely on fixed time.Sleep(...) delays to wait for async processing. This can be flaky on slower/contended CI runners and can also slow the suite unnecessarily. Prefer deterministic synchronization (e.g., a channel/WaitGroup in the mock Reloadable or polling until a condition/event is observed with a bounded timeout) so the test proceeds as soon as the reload completes.
Fixed in d29ee4a. Replaced all time.Sleep waits with a waitFor polling helper that checks the condition every 5ms with a 2s timeout. Tests are now deterministic and CI-safe regardless of runner speed.
```go
diff := rc.newDiff()
rc.reloadErr = rc.orchestrator.RequestReload(rc.ctx, ReloadManual, diff)
time.Sleep(200 * time.Millisecond)
return nil
```
These Godog step implementations use time.Sleep(...) to wait for asynchronous reload processing. This can introduce intermittent failures and make the BDD suite sensitive to timing. Consider synchronizing on an explicit completion signal (e.g., wait until the expected event type is observed, with a timeout) instead of sleeping a fixed duration.
Fixed in d29ee4a. Replaced all time.Sleep calls with bddWaitForEvent and bddWaitForCalls polling helpers that wait for the expected event type or call count with a 2s timeout. No more timing-sensitive assertions.
reload_orchestrator.go (outdated):

```go
// Derive a context from the start context. If the request carries
// a deadline, apply it to the parent context.
rctx := ctx
if req.Ctx != nil {
	if deadline, ok := req.Ctx.Deadline(); ok {
		dctx, cancel := context.WithDeadline(ctx, deadline)
		rctx = dctx
		defer cancel()
	}
}
```
In Start(), the per-request deadline context uses defer cancel() inside the loop. Those defers won’t run until the goroutine exits, so timers/contexts accumulate and can leak resources under sustained reload traffic. Call cancel immediately after processReload returns (or scope the context creation to a helper function) rather than deferring inside the loop.
Fixed in d29ee4a. Extracted handleReload() method that creates a scoped context with context.WithCancel and uses defer cancel() within the method — context is released immediately after each reload, no accumulation in the loop.
reload_orchestrator.go (outdated):

```go
// Derive a context from the start context. If the request carries
// a deadline, apply it to the parent context.
rctx := ctx
if req.Ctx != nil {
	if deadline, ok := req.Ctx.Deadline(); ok {
		dctx, cancel := context.WithDeadline(ctx, deadline)
		rctx = dctx
		defer cancel()
	}
}
```
RequestReload accepts a context, but Start() only copies the deadline from req.Ctx and otherwise runs reloads under the Start() parent context. This means request cancellation/values won’t propagate to module Reload calls, which is surprising for callers and makes it impossible to cancel a queued reload. Consider deriving rctx from req.Ctx (e.g., rctx := req.Ctx with a fallback to ctx) and only overriding the parent when needed.
Fixed in d29ee4a. handleReload() now uses req.Ctx as the base context when provided (falls back to the Start parent context otherwise), so caller cancellation and values fully propagate to module Reload() calls. A context.WithCancel wrapper ensures cleanup on completion.
- reload_orchestrator: extract handleReload() to properly scope context lifetime (no defer cancel leak in loop), propagate req.Ctx values and cancellation to module Reload calls instead of only copying deadline
- health_service: worstStatus now maps StatusUnknown → StatusUnhealthy in output, consistent with documented aggregation behavior
- reload_test: replace all time.Sleep waits with polling waitFor helper for deterministic CI-safe synchronization
- reload_contract_bdd_test: replace time.Sleep waits with event/call polling helpers (bddWaitForEvent, bddWaitForCalls)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep context chain rooted in parentCtx and apply request deadline via context.WithDeadline instead of swapping the base context. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
contract_verifier.go (outdated):

```go
// checkReloadIdempotent calls Reload with empty changes twice and returns an error
// if either call fails.
func (v *StandardContractVerifier) checkReloadIdempotent(module Reloadable) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	if err := module.Reload(ctx, nil); err != nil {
		return fmt.Errorf("first call: %w", err)
	}
	if err := module.Reload(ctx, nil); err != nil {
		return fmt.Errorf("second call: %w", err)
	}
	return nil
}
```
checkReloadIdempotent calls module.Reload(ctx, nil) directly. If a module violates the contract by ignoring ctx cancellation/timeouts (the very thing this verifier is meant to detect), VerifyReloadContract can hang indefinitely. Consider running Reload in a goroutine and selecting on ctx.Done() so the verifier returns a violation instead of blocking forever.
Fixed in e42a1d4. checkReloadIdempotent now runs each Reload call in a goroutine with a select on ctx.Done(), so a misbehaving module that ignores context cancellation causes a timeout violation instead of blocking the verifier.
```go
for range len(providers) {
	result := <-ch
```
AggregateHealthService.Check waits for exactly len(providers) results using a blocking receive, but it never selects on ctx.Done(). If a provider blocks or ignores context cancellation, Check can hang indefinitely even after the caller context is cancelled/deadline exceeded. Consider receiving results with a select on ctx.Done() and returning ctx.Err() (or a partial result) when the context is done to prevent stuck callers.
Fixed in e42a1d4. Check() now uses select on both the result channel and ctx.Done() when collecting provider results. If the caller context is cancelled or times out, Check returns ctx.Err() immediately instead of blocking.
```go
// Emit event using NewCloudEvent helper (sets ID, specversion, time)
if g.subject != nil {
	event := NewCloudEvent(EventTypeTenantViolation, "com.modular.tenant.guard", violation, nil)
	_ = g.subject.NotifyObservers(ctx, event)
}
```
ValidateAccess ignores the error returned by Subject.NotifyObservers. This can silently drop tenant violation events (e.g., misconfigured observers) and makes troubleshooting harder. Consider handling the returned error (at least log it via g.logger when available, or propagate it depending on desired semantics).
Fixed in e42a1d4. The error from NotifyObservers is now logged via g.logger.Warn() when a logger is configured, instead of being silently discarded.
```go
// TenantGuardStrict blocks the operation and returns an error on violation.
TenantGuardStrict TenantGuardMode = iota
// TenantGuardLenient logs the violation but allows the operation to proceed.
TenantGuardLenient
// TenantGuardDisabled performs no validation at all.
```
The TenantGuardLenient doc comment says it "logs the violation", but logging is conditional on both LogViolations and a non-nil logger being configured. Either update the comment to reflect that logging is best-effort/optional, or ensure a default logger is always set when lenient mode is enabled (e.g., in builder wiring).
Fixed in e42a1d4. Updated the TenantGuardLenient doc comment to clarify that logging is best-effort and conditional on LogViolations being true and a logger being configured.
```go
func (rc *ReloadBDDContext) theCircuitBreakerShouldEventuallyReset() error {
	// Manually reset the circuit breaker state to simulate backoff elapsed.
	rc.orchestrator.cbMu.Lock()
	rc.orchestrator.circuitOpen = false
	rc.orchestrator.failures = 0
	rc.orchestrator.cbMu.Unlock()
```
The BDD step "the circuit breaker should eventually reset" manually clears orchestrator.circuitOpen/failures, so it doesn't actually validate the backoff/reset behavior implemented in isCircuitOpen/backoffDuration. To make this contract meaningful, simulate elapsed backoff by adjusting lastFailure to sufficiently in the past (or inject a clock) and then assert that RequestReload is accepted and the circuit is reset without directly mutating circuitOpen/failures.
Fixed in e42a1d4. The BDD step now simulates elapsed backoff by backdating lastFailure to time.Now().Add(-circuitBreakerMaxDelay - 1s), which exercises the actual isCircuitOpen()/backoffDuration() logic rather than bypassing it.
reload_orchestrator.go (outdated):

```go
// When the request carries a context with a deadline, that deadline is applied to
// the parent context so that both caller deadlines and Start() cancellation are
// respected. Request context values are not propagated to keep the context chain
// rooted in parentCtx (required by contextcheck linter).
func (o *ReloadOrchestrator) handleReload(parentCtx context.Context, req ReloadRequest) {
	rctx, cancel := context.WithCancel(parentCtx)
	defer cancel()

	// If the request carries a deadline, tighten the context with it.
	if req.Ctx != nil {
		if deadline, ok := req.Ctx.Deadline(); ok {
			rctx, cancel = context.WithDeadline(rctx, deadline) //nolint:contextcheck // deadline from request
			defer cancel()
		}
	}
```
handleReload only uses req.Ctx to copy over a deadline; it does not respect req.Ctx cancellation when there is no deadline (or when the caller cancels early). Since RequestReload accepts a ctx, callers may reasonably expect cancellation to abort in-flight reload work. Consider wiring req.Ctx.Done() into rctx cancellation (without propagating values) so both deadlines and explicit cancellation are honored.
Fixed in e42a1d4. handleReload now spawns a background goroutine that watches req.Ctx.Done() and cancels the reload context when the request context is cancelled. This ensures both deadlines and explicit cancellation from callers propagate to module Reload() calls.
- contract_verifier: guard checkReloadIdempotent with goroutine+select so a misbehaving module cannot block the verifier indefinitely
- health_service: Check now selects on ctx.Done() when collecting provider results, returning ctx.Err() on cancellation/timeout
- tenant_guard: log NotifyObservers errors instead of silently dropping; update TenantGuardLenient doc to clarify logging is best-effort
- reload_contract_bdd_test: simulate elapsed backoff by backdating lastFailure instead of manually clearing circuit breaker state
- reload_orchestrator: propagate req.Ctx cancellation via background goroutine watching req.Ctx.Done(), not just deadline

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add ErrReloadTimeout sentinel error, use in contract verifier
- Wrap ctx.Err() in health_service.Check for wrapcheck compliance

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```go
// In strict mode, return error
if g.config.Mode == TenantGuardStrict {
	return ErrTenantIsolationViolation
}
```
In strict mode ValidateAccess always returns ErrTenantIsolationViolation, even when the violation type is MissingContext/InvalidContext. Since ErrTenantContextMissing is defined, consider returning that (at least for MissingContext) so callers can distinguish missing tenant context from a true cross-tenant isolation violation.
```go
go func() {
	ch <- result{err: module.Reload(ctx, nil)}
}()
```
Contract verification currently calls module.Reload without panic recovery (both in the goroutine here and in the direct call in checkReloadRespectsCancel). A panicking Reload implementation will crash the verifier/test run instead of being reported as a contract violation; wrap Reload invocations with recover and surface a violation describing the panic.
```go
go func() {
	reports, err := provider.HealthCheck(ctx)
	ch <- result{reports, err}
}()
```
VerifyHealthContract runs provider.HealthCheck in a goroutine (and again later for cancellation) without panic recovery. A panicking provider will crash the verifier/test suite; add a recover guard and report a contract violation when HealthCheck panics instead of letting the panic escape.
Summary
Reimplements 3 of the 4 features dropped during the CrisisTextLine/modular upstream reset:
Dynamic Reload Manager
- `Reloadable` interface for modules that support live config reloading
- `ReloadOrchestrator` with buffered request queue, atomic CAS single-flight, circuit breaker (exponential backoff), and rollback on partial failure
- `ConfigDiff` with change tracking, prefix filtering, sensitive field redaction

Aggregate Health Service
- `HealthProvider` interface with `HealthStatus` enum (Unknown/Healthy/Degraded/Unhealthy)
- `AggregateHealthService` with concurrent fan-out, panic recovery, TTL cache (250ms), force-refresh context key

TenantGuard Enforcement
- `TenantGuard` interface with `ValidateAccess()` for 3 modes: Strict (block), Lenient (log), Disabled (no-op)
- Builder options: `WithTenantGuardMode()`, `WithTenantGuardConfig()`

Revised Plans
Test plan
- `go build ./...` passes
- `go vet ./...` passes
- Tests pass with the `-race` flag

🤖 Generated with Claude Code