diff --git a/.gitignore b/.gitignore index bd61081b..225e254d 100644 --- a/.gitignore +++ b/.gitignore @@ -1,4 +1,5 @@ # dev environment +.DS_Store .vscode .scratch docker/control-plane-dev/data diff --git a/docs/development/supported-services.md b/docs/development/supported-services.md new file mode 100644 index 00000000..681a1a91 --- /dev/null +++ b/docs/development/supported-services.md @@ -0,0 +1,1451 @@ +# Adding a Supported Service + +This guide explains how to add a new supported service type to the control plane. +MCP is the reference implementation throughout. All touch points are summarized in +the [checklist](#checklist-adding-a-new-service-type) at the end. + +> [!CAUTION] +> This document is subject to change. The MCP service is not yet fully +> implemented, and several areas are still in active design: +> +> - **Configuration delivery**: How config reaches service containers (env vars, +> config files, mounts, etc.) is still being designed and may end up being +> service-specific. +> - **Workflow refactoring**: Some of the workflow code (e.g., service +> provisioning) needs to be refactored based on PR feedback. This work is +> being coordinated with the SystemD migration tasks. +> +> Treat this as a snapshot of the current architecture, not a stable contract. + +## Overview + +A supported service is a containerized application deployed alongside a pgEdge +database. Services are declared in `DatabaseSpec.Services`, provisioned after +the database is available, and deleted when the database is deleted. + +One `ServiceSpec` produces N `ServiceInstance` records — one per `host_id` in +the spec. Each instance runs as its own Docker Swarm service on the database's +overlay network, with its own dedicated Postgres credentials. + +## Design Constraint: Services Depend on the Database Only + +The current architecture constrains every service to have exactly one +dependency: the parent database. There is no concept of service provisioning +ordering or dependencies between services. All services declared in a database +spec are provisioned independently and in parallel once the database is +available. + +This means: + +- A service cannot declare a dependency on another service +- There is no ordering guarantee between services (service A may start before + or after service B) +- A service cannot discover another service's hostname or connection info at + provisioning time + +This constraint keeps the model simple and parallelizable, but it will likely +need to be relaxed in the future. Scaling out the AI workbench will require +multi-container service types where one component depends on another (e.g., a +web client that proxies to an API server). Supporting this will require +introducing service-to-service dependencies, provisioning ordering, and +health-gated startup — effectively a dependency graph within the service layer. + +The data flow from API request to running container looks like this: + +```text +API Request (spec.services) + → Validation (API layer) + → Store spec in etcd + → UpdateDatabase workflow + → PlanUpdate sub-workflow + → For each (service, host_id): + → Resolve service image from registry + → Validate Postgres/Spock version compatibility + → Generate resource objects (network, user role, spec, instance, monitor) + → Merge service resources into EndState + → Compute plan (diff current state vs. desired state) + → Apply plan (execute resource Create/Update/Delete) + → Service container running +``` + +## API Layer + +The `ServiceSpec` Goa type is defined in `api/apiv1/design/database.go`. It +lives inside the `DatabaseSpec` as an array: + +```go +g.Attribute("services", g.ArrayOf(ServiceSpec), func() { ... }) +``` + +The `ServiceSpec` type has these attributes: + +| Attribute | Type | Description | +|-----------|------|-------------| +| `service_id` | `Identifier` | Unique ID for this service within the database | +| `service_type` | `String` (enum) | The type of service (e.g., `"mcp"`) | +| `version` | `String` | Semver (e.g., `"1.0.0"`) or `"latest"` | +| `host_ids` | `[]Identifier` | Which hosts should run this service (one instance per host) | +| `port` | `Int` (optional) | Host port to publish; `0` = random; omitted = not published | +| `config` | `MapOf(String, Any)` | Service-specific configuration | +| `cpus` | `String` (optional) | CPU limit; accepts SI suffix `m` (e.g., `"500m"`, `"1"`) | +| `memory` | `String` (optional) | Memory limit in SI or IEC notation (e.g., `"512M"`, `"1GiB"`) | + +To add a new service type, add its string value to the enum on `service_type`: + +```go +g.Attribute("service_type", g.String, func() { + g.Enum("mcp", "my-new-service") +}) +``` + +The `config` attribute is intentionally `MapOf(String, Any)` so the API schema +doesn't change when new service types are added. Config structure is validated +at the application layer (see [Validation](#validation)). + +After editing the design file, regenerate the API code: + +```sh +make -C api generate +``` + +## Validation + +There are two validation layers that catch different classes of errors. + +### API validation (HTTP 400) + +`validateServiceSpec()` in `server/internal/api/apiv1/validate.go` runs on +every API request that includes services. It performs fast, syntactic checks: + +1. **service_id**: must be a valid identifier (the shared `validateIdentifier` + function) +2. **service_type**: must be in the allowlist. Currently this is a direct check: + ```go + if svc.ServiceType != "mcp" { + // error: unsupported service type + } + ``` + Add your new type here. +3. **version**: must match semver format or be the literal `"latest"` +4. **host_ids**: must be unique within the service (duplicate host IDs are + rejected) +5. **config**: dispatches to a per-type config validator: + ```go + if svc.ServiceType == "mcp" { + errs = append(errs, validateMCPServiceConfig(svc.Config, ...)...) + } + ``` + +The per-type config validator is where you enforce required fields, type +correctness, and service-specific constraints. For example, +`validateMCPServiceConfig()` in the same file: + +- Requires `llm_provider` and `llm_model` +- Validates `llm_provider` is one of `"anthropic"`, `"openai"`, `"ollama"` +- Requires the provider-specific API key (`anthropic_api_key`, `openai_api_key`, + or `ollama_url`) based on the chosen provider + +Write a parallel `validateMyServiceConfig()` function for your service type and +add a dispatch branch in `validateServiceSpec()`. + +### Workflow-time validation + +`GetServiceImage()` in `server/internal/orchestrator/swarm/service_images.go` is +called during workflow execution. If the `service_type`/`version` combination is +not registered in the image registry, it returns an error that fails the +workflow task and sets the database to `"failed"` state. Note that this is +distinct from post-provision health-check failures detected by the service +instance monitor, which transition individual `ServiceInstance` records to +`"failed"` but do **not** change the parent database's state. + +This catches cases where the API validation passes (valid semver, known type) +but the specific version hasn't been registered. The E2E test +`TestProvisionMCPServiceUnsupportedVersion` in `e2e/service_provisioning_test.go` +demonstrates this: version `"99.99.99"` passes API validation but fails at +workflow time. + +## Service Image Registry + +The image registry maps `(serviceType, version)` pairs to container image +references. It lives in `server/internal/orchestrator/swarm/service_images.go`. + +### Data structures + +```go +type ServiceImage struct { + Tag string // Full image:tag reference + PostgresConstraint *host.VersionConstraint // Optional: restrict PG versions + SpockConstraint *host.VersionConstraint // Optional: restrict Spock versions +} + +type ServiceVersions struct { + images map[string]map[string]*ServiceImage // serviceType → version → image +} +``` + +### Registering a new service type + +Add your image in `NewServiceVersions()`: + +```go +func NewServiceVersions(cfg config.Config) *ServiceVersions { + versions := &ServiceVersions{...} + + // Existing MCP registration + versions.addServiceImage("mcp", "latest", &ServiceImage{ + Tag: serviceImageTag(cfg, "postgres-mcp:latest"), + }) + + // Your new service + versions.addServiceImage("my-service", "1.0.0", &ServiceImage{ + Tag: serviceImageTag(cfg, "my-service:1.0.0"), + }) + + return versions +} +``` + +`serviceImageTag()` prepends the configured registry host +(`cfg.DockerSwarm.ImageRepositoryHost`) unless the image reference already +contains a registry prefix (detected by checking if the first path component +contains a `.`, `:`, or is `localhost`). + +### Version constraints + +If your service is only compatible with specific Postgres or Spock versions, set +the constraint fields: + +```go +versions.addServiceImage("my-service", "1.0.0", &ServiceImage{ + Tag: serviceImageTag(cfg, "my-service:1.0.0"), + PostgresConstraint: &host.VersionConstraint{ + Min: host.MustParseVersion("15"), + Max: host.MustParseVersion("17"), + }, + SpockConstraint: &host.VersionConstraint{ + Min: host.MustParseVersion("4.0.0"), + }, +}) +``` + +`ValidateCompatibility()` checks these constraints against the database's +running versions during `GenerateServiceInstanceResources()` in the workflow. +Constraint failures produce errors like `"postgres version 14 does not satisfy +constraint >=15"`. + +## Resource Lifecycle + +Every service instance is represented by five resources that participate in the +standard plan/apply reconciliation cycle. These resource types are generic — +adding a new service type does **not** require new resource types. + +### Dependency chain + +```text +Phase 1: Network (swarm.network) — no dependencies + ServiceUserRole (swarm.service_user_role) — no dependencies +Phase 2: ServiceInstanceSpec (swarm.service_instance_spec) — depends on Network + ServiceUserRole +Phase 3: ServiceInstance (swarm.service_instance) — depends on ServiceUserRole + ServiceInstanceSpec +Phase 4: ServiceInstanceMonitor (monitor.service_instance) — depends on ServiceInstance +``` + +On deletion, the order reverses: monitor first, then instance, spec, and +finally the network and user role. + +### What each resource does + +**Network** (`server/internal/orchestrator/swarm/network.go`): Creates a Docker +Swarm overlay network for the database. The network is shared between Postgres +instances and service instances of the same database, so it deduplicates +naturally via identifier matching (both generate the same +`"{databaseID}-database"` name). Uses the IPAM service to allocate subnets. +Runs on `ManagerExecutor`. + +**ServiceUserRole** (`server/internal/orchestrator/swarm/service_user_role.go`): +Manages the Postgres user lifecycle for a service instance. On `Create`, it +generates a deterministic username via `database.GenerateServiceUsername()` (format: +`svc_{serviceID}_{hostID}`), generates a random 32-byte password, and creates a +Postgres role with `pgedge_application_read_only` privileges. On `Delete`, it +drops the role. Runs on `HostExecutor(postgresHostID)` because it needs Docker +access to the Postgres container for direct database connectivity. See +`docs/development/service-credentials.md` for full details on credential +generation. + +**ServiceInstanceSpec** (`server/internal/orchestrator/swarm/service_instance_spec.go`): +A virtual resource that generates the Docker Swarm `ServiceSpec`. Its +`Refresh()`, `Create()`, and `Update()` methods all call `ServiceContainerSpec()` +to compute the spec from the current inputs. The computed spec is stored in the +`Spec` field and consumed by the `ServiceInstance` resource during deployment. +`Delete()` is a no-op. Runs on `HostExecutor(hostID)`. + +**ServiceInstance** (`server/internal/orchestrator/swarm/service_instance.go`): +The resource that actually deploys the Docker Swarm service. On `Create`, it +stores an initial etcd record with `state="creating"`, then calls +`client.ServiceDeploy()` with the spec from `ServiceInstanceSpec`, waits up to +5 minutes for the service to start, and transitions the state to `"running"`. On +`Delete`, it scales the service to 0 replicas (waiting for containers to stop), +removes the Docker service, and deletes the etcd record. Runs on +`ManagerExecutor`. + +**ServiceInstanceMonitor** (`server/internal/monitor/service_instance_monitor_resource.go`): +Registers (or deregisters) a health monitor for the service instance. The +monitor periodically checks the service's `/health` endpoint and updates status +in etcd. Runs on `HostExecutor(hostID)`. + +### How resources are generated + +`GenerateServiceInstanceResources()` in +`server/internal/orchestrator/swarm/orchestrator.go` is the entry point. It: + +1. Resolves the `ServiceImage` via `GetServiceImage(serviceType, version)` +2. Validates Postgres/Spock version compatibility if constraints exist +3. Constructs the four orchestrator resources (`Network`, `ServiceUserRole`, + `ServiceInstanceSpec`, `ServiceInstance`) +4. Serializes them to `[]*resource.ResourceData` via `resource.ToResourceData()` +5. Returns a `*database.ServiceInstanceResources` wrapper + +The monitor resource is added separately in the workflow layer (see +[Workflow Integration](#workflow-integration)). + +All resource types are registered in +`server/internal/orchestrator/swarm/resources.go` via +`resource.RegisterResourceType[*T](registry, ResourceTypeConstant)`, which +enables the resource framework to deserialize stored state back into typed +structs. + +## Container Spec + +`ServiceContainerSpec()` in `server/internal/orchestrator/swarm/service_spec.go` +builds the Docker Swarm `ServiceSpec` from a `ServiceContainerSpecOptions` +struct. The options contain everything needed to build the spec: the +`ServiceSpec`, credentials, image, network IDs, database connection info, and +placement constraints. + +The generated spec configures: + +- **Placement**: pins the container to a specific Swarm node via + `node.id==` +- **Networks**: attaches to both the default bridge network (for control plane + access and external connectivity) and the database overlay network (for + Postgres connectivity) +- **Port publication**: `buildServicePortConfig()` publishes port 8080 in host + mode. If the `port` field in the spec is nil, no port is published. If it's + 0, Docker assigns a random port. If it's a specific value, that port is used. +- **Health check**: currently configured to `curl -f http://localhost:8080/health` + with a 30s start period, 10s interval, 5s timeout, and 3 retries +- **Resource limits**: CPU and memory limits from the spec, if provided + +> [!NOTE] +> How configuration reaches the service container (environment variables, config +> files, mounts, etc.) is still evolving and will vary per service type. See +> `buildServiceEnvVars()` in `service_spec.go` for the current MCP +> implementation, but expect this area to change as new service types are added. + +For a new service type, you may need to: + +- Extend `ServiceContainerSpecOptions` with service-specific fields +- Add branches in `ServiceContainerSpec()` or its helper functions to handle + the new service type's requirements (different health check endpoint, different + target port, mount points, entrypoint, etc.) + +## Workflow Integration + +The workflow layer is generic. No per-service-type changes are needed here. + +### PlanUpdate + +`PlanUpdate` in `server/internal/workflows/plan_update.go` is the sub-workflow +that computes the reconciliation plan. It: + +1. Computes `NodeInstances` from the spec +2. Generates node resources (same as before services existed) +3. Determines a `postgresHostID` for `ServiceUserRole` executor routing — + `ServiceUserRole.Create()` needs local Docker access to the Postgres + container, so it picks the first available Postgres host +4. Iterates `spec.Services` and for each `(service, hostID)` pair, calls + `getServiceResources()` +5. Passes both node and service resources to `operations.UpdateDatabase()` + +### getServiceResources + +`getServiceResources()` in the same file builds an `operations.ServiceResources` +for a single service instance: + +1. Generates the `ServiceInstanceID` via + `database.GenerateServiceInstanceID(databaseID, serviceID, hostID)` +2. Resolves the Postgres hostname via `findPostgresInstance()`, which prefers a + co-located instance (same host as the service) for lower latency but falls + back to any instance in the database +3. Constructs a `database.ServiceInstanceSpec` with all the inputs +4. Fires the `GenerateServiceInstanceResources` activity (executes on the + manager queue) +5. Wraps the result in `operations.ServiceResources`, adding the + `ServiceInstanceMonitorResource` + +### EndState + +`EndState()` in `server/internal/database/operations/end.go` merges service +resources into the desired end state: + +```go +for _, svc := range services { + state, err := svc.State() + end.Merge(state) +} +``` + +Service resources always land in the final plan, after all node operations. This +is because intermediate states (from `UpdateNodes`, `AddNodes`, +`PopulateNodes`) only contain node diffs. `PlanAll` produces one plan per state +transition, so services end up in the last plan (the diff from the last +intermediate state to EndState). + +Resources that exist in the current state but are absent from the end state are +automatically marked `PendingDeletion` by the plan engine, which generates +delete events in reverse dependency order. + +## Domain Model + +### ServiceSpec + +`server/internal/database/spec.go`: + +```go +type ServiceSpec struct { + ServiceID string `json:"service_id"` + ServiceType string `json:"service_type"` + Version string `json:"version"` + HostIDs []string `json:"host_ids"` + Config map[string]any `json:"config"` + Port *int `json:"port,omitempty"` + CPUs *float64 `json:"cpus,omitempty"` + MemoryBytes *uint64 `json:"memory,omitempty"` +} +``` + +This is the spec-level declaration that lives inside `Spec.Services`. It's +service-type-agnostic — no fields are MCP-specific. The `Config` map holds all +service-specific settings. + +### ServiceInstance + +`server/internal/database/service_instance.go`: + +The runtime artifact that tracks an individual container's state. Key fields: +`ServiceInstanceID`, `ServiceID`, `DatabaseID`, `HostID`, `State` +(`creating`/`running`/`failed`/`deleting`), `Status` (container ID, hostname, +IP, port mappings, health), and `Credentials` (`ServiceUser` with username and +password). + +### ID generation + +| Function | Format | Example | +|----------|--------|---------| +| `GenerateServiceInstanceID(dbID, svcID, hostID)` | `"{dbID}-{svcID}-{hostID}"` | `"mydb-mcp-host1"` | +| `GenerateServiceUsername(svcID, hostID)` | `"svc_{svcID}_{hostID}"` | `"svc_mcp_host1"` | +| `GenerateDatabaseNetworkID(dbID)` | `"{dbID}"` | `"mydb"` | + +`GenerateDatabaseNetworkID` returns the **resource identifier** used to look up +the overlay network in the resource registry. The actual Docker Swarm network +name is `"{databaseID}-database"` (set in the `Network.Name` field in +`orchestrator.go`). + +Usernames longer than 63 characters are truncated with a deterministic hash +suffix. See `docs/development/service-credentials.md` for details. + +### ServiceResources + +`server/internal/database/operations/common.go`: + +```go +type ServiceResources struct { + ServiceInstanceID string + Resources []*resource.ResourceData + MonitorResource resource.Resource +} +``` + +The operations-layer wrapper that bridges the orchestrator output and the +planning system. `Resources` holds the serialized orchestrator resources (from +`GenerateServiceInstanceResources`). `MonitorResource` is the +`ServiceInstanceMonitorResource`. The `State()` method merges both into a +`resource.State` for use in `EndState()`. + +## Testing + +### Unit tests + +**Container spec tests** in +`server/internal/orchestrator/swarm/service_spec_test.go`: + +Table-driven tests for `ServiceContainerSpec()` and its helpers. The test +pattern uses per-case check functions: + +```go +{ + name: "basic MCP service", + opts: &ServiceContainerSpecOptions{...}, + checks: []checkFunc{ + checkLabels(expectedLabels), + checkNetworks("bridge", "my-db-database"), + checkEnv("PGHOST=...", "PGPORT=5432", ...), + checkPlacement("node.id==swarm-node-1"), + checkHealthcheck("/health", 8080), + checkPorts(8080, 5434), + }, +} +``` + +Add test cases for your new service type here, particularly if it has different +health check endpoints, ports, or other spec differences. + +**Image registry tests** in +`server/internal/orchestrator/swarm/service_images_test.go`: + +Tests `GetServiceImage()`, `SupportedServiceVersions()`, and +`ValidateCompatibility()`. Covers both happy path (valid type + version) and +error cases (unsupported type, unregistered version, constraint violations). Add +test cases for your new service type's image registration and any version +constraints. + +### Golden plan tests + +The golden plan tests in +`server/internal/database/operations/update_database_test.go` validate that +service resources are correctly integrated into the plan/apply reconciliation +cycle. + +**How they work:** + +1. Build a start `*resource.State` representing the current state of the world +2. Build `[]*operations.NodeResources` and `[]*operations.ServiceResources` + representing the desired state +3. Call `operations.UpdateDatabase()` to compute the plan +4. Summarize the plans via `resource.SummarizePlans()` +5. Compare against a committed JSON golden file via `testutils.GoldenTest[T]` + +**Test helpers** in `server/internal/database/operations/helpers_test.go` define +stub resource types that mirror the real swarm types' `Identifier()`, +`Dependencies()`, `DiffIgnore()`, and `Executor()` without importing the swarm +package. This avoids pulling in the Docker SDK and keeps tests self-contained. +The stubs use the `orchestratorResource` embedding pattern already established +in the file. + +> [!NOTE] +> The service-specific test stubs (`makeServiceResources()` and its companion +> types) are being added as part of PLAT-412. Once merged, `makeServiceResources()` +> will construct a complete set of stub resources for a single service instance, +> serialize them to `[]*resource.ResourceData`, create the real +> `monitor.ServiceInstanceMonitorResource`, and return the +> `operations.ServiceResources` wrapper. + +**Five standard test cases** will cover the full lifecycle: + +| Test case | Start state | Services | Verifies | +|-----------|-------------|----------|----------| +| `single node with service from empty` | empty | 1 service | Service resources created in correct phase order alongside database resources | +| `single node with service no-op` | node + service | same service | Unchanged services produce an empty plan (the core regression test) | +| `add service to existing database` | node only | 1 new service | Only service create events, no database changes | +| `remove service from existing database` | node + service | nil | Service delete events in reverse dependency order, database unchanged | +| `update database node with unchanged service` | node + service | same service | Only database update events, service resources untouched | + +These test cases are generic and apply regardless of service type. To regenerate +golden files after changes: + +```sh +go test ./server/internal/database/operations/... -run TestUpdateDatabase -update +``` + +Always review the generated JSON files in +`golden_test/TestUpdateDatabase/` before committing. + +### E2E tests + +E2E tests in `e2e/service_provisioning_test.go` validate service provisioning +against a real control plane cluster. + +Build tag: `//go:build e2e_test` + +The tests use `fixture.NewDatabaseFixture()` for auto-cleanup and poll the API +for state transitions. Key patterns to replicate for a new service type: + +| Pattern | Example test | What it validates | +|---------|-------------|-------------------| +| Single-host provision | `TestProvisionMCPService` | Service reaches `"running"` state | +| Multi-host provision | `TestProvisionMultiHostMCPService` | One instance per host, all reach `"running"` | +| Add to existing DB | `TestUpdateDatabaseAddService` | Service added without affecting database | +| Remove from DB | `TestUpdateDatabaseRemoveService` | Empty `Services` array removes the service | +| Stability | `TestUpdateDatabaseServiceStable` | Unrelated DB update doesn't recreate service (checks `created_at` and `container_id` unchanged) | +| Bad version | `TestProvisionMCPServiceUnsupportedVersion` | Unregistered version fails task, DB goes to `"failed"` | +| Recovery | `TestProvisionMCPServiceRecovery` | Failed DB recovered by updating with valid version | + +Run service E2E tests: + +```sh +make test-e2e E2E_RUN=TestProvisionMCPService +``` + +## API Usage & Verification + +This section shows how services look in the API request and response payloads. +Use these examples to verify your integration or to hand-test with `curl`. + +### Creating a Database with a Service + +`POST /v1/databases` + +```json +{ + "id": "my-app", + "spec": { + "database_name": "storefront", + "port": 5432, + "database_users": [ + { + "username": "admin", + "password": "secret", + "db_owner": true, + "attributes": ["LOGIN", "SUPERUSER"] + } + ], + "nodes": [ + { + "name": "n1", + "host_ids": ["host-1"] + } + ], + "services": [ + { + "service_id": "mcp-server", + "service_type": "mcp", + "version": "latest", + "host_ids": ["host-1"], + "port": 8080, + "config": { + "llm_provider": "anthropic", + "llm_model": "claude-sonnet-4-5", + "anthropic_api_key": "sk-ant-..." + } + } + ] + } +} +``` + +A successful response returns **HTTP 200** with a `task` object you can poll +for completion. + +### Validation Error Response + +If the request payload fails validation, the API returns **HTTP 400** with an +`APIError`. All service validation errors use the `invalid_input` error name. + +Example — missing a required MCP config field: + +```json +{ + "name": "invalid_input", + "message": "services[0].config: missing required field 'llm_provider'" +} +``` + +Other common validation errors: + +| Condition | Example `message` | +|-----------|-------------------| +| Duplicate `service_id` | `services[1]: service IDs must be unique within a database` | +| Unsupported `service_type` | `services[0].service_type: unsupported service type 'foo' (only 'mcp' is currently supported)` | +| Bad `version` format | `services[0].version: version must be in semver format (e.g., '1.0.0') or 'latest'` | +| Missing provider API key | `services[0].config: missing required field 'anthropic_api_key' for anthropic provider` | +| Unsupported `llm_provider` | `services[0].config[llm_provider]: unsupported llm_provider 'foo' (must be one of: anthropic, openai, ollama)` | + +### Reading a Database with Service Instances + +`GET /v1/databases/my-app` returns the full database, including runtime +`service_instances`: + +```json +{ + "id": "my-app", + "state": "available", + "created_at": "2025-06-01T12:00:00Z", + "updated_at": "2025-06-01T12:05:00Z", + "spec": { + "services": [ + { + "service_id": "mcp-server", + "service_type": "mcp", + "version": "latest", + "host_ids": ["host-1"], + "port": 8080, + "config": { + "llm_provider": "anthropic", + "llm_model": "claude-sonnet-4-5" + } + } + ] + }, + "service_instances": [ + { + "service_instance_id": "my-app-mcp-server-host-1", + "service_id": "mcp-server", + "database_id": "my-app", + "host_id": "host-1", + "state": "running", + "status": { + "container_id": "a1b2c3d4e5f6", + "image_version": "latest", + "hostname": "mcp-server-host-1.internal", + "ipv4_address": "10.0.1.5", + "ports": [ + { + "name": "http", + "container_port": 8080, + "host_port": 8080 + } + ], + "health_check": { + "status": "healthy", + "message": "Service responding normally", + "checked_at": "2025-06-01T12:04:50Z" + }, + "last_health_at": "2025-06-01T12:04:50Z", + "service_ready": true + }, + "created_at": "2025-06-01T12:00:30Z", + "updated_at": "2025-06-01T12:01:00Z" + } + ] +} +``` + +> [!NOTE] +> Sensitive config keys (`anthropic_api_key`, `openai_api_key`, and any key +> matching patterns like `password`, `secret`, `token`, `credential`, +> `private_key`, `access_key`) are **stripped from all API responses**. The +> `config` object in `spec.services` will only contain non-sensitive keys. + +### Updating Services on an Existing Database + +Services are managed through the `spec.services` array in +`PUT /v1/databases/{id}`. The control plane diffs the desired state against +the current state and creates, updates, or deletes service instances +accordingly. + +**Add a service** — include it in the `services` array: + +```json +{ + "spec": { + "services": [ + { + "service_id": "mcp-server", + "service_type": "mcp", + "version": "latest", + "host_ids": ["host-1"], + "config": { + "llm_provider": "anthropic", + "llm_model": "claude-sonnet-4-5", + "anthropic_api_key": "sk-ant-..." + } + }, + { + "service_id": "mcp-analytics", + "service_type": "mcp", + "version": "1.0.0", + "host_ids": ["host-2"], + "config": { + "llm_provider": "openai", + "llm_model": "gpt-4", + "openai_api_key": "sk-..." + } + } + ] + } +} +``` + +**Remove a service** — omit it from the `services` array. The control plane +deletes the corresponding service instances: + +```json +{ + "spec": { + "services": [] + } +} +``` + +**Update a service** — change fields (e.g., `version`, `config`, `host_ids`) +in the existing entry. The control plane will update or recreate service +instances as needed. + +### Service Health & Failure in API State + +Service instance health is tracked by the **service instance monitor**, which +periodically polls the Docker Swarm orchestrator for container status. The +results are reflected in the `service_instances` array of the database +response. + +**State lifecycle:** + +| State | Meaning | +|-------|---------| +| `creating` | Instance provisioned; waiting for container to become healthy (up to 5 min timeout) | +| `running` | Container is healthy and accepting requests | +| `failed` | Container health check failed, creation timed out, or container disappeared | +| `deleting` | Instance is being removed | + +**How failures surface:** + +- When a `running` instance's health check fails, the monitor transitions it + to `failed` and populates the `error` field with a diagnostic message (e.g., + `"container is no longer healthy"`). +- When a `creating` instance exceeds the 5-minute creation timeout, it + transitions to `failed` with an error like + `"creation timeout after 5m0s - container not healthy"`. +- If the container disappears entirely (nil status from orchestrator), a + `running` instance transitions to `failed` after a 30-second grace period + with `"container status not available"`. + +Example of a failed service instance in the API response: + +```json +{ + "service_instance_id": "my-app-mcp-server-host-1", + "service_id": "mcp-server", + "database_id": "my-app", + "host_id": "host-1", + "state": "failed", + "status": null, + "error": "creation timeout after 5m0s - no status available", + "created_at": "2025-06-01T12:00:30Z", + "updated_at": "2025-06-01T12:05:30Z" +} +``` + +> [!NOTE] +> Service instance failures detected by the monitor do **not** automatically +> change the parent database's `state`. A database can be `"available"` while +> one or more of its service instances are `"failed"`. This is distinct from +> provisioning/workflow failures (e.g., an unregistered image version), which +> do set the database to `"failed"` because the workflow itself fails. Monitor +> both `database.state` and `service_instances[].state` for a complete health +> picture. + +## Checklist: Adding a New Service Type + +| Step | File | Change | +|------|------|--------| +| 1. API enum | `api/apiv1/design/database.go` | Add to `g.Enum(...)` on `service_type` | +| 2. Regenerate | — | `make -C api generate` | +| 3. Validation | `server/internal/api/apiv1/validate.go` | Add type to allowlist in `validateServiceSpec()`; add `validateMyServiceConfig()` function | +| 4. Image registry | `server/internal/orchestrator/swarm/service_images.go` | Call `versions.addServiceImage()` in `NewServiceVersions()` | +| 5. Container spec | `server/internal/orchestrator/swarm/service_spec.go` | Service-specific configuration delivery, health check, mounts, entrypoint | +| 6. Unit tests | `swarm/service_spec_test.go`, `swarm/service_images_test.go` | Add cases for new type | +| 7. Golden plan tests | `operations/update_database_test.go` | Already covered generically; regenerate with `-update` if resource shape changes | +| 8. E2E tests | `e2e/service_provisioning_test.go` | Add provision, lifecycle, stability, and failure/recovery tests | + +## What Doesn't Change + +The following are service-type-agnostic and require no modification: + +- `ServiceSpec` struct — `server/internal/database/spec.go` +- `ServiceInstance` domain model — `server/internal/database/service_instance.go` +- Workflow code — `server/internal/workflows/plan_update.go` +- Resource types — `server/internal/orchestrator/swarm/` (all five resource + types are generic) +- `GenerateServiceInstanceResources()` — + `server/internal/orchestrator/swarm/orchestrator.go` +- Operations layer — `server/internal/database/operations/` (`UpdateDatabase`, + `EndState`) +- Store/etcd layer + +## Future Work + +- **Read/write service user accounts**: Service users are currently provisioned + with the `pgedge_application_read_only` role. Some service types will require + write access (`INSERT`, `UPDATE`, `DELETE`, DDL). This will require a + mechanism for the service spec to declare the required access level and for + `ServiceUserRole` to provision the appropriate role accordingly. +- **Primary-aware database connection routing**: Services currently connect to a + Postgres instance resolved at provisioning time by `findPostgresInstance()`, + which prefers a co-located instance but does not distinguish between primary + and replica. Services that require read/write access will need their database + connection routed to the current primary, and that routing will need to + survive failovers and switchovers. +- **Persistent bind mounts**: Service containers currently have no persistent + storage (`Mounts: []mount.Mount{}`). Some services need configuration or + application state that survives container restarts (e.g., token stores, local + databases, generated config files). This will require adding bind mount + support to `ServiceContainerSpec()`, along with corresponding filesystem + directory resources to manage the host-side paths — following the same pattern + used by Patroni and pgBackRest config resources. + +## Appendix: MCP Reference Implementation for AI-Assisted Development + +> [!NOTE] +> This section is designed for consumption by Claude Code (or similar AI +> assistants). When a developer asks Claude to help add a new service type, point +> it at this document. The code below provides complete, copy-editable reference +> implementations from MCP at each touch point, so the assistant can work from +> concrete examples rather than having to read every source file independently. + +### A.1 API Enum (`api/apiv1/design/database.go`) + +The `ServiceSpec` Goa type. To add a new service type, add it to the `g.Enum()` +call on `service_type`: + +```go +// lines 125–191 +var ServiceSpec = g.Type("ServiceSpec", func() { + g.Attribute("service_id", Identifier, func() { + g.Description("The unique identifier for this service.") + g.Example("mcp-server") + g.Example("analytics-service") + g.Meta("struct:tag:json", "service_id") + }) + g.Attribute("service_type", g.String, func() { + g.Description("The type of service to run.") + g.Enum("mcp") // ← add new type here, e.g. g.Enum("mcp", "my-service") + g.Example("mcp") + g.Meta("struct:tag:json", "service_type") + }) + g.Attribute("version", g.String, func() { + g.Description("The version of the service in semver format (e.g., '1.0.0') or the literal 'latest'.") + g.Pattern(serviceVersionPattern) // `^\d+\.\d+\.\d+|latest$` + g.Example("1.0.0") + g.Example("latest") + g.Meta("struct:tag:json", "version") + }) + g.Attribute("host_ids", HostIDs, func() { + g.Description("The IDs of the hosts that should run this service.") + g.MinLength(1) + g.Meta("struct:tag:json", "host_ids") + }) + g.Attribute("port", g.Int, func() { + g.Description("The port to publish the service on the host.") + g.Minimum(0) + g.Maximum(65535) + g.Meta("struct:tag:json", "port,omitempty") + }) + g.Attribute("config", g.MapOf(g.String, g.Any), func() { + g.Description("Service-specific configuration.") + g.Meta("struct:tag:json", "config") + }) + g.Attribute("cpus", g.String, func() { + g.Description("CPU limit. Accepts SI suffix 'm', e.g. '500m'.") + g.Pattern(cpuPattern) + g.Meta("struct:tag:json", "cpus,omitempty") + }) + g.Attribute("memory", g.String, func() { + g.Description("Memory limit in SI or IEC notation, e.g. '512M', '1GiB'.") + g.MaxLength(16) + g.Meta("struct:tag:json", "memory,omitempty") + }) + + g.Required("service_id", "service_type", "version", "host_ids", "config") +}) +``` + +After editing, run `make -C api generate`. + +### A.2 Validation (`server/internal/api/apiv1/validate.go`) + +**`validateServiceSpec()`** — the dispatcher. Add your type to the allowlist and +config dispatch: + +```go +// lines 229–280 +func validateServiceSpec(svc *api.ServiceSpec, path []string) []error { + var errs []error + + serviceIDPath := appendPath(path, "service_id") + errs = append(errs, validateIdentifier(string(svc.ServiceID), serviceIDPath)) + + // ← add your type to this check + if svc.ServiceType != "mcp" { + err := fmt.Errorf("unsupported service type '%s' (only 'mcp' is currently supported)", svc.ServiceType) + errs = append(errs, newValidationError(err, appendPath(path, "service_type"))) + } + + if svc.Version != "latest" && !semverPattern.MatchString(svc.Version) { + err := errors.New("version must be in semver format (e.g., '1.0.0') or 'latest'") + errs = append(errs, newValidationError(err, appendPath(path, "version"))) + } + + seenHostIDs := make(ds.Set[string], len(svc.HostIds)) + for i, hostID := range svc.HostIds { + hostIDStr := string(hostID) + hostIDPath := appendPath(path, "host_ids", arrayIndexPath(i)) + errs = append(errs, validateIdentifier(hostIDStr, hostIDPath)) + if seenHostIDs.Has(hostIDStr) { + err := errors.New("host IDs must be unique within a service") + errs = append(errs, newValidationError(err, hostIDPath)) + } + seenHostIDs.Add(hostIDStr) + } + + // ← add dispatch for your type here + if svc.ServiceType == "mcp" { + errs = append(errs, validateMCPServiceConfig(svc.Config, appendPath(path, "config"))...) + } + + if svc.Cpus != nil { + errs = append(errs, validateCPUs(svc.Cpus, appendPath(path, "cpus"))...) + } + if svc.Memory != nil { + errs = append(errs, validateMemory(svc.Memory, appendPath(path, "memory"))...) + } + + return errs +} +``` + +**`validateMCPServiceConfig()`** — the per-type config validator to use as a +template: + +```go +// lines 283–330 +func validateMCPServiceConfig(config map[string]any, path []string) []error { + var errs []error + + requiredFields := []string{"llm_provider", "llm_model"} + for _, field := range requiredFields { + if _, ok := config[field]; !ok { + err := fmt.Errorf("missing required field '%s'", field) + errs = append(errs, newValidationError(err, path)) + } + } + + if val, exists := config["llm_provider"]; exists { + provider, ok := val.(string) + if !ok { + err := errors.New("llm_provider must be a string") + errs = append(errs, newValidationError(err, appendPath(path, mapKeyPath("llm_provider")))) + } else { + validProviders := []string{"anthropic", "openai", "ollama"} + if !slices.Contains(validProviders, provider) { + err := fmt.Errorf("unsupported llm_provider '%s' (must be one of: %s)", + provider, strings.Join(validProviders, ", ")) + errs = append(errs, newValidationError(err, appendPath(path, mapKeyPath("llm_provider")))) + } + + switch provider { + case "anthropic": + if _, ok := config["anthropic_api_key"]; !ok { + err := errors.New("missing required field 'anthropic_api_key' for anthropic provider") + errs = append(errs, newValidationError(err, path)) + } + case "openai": + if _, ok := config["openai_api_key"]; !ok { + err := errors.New("missing required field 'openai_api_key' for openai provider") + errs = append(errs, newValidationError(err, path)) + } + case "ollama": + if _, ok := config["ollama_url"]; !ok { + err := errors.New("missing required field 'ollama_url' for ollama provider") + errs = append(errs, newValidationError(err, path)) + } + } + } + } + + return errs +} +``` + +### A.3 Image Registry (`server/internal/orchestrator/swarm/service_images.go`) + +Register the image in `NewServiceVersions()`: + +```go +// lines 39–68 +func NewServiceVersions(cfg config.Config) *ServiceVersions { + versions := &ServiceVersions{ + cfg: cfg, + images: make(map[string]map[string]*ServiceImage), + } + + // MCP service versions + versions.addServiceImage("mcp", "latest", &ServiceImage{ + Tag: serviceImageTag(cfg, "postgres-mcp:latest"), + }) + + // ← add your service here: + // versions.addServiceImage("my-service", "1.0.0", &ServiceImage{ + // Tag: serviceImageTag(cfg, "my-service:1.0.0"), + // }) + + return versions +} +``` + +Supporting types: + +```go +type ServiceImage struct { + Tag string `json:"tag"` + PostgresConstraint *host.VersionConstraint `json:"postgres_constraint,omitempty"` + SpockConstraint *host.VersionConstraint `json:"spock_constraint,omitempty"` +} + +// GetServiceImage resolves (serviceType, version) → *ServiceImage. +// Returns an error if the type or version is unregistered. +func (sv *ServiceVersions) GetServiceImage(serviceType string, version string) (*ServiceImage, error) { + versionMap, ok := sv.images[serviceType] + if !ok { + return nil, fmt.Errorf("unsupported service type %q", serviceType) + } + image, ok := versionMap[version] + if !ok { + return nil, fmt.Errorf("unsupported version %q for service type %q", version, serviceType) + } + return image, nil +} + +// serviceImageTag prepends the configured registry host unless the image +// reference already contains a registry prefix. +func serviceImageTag(cfg config.Config, imageRef string) string { + if strings.Contains(imageRef, "/") { + parts := strings.Split(imageRef, "/") + firstPart := parts[0] + if strings.Contains(firstPart, ".") || strings.Contains(firstPart, ":") || firstPart == "localhost" { + return imageRef + } + } + if cfg.DockerSwarm.ImageRepositoryHost == "" { + return imageRef + } + return fmt.Sprintf("%s/%s", cfg.DockerSwarm.ImageRepositoryHost, imageRef) +} +``` + +### A.4 Container Spec (`server/internal/orchestrator/swarm/service_spec.go`) + +**`ServiceContainerSpecOptions`** — the input struct: + +```go +// lines 15–32 +type ServiceContainerSpecOptions struct { + ServiceSpec *database.ServiceSpec + ServiceInstanceID string + DatabaseID string + DatabaseName string + HostID string + ServiceName string + Hostname string + CohortMemberID string + ServiceImage *ServiceImage + Credentials *database.ServiceUser + DatabaseNetworkID string + DatabaseHost string + DatabasePort int + Port *int +} +``` + +**`ServiceContainerSpec()`** — builds the Docker Swarm `ServiceSpec`: + +```go +// lines 35–117 +func ServiceContainerSpec(opts *ServiceContainerSpecOptions) (swarm.ServiceSpec, error) { + labels := map[string]string{ + "pgedge.component": "service", + "pgedge.service.instance.id": opts.ServiceInstanceID, + "pgedge.service.id": opts.ServiceSpec.ServiceID, + "pgedge.database.id": opts.DatabaseID, + "pgedge.host.id": opts.HostID, + } + + networks := []swarm.NetworkAttachmentConfig{ + {Target: "bridge"}, + {Target: opts.DatabaseNetworkID}, + } + + env := buildServiceEnvVars(opts) + image := opts.ServiceImage.Tag + ports := buildServicePortConfig(opts.Port) + + var resources *swarm.ResourceRequirements + if opts.ServiceSpec.CPUs != nil || opts.ServiceSpec.MemoryBytes != nil { + resources = &swarm.ResourceRequirements{ + Limits: &swarm.Limit{}, + } + if opts.ServiceSpec.CPUs != nil { + resources.Limits.NanoCPUs = int64(*opts.ServiceSpec.CPUs * 1e9) + } + if opts.ServiceSpec.MemoryBytes != nil { + resources.Limits.MemoryBytes = int64(*opts.ServiceSpec.MemoryBytes) + } + } + + return swarm.ServiceSpec{ + TaskTemplate: swarm.TaskSpec{ + ContainerSpec: &swarm.ContainerSpec{ + Image: image, + Labels: labels, + Hostname: opts.Hostname, + Env: env, + Healthcheck: &container.HealthConfig{ + Test: []string{"CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"}, + StartPeriod: time.Second * 30, + Interval: time.Second * 10, + Timeout: time.Second * 5, + Retries: 3, + }, + Mounts: []mount.Mount{}, + }, + Networks: networks, + Placement: &swarm.Placement{ + Constraints: []string{ + "node.id==" + opts.CohortMemberID, + }, + }, + Resources: resources, + }, + EndpointSpec: &swarm.EndpointSpec{ + Mode: swarm.ResolutionModeVIP, + Ports: ports, + }, + Annotations: swarm.Annotations{ + Name: opts.ServiceName, + Labels: labels, + }, + }, nil +} +``` + +**`buildServiceEnvVars()`** — MCP-specific env var injection (this is expected to +change; see the caution at the top of this document): + +```go +// lines 120–168 +func buildServiceEnvVars(opts *ServiceContainerSpecOptions) []string { + env := []string{ + fmt.Sprintf("PGHOST=%s", opts.DatabaseHost), + fmt.Sprintf("PGPORT=%d", opts.DatabasePort), + fmt.Sprintf("PGDATABASE=%s", opts.DatabaseName), + "PGSSLMODE=prefer", + fmt.Sprintf("PGEDGE_SERVICE_ID=%s", opts.ServiceSpec.ServiceID), + fmt.Sprintf("PGEDGE_DATABASE_ID=%s", opts.DatabaseID), + } + + if opts.Credentials != nil { + env = append(env, + fmt.Sprintf("PGUSER=%s", opts.Credentials.Username), + fmt.Sprintf("PGPASSWORD=%s", opts.Credentials.Password), + ) + } + + // MCP-specific config → env var mapping + if provider, ok := opts.ServiceSpec.Config["llm_provider"].(string); ok { + env = append(env, fmt.Sprintf("PGEDGE_LLM_PROVIDER=%s", provider)) + } + if model, ok := opts.ServiceSpec.Config["llm_model"].(string); ok { + env = append(env, fmt.Sprintf("PGEDGE_LLM_MODEL=%s", model)) + } + + if provider, ok := opts.ServiceSpec.Config["llm_provider"].(string); ok { + switch provider { + case "anthropic": + if key, ok := opts.ServiceSpec.Config["anthropic_api_key"].(string); ok { + env = append(env, fmt.Sprintf("PGEDGE_ANTHROPIC_API_KEY=%s", key)) + } + case "openai": + if key, ok := opts.ServiceSpec.Config["openai_api_key"].(string); ok { + env = append(env, fmt.Sprintf("PGEDGE_OPENAI_API_KEY=%s", key)) + } + case "ollama": + if url, ok := opts.ServiceSpec.Config["ollama_url"].(string); ok { + env = append(env, fmt.Sprintf("PGEDGE_OLLAMA_URL=%s", url)) + } + } + } + + return env +} +``` + +**`buildServicePortConfig()`** — port publication: + +```go +// lines 175–196 +func buildServicePortConfig(port *int) []swarm.PortConfig { + if port == nil { + return nil + } + config := swarm.PortConfig{ + PublishMode: swarm.PortConfigPublishModeHost, + TargetPort: 8080, + Name: "http", + Protocol: swarm.PortConfigProtocolTCP, + } + if *port > 0 { + config.PublishedPort = uint32(*port) + } else if *port == 0 { + config.PublishedPort = 0 + } + return []swarm.PortConfig{config} +} +``` + +### A.5 Domain Model (`server/internal/database/spec.go`) + +```go +// lines 116–125 +type ServiceSpec struct { + ServiceID string `json:"service_id"` + ServiceType string `json:"service_type"` + Version string `json:"version"` + HostIDs []string `json:"host_ids"` + Config map[string]any `json:"config"` + Port *int `json:"port,omitempty"` + CPUs *float64 `json:"cpus,omitempty"` + MemoryBytes *uint64 `json:"memory,omitempty"` +} +``` + +This struct is service-type-agnostic. No changes needed when adding a new type. + +### A.6 E2E Test Pattern (`e2e/service_provisioning_test.go`) + +Complete example of provisioning a service and verifying it reaches `"running"` +state: + +```go +// lines 18–147 +func TestProvisionMCPService(t *testing.T) { + t.Parallel() + + host1 := fixture.HostIDs()[0] + + ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute) + defer cancel() + + t.Log("Creating database with MCP service") + + db := fixture.NewDatabaseFixture(ctx, t, &controlplane.CreateDatabaseRequest{ + Spec: &controlplane.DatabaseSpec{ + DatabaseName: "test_mcp_service", + DatabaseUsers: []*controlplane.DatabaseUserSpec{ + { + Username: "admin", + Password: pointerTo("testpassword"), + DbOwner: pointerTo(true), + Attributes: []string{"LOGIN", "SUPERUSER"}, + }, + }, + Port: pointerTo(0), + Nodes: []*controlplane.DatabaseNodeSpec{ + { + Name: "n1", + HostIds: []controlplane.Identifier{controlplane.Identifier(host1)}, + }, + }, + Services: []*controlplane.ServiceSpec{ + { + ServiceID: "mcp-server", + ServiceType: "mcp", + Version: "latest", + HostIds: []controlplane.Identifier{controlplane.Identifier(host1)}, + Config: map[string]any{ + "llm_provider": "anthropic", + "llm_model": "claude-sonnet-4-5", + "anthropic_api_key": "sk-ant-test-key-12345", + }, + }, + }, + }, + }) + + t.Log("Database created, verifying service instances") + + require.NotNil(t, db.ServiceInstances, "ServiceInstances should not be nil") + require.Len(t, db.ServiceInstances, 1, "Expected 1 service instance") + + serviceInstance := db.ServiceInstances[0] + assert.Equal(t, "mcp-server", serviceInstance.ServiceID) + assert.Equal(t, string(host1), serviceInstance.HostID) + assert.NotEmpty(t, serviceInstance.ServiceInstanceID) + + validStates := []string{"creating", "running"} + assert.Contains(t, validStates, serviceInstance.State) + + // Poll until running + if serviceInstance.State == "creating" { + t.Log("Service is still creating, waiting for running...") + maxWait := 5 * time.Minute + pollInterval := 5 * time.Second + deadline := time.Now().Add(maxWait) + + for time.Now().Before(deadline) { + err := db.Refresh(ctx) + require.NoError(t, err) + if len(db.ServiceInstances) > 0 && db.ServiceInstances[0].State == "running" { + break + } + time.Sleep(pollInterval) + } + + require.Len(t, db.ServiceInstances, 1) + assert.Equal(t, "running", db.ServiceInstances[0].State) + } + + // Verify status/connection info + serviceInstance = db.ServiceInstances[0] + if serviceInstance.Status != nil { + assert.NotNil(t, serviceInstance.Status.Hostname) + assert.NotNil(t, serviceInstance.Status.Ipv4Address) + + foundHTTPPort := false + for _, port := range serviceInstance.Status.Ports { + if port.Name == "http" && port.ContainerPort != nil && *port.ContainerPort == 8080 { + foundHTTPPort = true + break + } + } + assert.True(t, foundHTTPPort, "HTTP port (8080) should be configured") + } +} +``` + +### A.7 Step-by-Step Instructions for Adding a New Service + +When a developer asks you to add a new service type (e.g., `"my-service"`), +follow these steps in order: + +1. **`api/apiv1/design/database.go`**: Add `"my-service"` to the `g.Enum()` + call on `service_type` (see [A.1](#a1-api-enum-apiapiv1designdatabasego)). + Run `make -C api generate`. + +2. **`server/internal/api/apiv1/validate.go`**: Change the allowlist check in + `validateServiceSpec()` to accept `"my-service"`. Write a + `validateMyServiceConfig()` function following the pattern in + `validateMCPServiceConfig()` (see [A.2](#a2-validation-serverinternalapiapiv1validatego)). + Add a dispatch branch in `validateServiceSpec()`. + +3. **`server/internal/orchestrator/swarm/service_images.go`**: Add a + `versions.addServiceImage("my-service", ...)` call in `NewServiceVersions()` + (see [A.3](#a3-image-registry-serverinternalorchestratorswarmservice_imagesgo)). + +4. **`server/internal/orchestrator/swarm/service_spec.go`**: If your service + needs different environment variables, a different health check endpoint, + different port, or bind mounts, modify `ServiceContainerSpec()` and/or its + helpers to branch on `ServiceType` (see + [A.4](#a4-container-spec-serverinternalorchestratorswarmservice_specgo)). + +5. **`server/internal/orchestrator/swarm/service_spec_test.go`**: Add table-driven + test cases for the new service type's container spec, env vars, and port config. + +6. **`server/internal/orchestrator/swarm/service_images_test.go`**: Add test + cases for `GetServiceImage()` with the new type and version(s). + +7. **`e2e/service_provisioning_test.go`**: Add E2E tests following the patterns + in [A.6](#a6-e2e-test-pattern-e2eservice_provisioning_testgo): single-host + provision, multi-host, add/remove from existing database, stability (unrelated + update doesn't recreate), bad version failure + recovery. + +**Files that do NOT need changes**: `spec.go`, `service_instance.go`, +`plan_update.go`, `orchestrator.go`, `resources.go`, `end.go`, any store/etcd +code.