Skip to content

fix: enable Spock repair mode for service user creation#287

Open
rshoemaker wants to merge 2 commits intomainfrom
fix/PLAT-478/svc_user_tx_with_repair_mode
Open

fix: enable Spock repair mode for service user creation#287
rshoemaker wants to merge 2 commits intomainfrom
fix/PLAT-478/svc_user_tx_with_repair_mode

Conversation

@rshoemaker
Copy link
Contributor

@rshoemaker rshoemaker commented Mar 6, 2026

fix: resolve concurrent service user creation failures

Summary

  • Enable Spock repair mode during service user role creation to prevent replication conflicts in multi-node database topologies
  • Add retry with exponential backoff to handle transient tuple concurrently updated errors when multiple service user roles are created concurrently against the same Postgres instance
  • Connect to the application database (where Spock is installed) instead of postgres so repair mode is correctly enabled

Context

TestProvisionMultiHostMCPService was failing intermittently with:

ERROR: tuple concurrently updated (SQLSTATE XX000)

When a database has multiple MCP service instances (e.g., on host-1, host-2, host-3), each gets its own ServiceUserRole resource. These execute concurrently on the same Postgres instance, and their GRANT statements can conflict at the catalog level. The fix addresses this with two layers:

  1. Repair mode prevents Spock from replicating role/grant DDL in multi-node topologies
  2. Retry (3 attempts, 500ms exponential backoff) handles same-instance concurrent catalog modifications

Verified with 30 consecutive test runs, 0 failures.

Test plan

  • Unit tests pass (make test, 640 tests)
  • TestProvisionMultiHostMCPService passes 30/30 runs (previously failed ~2-3/10)

PLAT-478

Wrap service user role creation in a transaction with Spock repair
mode enabled. This prevents "tuple concurrently updated" errors
when the same user is being created on multiple instances
simultaneously and Spock replicates the changes.
@coderabbitai
Copy link

coderabbitai bot commented Mar 6, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3c75db42-18e5-4f54-9045-604488275cfe

📥 Commits

Reviewing files that changed from the base of the PR and between c216d57 and 3f4a388.

📒 Files selected for processing (1)
  • server/internal/orchestrator/swarm/service_user_role.go

📝 Walkthrough

Walkthrough

Refactors service user role creation to retry the create flow on transient errors, perform all DDL and grant statements inside a transaction connected to the target database, enable Spock repair mode when available, and add explicit commit error handling. connectToPrimary now accepts a database name.

Changes

Cohort / File(s) Summary
Service user role implementation
server/internal/orchestrator/swarm/service_user_role.go
Reworked create flow to retry full user creation on transient SQLSTATE XX000 errors and moved role creation into a new createUserRole method. All DDL and GRANT statements now execute on a transaction handle (Begin/Commit with rollback on error). Added Spock repair-mode check/enable within the transaction. Added explicit commit error wrapping. Updated connectToPrimary signature to accept dbName and adjusted call sites to pass r.DatabaseName for create and "postgres" for delete. Switched DSN construction from AdminDSN("postgres") to AdminDSN(dbName). (+47/-7)

Poem

🐇 I hop through code with careful paws,
Began a tx, avoided flaws,
Enabled Spock, then rolled no more,
Committed safe on database floor,
Retry, log, and then applause!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change—enabling Spock repair mode for service user creation to fix concurrent tuple-update errors.
Description check ✅ Passed The description covers most required sections (Summary, Changes, Test plan), follows Conventional Commits format, and provides comprehensive context including issue link (PLAT-478) and test results.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/PLAT-478/svc_user_tx_with_repair_mode

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/internal/orchestrator/swarm/service_user_role.go`:
- Around line 130-133: The call to postgres.IsSpockEnabled().Scalar silently
hides query errors because Scalar currently returns nil on row.Scan failures;
update the Scalar implementation in the postgres statement (the method that
implements Scalar on the Statement/Query type) so that any row.Scan error is
returned to the caller instead of nil, ensuring postgres.IsSpockEnabled().Scalar
propagates the scan/DB error back to the caller (so the existing error handling
in service_user_role.go around enabled, err := ... Scalar(ctx, tx) works as
intended).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c1ad74e6-e8db-4423-8957-d107629152ec

📥 Commits

Reviewing files that changed from the base of the PR and between 87528bb and c216d57.

📒 Files selected for processing (1)
  • server/internal/orchestrator/swarm/service_user_role.go

Wrap createUserRole in a retry loop (3 attempts, exponential backoff)
to handle transient "tuple concurrently updated" errors when multiple
service user roles are created concurrently against the same Postgres
instance.

Also connect to the application database instead of "postgres" so
that Spock repair mode is correctly enabled (Spock is installed
per-database).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant