Skip to content

feat: scale-to-zero NAT instances for AWS#1

Merged
leonardosul merged 30 commits intomainfrom
feat/init
Feb 26, 2026
Merged

feat: scale-to-zero NAT instances for AWS#1
leonardosul merged 30 commits intomainfrom
feat/init

Conversation

@leonardosul
Copy link
Contributor

Summary

  • Terraform module for scale-to-zero NAT instances on AWS
  • Go Lambda for EventBridge-driven NAT lifecycle management
  • CI workflows (pre-commit, go tests, integration tests)
  • Release-please for automated versioning and lambda.zip builds
  • MkDocs documentation site

Test plan

  • Pre-commit checks pass
  • Go unit tests pass
  • Integration tests pass (manual trigger via workflow_dispatch)

🤖 Generated with Claude Code

Terraform module for scale-to-zero NAT instances on AWS. Includes:
- Go Lambda for EventBridge-driven NAT lifecycle management
- Pre-commit hooks (terraform fmt/tflint/docs, go fmt/vet/staticcheck)
- Integration tests against real AWS infrastructure
- Release-please for automated versioning and lambda.zip builds
- MkDocs documentation site

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
leonardosul and others added 2 commits February 24, 2026 17:40
workflow_dispatch runs don't report as PR checks. Add pull_request
with path filters so it triggers on relevant changes and shows on PRs.
do_not_enforce_on_create in the ruleset means it won't block PRs that
don't touch these paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Integration tests run when the 'integration-test' label is added to a
PR, not on every push. Add the label to trigger, remove and re-add to
re-trigger. Also available via workflow_dispatch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@leonardosul leonardosul added the integration-test Triggers integration tests label Feb 24, 2026
@leonardosul leonardosul added integration-test Triggers integration tests and removed integration-test Triggers integration tests labels Feb 24, 2026
@leonardosul leonardosul added integration-test Triggers integration tests and removed integration-test Triggers integration tests labels Feb 24, 2026
- Rewrite README and docs to be more welcoming and compelling
- Highlight Go rewrite (90% faster cold starts), real integration tests
- Escape $ signs in markdown to prevent GitHub LaTeX rendering
- Deduplicate terraform-docs: single source of truth via pre-commit hooks
  injecting into README.md and replacing docs/REFERENCE.md
- Remove docs/README.md (replaced by docs/REFERENCE.md)
- Move SECURITY.md and WORKFLOWS.md into docs/
- Add WORKFLOWS.md documenting CI/CD workflows and repo rulesets
- Pin integration test fixture to us-east-1a (t4g.nano unsupported in us-east-1e)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@leonardosul leonardosul added integration-test Triggers integration tests and removed integration-test Triggers integration tests labels Feb 24, 2026
Add a DescribeNetworkInterfaces re-check in attachEIP after allocation
to detect when another invocation already attached an EIP, and add an
orphan EIP sweep in detachEIP to clean up any leaked allocations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@leonardosul leonardosul added integration-test Triggers integration tests and removed integration-test Triggers integration tests labels Feb 24, 2026
The integration test NATScaleDown phase was failing because
findSiblings counted the dying workload as a "sibling" — EventBridge
fires the shutting-down event before DescribeInstances reflects the new
state, so the workload still appeared as "running" in the API response.

Three fixes:

1. findSiblings now accepts an excludeID parameter so the triggering
   instance is never counted as a sibling.

2. maybeStopNAT retries when siblings ARE found (letting eventual
   consistency settle) instead of giving up immediately on the first
   false positive.

3. When classify can't find a terminated instance (already gone from
   the API), the handler falls back to sweepIdleNATs which checks all
   running NATs in the VPC and stops any with no siblings.

Also uses StopInstances Force=true since NAT instances are stateless
forwarders with no filesystem to flush.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@leonardosul leonardosul added integration-test Triggers integration tests and removed integration-test Triggers integration tests labels Feb 25, 2026
The Lambda's cleanupAll (invoked by terraform destroy) called
TerminateInstances but returned immediately without waiting. Since
the module's ENIs use delete_on_termination=false, they remain attached
until the instance fully terminates. If Terraform proceeds to delete
ENIs while the instance is still shutting down, it needs
ec2:DetachNetworkInterface which users may not have.

Now polls DescribeInstances until all terminated instances disappear,
guaranteeing ENIs are detached before the Lambda returns and Terraform
continues its destroy plan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@leonardosul leonardosul added integration-test Triggers integration tests and removed integration-test Triggers integration tests labels Feb 25, 2026
@leonardosul leonardosul added integration-test Triggers integration tests and removed integration-test Triggers integration tests labels Feb 25, 2026
Rewrite docs with KISS philosophy: clear decision matrix tables,
accurate performance numbers from real CloudWatch data, remove
outdated event-driven/reactive language and Python comparison history.
Net reduction of ~388 lines.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@leonardosul leonardosul added integration-test Triggers integration tests and removed integration-test Triggers integration tests labels Feb 26, 2026
@leonardosul leonardosul added integration-test Triggers integration tests and removed integration-test Triggers integration tests labels Feb 26, 2026
The cleanup action's waitForTermination can exceed 30s when polling
for NAT instance termination during terraform destroy. This caused
Sandbox.Timedout errors and left orphaned resources.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@leonardosul leonardosul added integration-test Triggers integration tests and removed integration-test Triggers integration tests labels Feb 26, 2026
The test's Phase 4 (CleanupAction) invoked the cleanup Lambda while
the Phase 3 workload was still running. When cleanup terminated the
NAT, EventBridge delivered the terminated event, and the reconciler
saw workloads=1 and created a new NAT. This zombie NAT then caused
terraform destroy's cleanup invocation to timeout waiting for
termination.

Fix: terminate all test workloads before invoking cleanup, matching
the production destroy ordering where Terraform deletes the
EventBridge target (stopping new events) before running cleanup.

Also increase Lambda timeout to 120s — the cleanup path with
waitForTermination genuinely needs >30s (observed 32s in CI).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@leonardosul leonardosul added integration-test Triggers integration tests and removed integration-test Triggers integration tests labels Feb 26, 2026
leonardosul and others added 8 commits February 26, 2026 10:52
The previous fix waited for WaitUntilInstanceTerminated (90+ seconds),
during which EventBridge drove normal scale-down — releasing the EIP
before the test could assert it existed. Now we just wait for workloads
to leave pending/running state (a few seconds), which is enough to
prevent the reconciler from recreating NATs after cleanup.

Removed the pre-cleanup EIP count assertion since it's not what Phase 4
tests — the point is verifying cleanup terminates NATs and releases
EIPs, not that they exist beforehand.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
AWS EventBridge rules/targets are eventually consistent — events that
fire within seconds of target creation may be silently dropped. The
Lambda permission also needs time to propagate.

Previously, terraform apply could complete and workloads could launch
before EventBridge was ready to deliver events to the Lambda. This
caused the first workload's pending/running events to be lost, leaving
the NAT uncreated until a later event (like workload termination)
happened to trigger reconciliation.

Fix: add a 15-second time_sleep after the EventBridge target and
Lambda permission are created. Module outputs depend on this sleep,
so terraform apply doesn't return until the event pipeline is ready.

Also reduce Lambda timeout from 120s to 90s (cleanup path with
waitForTermination needs ~35s in practice).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When a NAT is created, EventBridge fires events immediately but EC2 API
queries (both filter-based and by-ID) can return stale data for several
seconds. This caused two failure modes:

1. findNATs() returned empty on NAT pending event → Lambda tried to
   create duplicate NAT → failed with "ENI in use" error

2. NAT showed "pending" in API when actually "running" → Lambda logged
   "waiting" and returned → EIP never attached

Fix:
- Pass trigger instance from resolveAZ() to reconcile()
- If trigger is a NAT that findNATs() missed, add it to the list
- If event says "running" but API says "pending", trust the event

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When EventBridge fires a "stopped" event for a NAT instance, the EC2 API
may still return "stopping" due to eventual consistency. Previously the
handler would wait for another event that never comes, leaving the EIP
attached indefinitely.

Now we trust the event state (like we already do for pending→running),
allowing EIP release to proceed immediately when the NAT is truly stopped.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Always override trigger instance state with event state (EC2 API eventual
  consistency). Simplified from two ad-hoc checks to one centralized override.
- Increase EventBridge propagation delay from 15s to 30s to reduce dropped
  events during initial deployment.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add reconciliation pattern history (control theory → CFEngine → Borg → K8s)
- Document EC2 API eventual consistency handling
- Document EventBridge propagation delay (60s wait)
- Document NAT Force stop behavior and Lambda timeout
- Clarify config versioning two-event replacement process
- Add IDE directories to .gitignore

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

integration-test Triggers integration tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant