feat: scale-to-zero NAT instances for AWS#1
Merged
leonardosul merged 30 commits intomainfrom Feb 26, 2026
Merged
Conversation
Terraform module for scale-to-zero NAT instances on AWS. Includes: - Go Lambda for EventBridge-driven NAT lifecycle management - Pre-commit hooks (terraform fmt/tflint/docs, go fmt/vet/staticcheck) - Integration tests against real AWS infrastructure - Release-please for automated versioning and lambda.zip builds - MkDocs documentation site Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
workflow_dispatch runs don't report as PR checks. Add pull_request with path filters so it triggers on relevant changes and shows on PRs. do_not_enforce_on_create in the ruleset means it won't block PRs that don't touch these paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Integration tests run when the 'integration-test' label is added to a PR, not on every push. Add the label to trigger, remove and re-add to re-trigger. Also available via workflow_dispatch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rewrite README and docs to be more welcoming and compelling - Highlight Go rewrite (90% faster cold starts), real integration tests - Escape $ signs in markdown to prevent GitHub LaTeX rendering - Deduplicate terraform-docs: single source of truth via pre-commit hooks injecting into README.md and replacing docs/REFERENCE.md - Remove docs/README.md (replaced by docs/REFERENCE.md) - Move SECURITY.md and WORKFLOWS.md into docs/ - Add WORKFLOWS.md documenting CI/CD workflows and repo rulesets - Pin integration test fixture to us-east-1a (t4g.nano unsupported in us-east-1e) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a DescribeNetworkInterfaces re-check in attachEIP after allocation to detect when another invocation already attached an EIP, and add an orphan EIP sweep in detachEIP to clean up any leaked allocations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The integration test NATScaleDown phase was failing because findSiblings counted the dying workload as a "sibling" — EventBridge fires the shutting-down event before DescribeInstances reflects the new state, so the workload still appeared as "running" in the API response. Three fixes: 1. findSiblings now accepts an excludeID parameter so the triggering instance is never counted as a sibling. 2. maybeStopNAT retries when siblings ARE found (letting eventual consistency settle) instead of giving up immediately on the first false positive. 3. When classify can't find a terminated instance (already gone from the API), the handler falls back to sweepIdleNATs which checks all running NATs in the VPC and stops any with no siblings. Also uses StopInstances Force=true since NAT instances are stateless forwarders with no filesystem to flush. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Lambda's cleanupAll (invoked by terraform destroy) called TerminateInstances but returned immediately without waiting. Since the module's ENIs use delete_on_termination=false, they remain attached until the instance fully terminates. If Terraform proceeds to delete ENIs while the instance is still shutting down, it needs ec2:DetachNetworkInterface which users may not have. Now polls DescribeInstances until all terminated instances disappear, guaranteeing ENIs are detached before the Lambda returns and Terraform continues its destroy plan. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rewrite docs with KISS philosophy: clear decision matrix tables, accurate performance numbers from real CloudWatch data, remove outdated event-driven/reactive language and Python comparison history. Net reduction of ~388 lines. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The cleanup action's waitForTermination can exceed 30s when polling for NAT instance termination during terraform destroy. This caused Sandbox.Timedout errors and left orphaned resources. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The test's Phase 4 (CleanupAction) invoked the cleanup Lambda while the Phase 3 workload was still running. When cleanup terminated the NAT, EventBridge delivered the terminated event, and the reconciler saw workloads=1 and created a new NAT. This zombie NAT then caused terraform destroy's cleanup invocation to timeout waiting for termination. Fix: terminate all test workloads before invoking cleanup, matching the production destroy ordering where Terraform deletes the EventBridge target (stopping new events) before running cleanup. Also increase Lambda timeout to 120s — the cleanup path with waitForTermination genuinely needs >30s (observed 32s in CI). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The previous fix waited for WaitUntilInstanceTerminated (90+ seconds), during which EventBridge drove normal scale-down — releasing the EIP before the test could assert it existed. Now we just wait for workloads to leave pending/running state (a few seconds), which is enough to prevent the reconciler from recreating NATs after cleanup. Removed the pre-cleanup EIP count assertion since it's not what Phase 4 tests — the point is verifying cleanup terminates NATs and releases EIPs, not that they exist beforehand. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
AWS EventBridge rules/targets are eventually consistent — events that fire within seconds of target creation may be silently dropped. The Lambda permission also needs time to propagate. Previously, terraform apply could complete and workloads could launch before EventBridge was ready to deliver events to the Lambda. This caused the first workload's pending/running events to be lost, leaving the NAT uncreated until a later event (like workload termination) happened to trigger reconciliation. Fix: add a 15-second time_sleep after the EventBridge target and Lambda permission are created. Module outputs depend on this sleep, so terraform apply doesn't return until the event pipeline is ready. Also reduce Lambda timeout from 120s to 90s (cleanup path with waitForTermination needs ~35s in practice). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When a NAT is created, EventBridge fires events immediately but EC2 API queries (both filter-based and by-ID) can return stale data for several seconds. This caused two failure modes: 1. findNATs() returned empty on NAT pending event → Lambda tried to create duplicate NAT → failed with "ENI in use" error 2. NAT showed "pending" in API when actually "running" → Lambda logged "waiting" and returned → EIP never attached Fix: - Pass trigger instance from resolveAZ() to reconcile() - If trigger is a NAT that findNATs() missed, add it to the list - If event says "running" but API says "pending", trust the event Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When EventBridge fires a "stopped" event for a NAT instance, the EC2 API may still return "stopping" due to eventual consistency. Previously the handler would wait for another event that never comes, leaving the EIP attached indefinitely. Now we trust the event state (like we already do for pending→running), allowing EIP release to proceed immediately when the NAT is truly stopped. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Always override trigger instance state with event state (EC2 API eventual consistency). Simplified from two ad-hoc checks to one centralized override. - Increase EventBridge propagation delay from 15s to 30s to reduce dropped events during initial deployment. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add reconciliation pattern history (control theory → CFEngine → Borg → K8s) - Document EC2 API eventual consistency handling - Document EventBridge propagation delay (60s wait) - Document NAT Force stop behavior and Lambda timeout - Clarify config versioning two-event replacement process - Add IDE directories to .gitignore Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Test plan
🤖 Generated with Claude Code