feat: scale-to-zero NAT instances for AWS by leonardosul · Pull Request #1 · MachineDotDev/nat-zero

leonardosul · 2026-02-24T07:07:11Z

Summary

Terraform module for scale-to-zero NAT instances on AWS
Go Lambda for EventBridge-driven NAT lifecycle management
CI workflows (pre-commit, go tests, integration tests)
Release-please for automated versioning and lambda.zip builds
MkDocs documentation site

Test plan

Pre-commit checks pass
Go unit tests pass
Integration tests pass (manual trigger via workflow_dispatch)

🤖 Generated with Claude Code

Terraform module for scale-to-zero NAT instances on AWS. Includes: - Go Lambda for EventBridge-driven NAT lifecycle management - Pre-commit hooks (terraform fmt/tflint/docs, go fmt/vet/staticcheck) - Integration tests against real AWS infrastructure - Release-please for automated versioning and lambda.zip builds - MkDocs documentation site Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

workflow_dispatch runs don't report as PR checks. Add pull_request with path filters so it triggers on relevant changes and shows on PRs. do_not_enforce_on_create in the ruleset means it won't block PRs that don't touch these paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Integration tests run when the 'integration-test' label is added to a PR, not on every push. Add the label to trigger, remove and re-add to re-trigger. Also available via workflow_dispatch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Rewrite README and docs to be more welcoming and compelling - Highlight Go rewrite (90% faster cold starts), real integration tests - Escape $ signs in markdown to prevent GitHub LaTeX rendering - Deduplicate terraform-docs: single source of truth via pre-commit hooks injecting into README.md and replacing docs/REFERENCE.md - Remove docs/README.md (replaced by docs/REFERENCE.md) - Move SECURITY.md and WORKFLOWS.md into docs/ - Add WORKFLOWS.md documenting CI/CD workflows and repo rulesets - Pin integration test fixture to us-east-1a (t4g.nano unsupported in us-east-1e) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add a DescribeNetworkInterfaces re-check in attachEIP after allocation to detect when another invocation already attached an EIP, and add an orphan EIP sweep in detachEIP to clean up any leaked allocations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The integration test NATScaleDown phase was failing because findSiblings counted the dying workload as a "sibling" — EventBridge fires the shutting-down event before DescribeInstances reflects the new state, so the workload still appeared as "running" in the API response. Three fixes: 1. findSiblings now accepts an excludeID parameter so the triggering instance is never counted as a sibling. 2. maybeStopNAT retries when siblings ARE found (letting eventual consistency settle) instead of giving up immediately on the first false positive. 3. When classify can't find a terminated instance (already gone from the API), the handler falls back to sweepIdleNATs which checks all running NATs in the VPC and stops any with no siblings. Also uses StopInstances Force=true since NAT instances are stateless forwarders with no filesystem to flush. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The Lambda's cleanupAll (invoked by terraform destroy) called TerminateInstances but returned immediately without waiting. Since the module's ENIs use delete_on_termination=false, they remain attached until the instance fully terminates. If Terraform proceeds to delete ENIs while the instance is still shutting down, it needs ec2:DetachNetworkInterface which users may not have. Now polls DescribeInstances until all terminated instances disappear, guaranteeing ENIs are detached before the Lambda returns and Terraform continues its destroy plan. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Rewrite docs with KISS philosophy: clear decision matrix tables, accurate performance numbers from real CloudWatch data, remove outdated event-driven/reactive language and Python comparison history. Net reduction of ~388 lines. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The cleanup action's waitForTermination can exceed 30s when polling for NAT instance termination during terraform destroy. This caused Sandbox.Timedout errors and left orphaned resources. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The test's Phase 4 (CleanupAction) invoked the cleanup Lambda while the Phase 3 workload was still running. When cleanup terminated the NAT, EventBridge delivered the terminated event, and the reconciler saw workloads=1 and created a new NAT. This zombie NAT then caused terraform destroy's cleanup invocation to timeout waiting for termination. Fix: terminate all test workloads before invoking cleanup, matching the production destroy ordering where Terraform deletes the EventBridge target (stopping new events) before running cleanup. Also increase Lambda timeout to 120s — the cleanup path with waitForTermination genuinely needs >30s (observed 32s in CI). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The previous fix waited for WaitUntilInstanceTerminated (90+ seconds), during which EventBridge drove normal scale-down — releasing the EIP before the test could assert it existed. Now we just wait for workloads to leave pending/running state (a few seconds), which is enough to prevent the reconciler from recreating NATs after cleanup. Removed the pre-cleanup EIP count assertion since it's not what Phase 4 tests — the point is verifying cleanup terminates NATs and releases EIPs, not that they exist beforehand. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

AWS EventBridge rules/targets are eventually consistent — events that fire within seconds of target creation may be silently dropped. The Lambda permission also needs time to propagate. Previously, terraform apply could complete and workloads could launch before EventBridge was ready to deliver events to the Lambda. This caused the first workload's pending/running events to be lost, leaving the NAT uncreated until a later event (like workload termination) happened to trigger reconciliation. Fix: add a 15-second time_sleep after the EventBridge target and Lambda permission are created. Module outputs depend on this sleep, so terraform apply doesn't return until the event pipeline is ready. Also reduce Lambda timeout from 120s to 90s (cleanup path with waitForTermination needs ~35s in practice). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

When a NAT is created, EventBridge fires events immediately but EC2 API queries (both filter-based and by-ID) can return stale data for several seconds. This caused two failure modes: 1. findNATs() returned empty on NAT pending event → Lambda tried to create duplicate NAT → failed with "ENI in use" error 2. NAT showed "pending" in API when actually "running" → Lambda logged "waiting" and returned → EIP never attached Fix: - Pass trigger instance from resolveAZ() to reconcile() - If trigger is a NAT that findNATs() missed, add it to the list - If event says "running" but API says "pending", trust the event Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

When EventBridge fires a "stopped" event for a NAT instance, the EC2 API may still return "stopping" due to eventual consistency. Previously the handler would wait for another event that never comes, leaving the EIP attached indefinitely. Now we trust the event state (like we already do for pending→running), allowing EIP release to proceed immediately when the NAT is truly stopped. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Always override trigger instance state with event state (EC2 API eventual consistency). Simplified from two ad-hoc checks to one centralized override. - Increase EventBridge propagation delay from 15s to 30s to reduce dropped events during initial deployment. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add reconciliation pattern history (control theory → CFEngine → Borg → K8s) - Document EC2 API eventual consistency handling - Document EventBridge propagation delay (60s wait) - Document NAT Force stop behavior and Lambda timeout - Clarify config versioning two-event replacement process - Add IDE directories to .gitignore Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

leonardosul had a problem deploying to integration February 24, 2026 07:38 — with GitHub Actions Failure

leonardosul and others added 2 commits February 24, 2026 17:40

ci: trigger PR checks

9bb7255

leonardosul had a problem deploying to integration February 24, 2026 07:42 — with GitHub Actions Failure

leonardosul added the integration-test Triggers integration tests label Feb 24, 2026

leonardosul had a problem deploying to integration February 24, 2026 07:45 — with GitHub Actions Failure

ci: re-trigger checks

1b50b10

leonardosul added integration-test Triggers integration tests and removed integration-test Triggers integration tests labels Feb 24, 2026

leonardosul had a problem deploying to integration February 24, 2026 07:57 — with GitHub Actions Failure

leonardosul added integration-test Triggers integration tests and removed integration-test Triggers integration tests labels Feb 24, 2026

leonardosul had a problem deploying to integration February 24, 2026 08:25 — with GitHub Actions Failure

leonardosul had a problem deploying to integration February 24, 2026 08:38 — with GitHub Actions Failure

leonardosul had a problem deploying to integration February 24, 2026 09:30 — with GitHub Actions Failure

leonardosul had a problem deploying to integration February 24, 2026 09:44 — with GitHub Actions Failure

leonardosul added integration-test Triggers integration tests and removed integration-test Triggers integration tests labels Feb 24, 2026

leonardosul had a problem deploying to integration February 24, 2026 21:53 — with GitHub Actions Failure

leonardosul added integration-test Triggers integration tests and removed integration-test Triggers integration tests labels Feb 24, 2026

leonardosul had a problem deploying to integration February 24, 2026 22:32 — with GitHub Actions Failure

leonardosul added integration-test Triggers integration tests and removed integration-test Triggers integration tests labels Feb 25, 2026

leonardosul temporarily deployed to integration February 25, 2026 00:44 — with GitHub Actions Inactive

leonardosul had a problem deploying to integration February 25, 2026 19:27 — with GitHub Actions Failure

leonardosul added integration-test Triggers integration tests and removed integration-test Triggers integration tests labels Feb 25, 2026

leonardosul had a problem deploying to integration February 25, 2026 19:43 — with GitHub Actions Failure

leonardosul added integration-test Triggers integration tests and removed integration-test Triggers integration tests labels Feb 25, 2026

leonardosul had a problem deploying to integration February 25, 2026 23:31 — with GitHub Actions Failure

leonardosul added integration-test Triggers integration tests and removed integration-test Triggers integration tests labels Feb 26, 2026

leonardosul had a problem deploying to integration February 26, 2026 00:03 — with GitHub Actions Failure

leonardosul added integration-test Triggers integration tests and removed integration-test Triggers integration tests labels Feb 26, 2026

leonardosul had a problem deploying to integration February 26, 2026 00:09 — with GitHub Actions Failure

leonardosul added integration-test Triggers integration tests and removed integration-test Triggers integration tests labels Feb 26, 2026

leonardosul had a problem deploying to integration February 26, 2026 00:25 — with GitHub Actions Failure

leonardosul added integration-test Triggers integration tests and removed integration-test Triggers integration tests labels Feb 26, 2026

leonardosul and others added 8 commits February 26, 2026 10:52

docs: add missing time_sleep.eventbridge_propagation to terraform-docs

ef4a5da

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

fix: increase EventBridge propagation delay to 60s

83ed7d7

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: scale-to-zero NAT instances for AWS#1

feat: scale-to-zero NAT instances for AWS#1
leonardosul merged 30 commits intomainfrom
feat/init

leonardosul commented Feb 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

leonardosul commented Feb 24, 2026

Summary

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant