From 048cdeaf947e3044d29682e8ced810fc7df0ca67 Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Tue, 24 Feb 2026 17:06:46 +1000 Subject: [PATCH 01/30] feat: nat-zero module, CI, and release automation Terraform module for scale-to-zero NAT instances on AWS. Includes: - Go Lambda for EventBridge-driven NAT lifecycle management - Pre-commit hooks (terraform fmt/tflint/docs, go fmt/vet/staticcheck) - Integration tests against real AWS infrastructure - Release-please for automated versioning and lambda.zip builds - MkDocs documentation site Co-Authored-By: Claude Opus 4.6 --- .github/workflows/docs.yml | 29 + .github/workflows/go-tests.yml | 29 + .github/workflows/integration-tests.yml | 45 ++ .github/workflows/precommit.yml | 37 + .github/workflows/release-please.yml | 58 ++ .gitignore | 24 + .pre-commit-config.yaml | 44 ++ .release-please-manifest.json | 3 + .terraform-docs.yml | 8 + LICENSE | 21 + SECURITY.md | 9 + cmd/lambda/ec2iface.go | 25 + cmd/lambda/ec2ops.go | 579 ++++++++++++++ cmd/lambda/ec2ops_test.go | 909 ++++++++++++++++++++++ cmd/lambda/go.mod | 24 + cmd/lambda/go.sum | 38 + cmd/lambda/handler.go | 139 ++++ cmd/lambda/handler_test.go | 704 +++++++++++++++++ cmd/lambda/main.go | 38 + cmd/lambda/mock_test.go | 229 ++++++ cmd/lambda/perf.go | 15 + docs/ARCHITECTURE.md | 283 +++++++ docs/EXAMPLES.md | 135 ++++ docs/INDEX.md | 194 +++++ docs/PERFORMANCE.md | 155 ++++ docs/README.md | 86 +++ docs/REFERENCE.md | 86 +++ docs/TESTING.md | 169 ++++ eventbridge.tf | 39 + examples/basic/main.tf | 67 ++ iam.tf | 117 +++ lambda.tf | 92 +++ launch_template.tf | 79 ++ mkdocs.yml | 14 + network.tf | 93 +++ outputs.tf | 34 + release-please-config.json | 15 + tests/integration/fixture/main.tf | 95 +++ tests/integration/go.mod | 59 ++ tests/integration/go.sum | 974 ++++++++++++++++++++++++ tests/integration/nat_zero_test.go | 888 +++++++++++++++++++++ variables.tf | 147 ++++ versions.tf | 18 + 43 files changed, 6846 insertions(+) create mode 100644 
.github/workflows/docs.yml create mode 100644 .github/workflows/go-tests.yml create mode 100644 .github/workflows/integration-tests.yml create mode 100644 .github/workflows/precommit.yml create mode 100644 .github/workflows/release-please.yml create mode 100644 .gitignore create mode 100644 .pre-commit-config.yaml create mode 100644 .release-please-manifest.json create mode 100644 .terraform-docs.yml create mode 100644 LICENSE create mode 100644 SECURITY.md create mode 100644 cmd/lambda/ec2iface.go create mode 100644 cmd/lambda/ec2ops.go create mode 100644 cmd/lambda/ec2ops_test.go create mode 100644 cmd/lambda/go.mod create mode 100644 cmd/lambda/go.sum create mode 100644 cmd/lambda/handler.go create mode 100644 cmd/lambda/handler_test.go create mode 100644 cmd/lambda/main.go create mode 100644 cmd/lambda/mock_test.go create mode 100644 cmd/lambda/perf.go create mode 100644 docs/ARCHITECTURE.md create mode 100644 docs/EXAMPLES.md create mode 100644 docs/INDEX.md create mode 100644 docs/PERFORMANCE.md create mode 100644 docs/README.md create mode 100644 docs/REFERENCE.md create mode 100644 docs/TESTING.md create mode 100644 eventbridge.tf create mode 100644 examples/basic/main.tf create mode 100644 iam.tf create mode 100644 lambda.tf create mode 100644 launch_template.tf create mode 100644 mkdocs.yml create mode 100644 network.tf create mode 100644 outputs.tf create mode 100644 release-please-config.json create mode 100644 tests/integration/fixture/main.tf create mode 100644 tests/integration/go.mod create mode 100644 tests/integration/go.sum create mode 100644 tests/integration/nat_zero_test.go create mode 100644 variables.tf create mode 100644 versions.tf diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml new file mode 100644 index 0000000..96b4245 --- /dev/null +++ b/.github/workflows/docs.yml @@ -0,0 +1,29 @@ +name: Docs + +on: + push: + branches: [main] + paths: + - "docs/**" + - "mkdocs.yml" + - "README.md" + - "*.tf" + +permissions: + 
contents: write + +jobs: + deploy: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4 + + - uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5 + with: + python-version: "3.12" + + - name: Install mkdocs-material + run: pip install mkdocs-material + + - name: Deploy docs + run: mkdocs gh-deploy --force diff --git a/.github/workflows/go-tests.yml b/.github/workflows/go-tests.yml new file mode 100644 index 0000000..401ccda --- /dev/null +++ b/.github/workflows/go-tests.yml @@ -0,0 +1,29 @@ +name: Go Tests + +on: + pull_request: + paths: + - "cmd/lambda/**" + push: + branches: [main] + paths: + - "cmd/lambda/**" + +permissions: + contents: read + +jobs: + go-test: + runs-on: ubuntu-latest + defaults: + run: + working-directory: cmd/lambda + steps: + - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4 + + - uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff # v5 + with: + go-version-file: cmd/lambda/go.mod + + - name: Test + run: go test -v -race ./... 
diff --git a/.github/workflows/integration-tests.yml b/.github/workflows/integration-tests.yml new file mode 100644 index 0000000..3d7e9a2 --- /dev/null +++ b/.github/workflows/integration-tests.yml @@ -0,0 +1,45 @@ +name: Integration Tests + +on: + workflow_dispatch: + +concurrency: + group: nat-zero-integration + cancel-in-progress: false + +permissions: + id-token: write + contents: read + +jobs: + integration-test: + runs-on: ubuntu-latest + timeout-minutes: 15 + environment: integration + steps: + - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4 + + - uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff # v5 + with: + go-version-file: cmd/lambda/go.mod + + - uses: hashicorp/setup-terraform@b9cd54a3c349d3f38e8881555d616ced269862dd # v3 + with: + terraform_wrapper: false + + - uses: aws-actions/configure-aws-credentials@7474bc4690e29a8392af63c5b98e7449536d5c3a # v4 + with: + role-to-assume: ${{ secrets.INTEGRATION_ROLE_ARN }} + aws-region: us-east-1 + + - name: Build Lambda binary + working-directory: cmd/lambda + run: | + GOOS=linux GOARCH=arm64 CGO_ENABLED=0 go build -tags lambda.norpc -ldflags='-s -w' -o bootstrap + zip lambda.zip bootstrap + mkdir -p ../../.build + cp lambda.zip ../../.build/lambda.zip + + - name: Test + working-directory: tests/integration + run: go test -v -timeout 10m -count=1 diff --git a/.github/workflows/precommit.yml b/.github/workflows/precommit.yml new file mode 100644 index 0000000..f243198 --- /dev/null +++ b/.github/workflows/precommit.yml @@ -0,0 +1,37 @@ +name: Pre-commit + +on: + pull_request: + push: + branches: [main] + paths: + - "*.tf" + - "cmd/lambda/**" + - ".pre-commit-config.yaml" + - ".terraform-docs.yml" + +permissions: + contents: read + +jobs: + precommit: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4 + + - uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff # v5 + with: + go-version-file: cmd/lambda/go.mod + + 
- uses: hashicorp/setup-terraform@b9cd54a3c349d3f38e8881555d616ced269862dd # v3 + + - name: Install tools + run: | + go install honnef.co/go/tools/cmd/staticcheck@latest + curl -s https://raw.githubusercontent.com/terraform-linters/tflint/master/install_linux.sh | bash + + - uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5 + with: + python-version: "3.12" + + - uses: pre-commit/action@2c7b3805fd2a0fd8c1884dcaebf91fc102a13ecd # v3.0.1 diff --git a/.github/workflows/release-please.yml b/.github/workflows/release-please.yml new file mode 100644 index 0000000..9c35b37 --- /dev/null +++ b/.github/workflows/release-please.yml @@ -0,0 +1,58 @@ +name: Release + +on: + push: + branches: [main] + workflow_dispatch: + +permissions: + contents: write + pull-requests: write + +jobs: + release-please: + runs-on: ubuntu-latest + outputs: + release_created: ${{ steps.release.outputs.release_created }} + tag_name: ${{ steps.release.outputs.tag_name }} + steps: + - uses: googleapis/release-please-action@16a9c90856f42705d54a6fda1823352bdc62cf38 # v4 + id: release + with: + config-file: release-please-config.json + manifest-file: .release-please-manifest.json + + build-lambda: + needs: release-please + if: needs.release-please.outputs.release_created == 'true' + runs-on: ubuntu-latest + defaults: + run: + working-directory: cmd/lambda + steps: + - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4 + + - uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff # v5 + with: + go-version-file: cmd/lambda/go.mod + + - name: Build + run: GOOS=linux GOARCH=arm64 CGO_ENABLED=0 go build -tags lambda.norpc -ldflags='-s -w' -o bootstrap + + - name: Package + run: zip lambda.zip bootstrap + + - name: Upload to versioned release + env: + GH_TOKEN: ${{ secrets.GITHUB_TOKEN }} + run: gh release upload "${{ needs.release-please.outputs.tag_name }}" lambda.zip --clobber + + - name: Update rolling latest release + env: + GH_TOKEN: ${{ 
secrets.GITHUB_TOKEN }} + run: | + gh release create nat-zero-lambda-latest \ + --title "nat-zero Lambda (latest)" \ + --notes "Auto-built Go Lambda binary from ${{ needs.release-please.outputs.tag_name }}" \ + --latest=false 2>/dev/null || true + gh release upload nat-zero-lambda-latest lambda.zip --clobber diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..cc9f64b --- /dev/null +++ b/.gitignore @@ -0,0 +1,24 @@ +# Terraform +.terraform/ +.terraform.lock.hcl +*.tfstate +*.tfstate.backup +*.tfplan + +# Lambda build artifacts +.build/ +cmd/lambda/lambda +cmd/lambda/bootstrap +*.zip + +# Go +vendor/ + +# Test cache +.pytest_cache/ + +# OS +.DS_Store + +# AI +.claude/ diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml new file mode 100644 index 0000000..d0392cb --- /dev/null +++ b/.pre-commit-config.yaml @@ -0,0 +1,44 @@ +repos: + - repo: https://github.com/pre-commit/pre-commit-hooks + rev: v4.3.0 + hooks: + - id: check-yaml + args: ["--unsafe"] + - id: end-of-file-fixer + - id: trailing-whitespace + - id: check-toml + - id: check-json + - repo: https://github.com/TekWizely/pre-commit-golang + rev: v1.0.0-rc.1 + hooks: + - id: go-fmt + name: go fmt + - id: go-vet-repo-mod + name: go vet + - id: go-test-mod + name: go test + exclude: "tests/integration/" + - repo: local + hooks: + - id: go-staticcheck + name: go staticcheck + language: system + entry: bash -c 'export PATH="$HOME/go/bin:$PATH" && cd cmd/lambda && staticcheck ./...' 
+ files: '\.go$' + exclude: "tests/integration/" + pass_filenames: false + - repo: https://github.com/zricethezav/gitleaks + rev: v8.16.4 + hooks: + - id: gitleaks + - repo: https://github.com/antonbabenko/pre-commit-terraform.git + rev: v1.77.0 + hooks: + - id: terraform_fmt + - id: terraform_tflint + - repo: https://github.com/terraform-docs/terraform-docs + rev: "v0.16.0" + hooks: + - id: terraform-docs-go + name: terraform-docs + args: ["-c", ".terraform-docs.yml", "markdown", "table", "--output-file", "docs/README.md", "."] diff --git a/.release-please-manifest.json b/.release-please-manifest.json new file mode 100644 index 0000000..e18ee07 --- /dev/null +++ b/.release-please-manifest.json @@ -0,0 +1,3 @@ +{ + ".": "0.0.0" +} diff --git a/.terraform-docs.yml b/.terraform-docs.yml new file mode 100644 index 0000000..3c016eb --- /dev/null +++ b/.terraform-docs.yml @@ -0,0 +1,8 @@ +formatter: "markdown" + +output: + file: "docs/README.md" + mode: replace + template: |- + {{ .Content }} + {{/** End of file fixer */}} diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..e79c57c --- /dev/null +++ b/LICENSE @@ -0,0 +1,21 @@ +MIT License + +Copyright (c) 2026 MachineDotDev contributors + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. diff --git a/SECURITY.md b/SECURITY.md new file mode 100644 index 0000000..d004b3b --- /dev/null +++ b/SECURITY.md @@ -0,0 +1,9 @@ +# Security Policy + +## Reporting a Vulnerability + +If you discover a security vulnerability in this project, please report it through [GitHub Security Advisories](https://github.com/MachineDotDev/nat-zero/security/advisories/new). + +Do **not** open a public issue for security vulnerabilities. + +We will acknowledge your report within 48 hours and aim to provide a fix within 7 days for critical issues. diff --git a/cmd/lambda/ec2iface.go b/cmd/lambda/ec2iface.go new file mode 100644 index 0000000..0db74c5 --- /dev/null +++ b/cmd/lambda/ec2iface.go @@ -0,0 +1,25 @@ +package main + +import ( + "context" + + "github.com/aws/aws-sdk-go-v2/service/ec2" +) + +// EC2API is the subset of the EC2 client used by this Lambda. 
+type EC2API interface { + DescribeInstances(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) + RunInstances(ctx context.Context, params *ec2.RunInstancesInput, optFns ...func(*ec2.Options)) (*ec2.RunInstancesOutput, error) + StartInstances(ctx context.Context, params *ec2.StartInstancesInput, optFns ...func(*ec2.Options)) (*ec2.StartInstancesOutput, error) + StopInstances(ctx context.Context, params *ec2.StopInstancesInput, optFns ...func(*ec2.Options)) (*ec2.StopInstancesOutput, error) + TerminateInstances(ctx context.Context, params *ec2.TerminateInstancesInput, optFns ...func(*ec2.Options)) (*ec2.TerminateInstancesOutput, error) + AllocateAddress(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) + AssociateAddress(ctx context.Context, params *ec2.AssociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AssociateAddressOutput, error) + DisassociateAddress(ctx context.Context, params *ec2.DisassociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.DisassociateAddressOutput, error) + ReleaseAddress(ctx context.Context, params *ec2.ReleaseAddressInput, optFns ...func(*ec2.Options)) (*ec2.ReleaseAddressOutput, error) + DescribeAddresses(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) + DescribeNetworkInterfaces(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) + DescribeImages(ctx context.Context, params *ec2.DescribeImagesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeImagesOutput, error) + DescribeLaunchTemplates(ctx context.Context, params *ec2.DescribeLaunchTemplatesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplatesOutput, error) + DescribeLaunchTemplateVersions(ctx context.Context, params 
*ec2.DescribeLaunchTemplateVersionsInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplateVersionsOutput, error) +} diff --git a/cmd/lambda/ec2ops.go b/cmd/lambda/ec2ops.go new file mode 100644 index 0000000..2d850f6 --- /dev/null +++ b/cmd/lambda/ec2ops.go @@ -0,0 +1,579 @@ +package main + +import ( + "context" + "errors" + "fmt" + "log" + "sort" + "strings" + "time" + + "github.com/aws/aws-sdk-go-v2/aws" + "github.com/aws/aws-sdk-go-v2/service/ec2" + ec2types "github.com/aws/aws-sdk-go-v2/service/ec2/types" + "github.com/aws/smithy-go" +) + +// Instance is a simplified EC2 instance representation. +type Instance struct { + InstanceID string + StateName string + VpcID string + AZ string + Tags []ec2types.Tag + NetworkInterfaces []ec2types.InstanceNetworkInterface +} + +func instanceFromAPI(i ec2types.Instance) *Instance { + var stateName string + if i.State != nil { + stateName = string(i.State.Name) + } + var az string + if i.Placement != nil { + az = aws.ToString(i.Placement.AvailabilityZone) + } + return &Instance{ + InstanceID: aws.ToString(i.InstanceId), + StateName: stateName, + VpcID: aws.ToString(i.VpcId), + AZ: az, + Tags: i.Tags, + NetworkInterfaces: i.NetworkInterfaces, + } +} + +func hasTag(tags []ec2types.Tag, key, value string) bool { + for _, t := range tags { + if aws.ToString(t.Key) == key && aws.ToString(t.Value) == value { + return true + } + } + return false +} + +func (h *Handler) getInstance(ctx context.Context, instanceID string) *Instance { + resp, err := h.EC2.DescribeInstances(ctx, &ec2.DescribeInstancesInput{ + InstanceIds: []string{instanceID}, + }) + if err != nil || len(resp.Reservations) == 0 || len(resp.Reservations[0].Instances) == 0 { + return nil + } + return instanceFromAPI(resp.Reservations[0].Instances[0]) +} + +func (h *Handler) classify(ctx context.Context, instanceID string) (ignore, isNAT bool, az, vpc string) { + defer timed("classify")() + inst := h.getInstance(ctx, instanceID) + if inst == nil { + return true, 
false, "", "" + } + if inst.VpcID != h.TargetVPC { + return true, false, "", "" + } + if hasTag(inst.Tags, h.IgnoreTagKey, h.IgnoreTagValue) { + return true, false, inst.AZ, inst.VpcID + } + return false, hasTag(inst.Tags, h.NATTagKey, h.NATTagValue), inst.AZ, inst.VpcID +} + +func (h *Handler) waitForState(ctx context.Context, instanceID string, states []string, timeout int) bool { + iterations := timeout / 2 + for i := 0; i < iterations; i++ { + inst := h.getInstance(ctx, instanceID) + if inst == nil { + return false + } + for _, s := range states { + if inst.StateName == s { + return true + } + } + h.sleep(2 * time.Second) + } + log.Printf("Timeout: %s never reached %v", instanceID, states) + return false +} + +// findNAT finds the NAT instance in an AZ. Deduplicates if multiple exist. +func (h *Handler) findNAT(ctx context.Context, az, vpc string) *Instance { + defer timed("find_nat")() + resp, err := h.EC2.DescribeInstances(ctx, &ec2.DescribeInstancesInput{ + Filters: []ec2types.Filter{ + {Name: aws.String("tag:" + h.NATTagKey), Values: []string{h.NATTagValue}}, + {Name: aws.String("availability-zone"), Values: []string{az}}, + {Name: aws.String("vpc-id"), Values: []string{vpc}}, + {Name: aws.String("instance-state-name"), Values: []string{"pending", "running", "stopping", "stopped"}}, + }, + }) + if err != nil { + log.Printf("Error finding NAT: %v", err) + return nil + } + + var nats []*Instance + for _, r := range resp.Reservations { + for _, i := range r.Instances { + nats = append(nats, instanceFromAPI(i)) + } + } + + if len(nats) == 0 { + return nil + } + if len(nats) == 1 { + return nats[0] + } + + // Race condition: multiple NATs. Keep the running one, terminate extras. 
+ log.Printf("%d NAT instances in %s, deduplicating", len(nats), az) + var running []*Instance + for _, n := range nats { + if isStarting(n.StateName) { + running = append(running, n) + } + } + keep := nats[0] + if len(running) > 0 { + keep = running[0] + } + for _, n := range nats { + if n.InstanceID != keep.InstanceID { + log.Printf("Terminating duplicate NAT %s", n.InstanceID) + _, err := h.EC2.TerminateInstances(ctx, &ec2.TerminateInstancesInput{ + InstanceIds: []string{n.InstanceID}, + }) + if err != nil { + log.Printf("Failed to terminate %s: %v", n.InstanceID, err) + } + } + } + return keep +} + +func (h *Handler) findSiblings(ctx context.Context, az, vpc string) []*Instance { + defer timed("find_siblings")() + resp, err := h.EC2.DescribeInstances(ctx, &ec2.DescribeInstancesInput{ + Filters: []ec2types.Filter{ + {Name: aws.String("availability-zone"), Values: []string{az}}, + {Name: aws.String("vpc-id"), Values: []string{vpc}}, + {Name: aws.String("instance-state-name"), Values: []string{"pending", "running"}}, + }, + }) + if err != nil { + log.Printf("Error finding siblings: %v", err) + return nil + } + + var siblings []*Instance + for _, r := range resp.Reservations { + for _, i := range r.Instances { + inst := instanceFromAPI(i) + if !hasTag(inst.Tags, h.NATTagKey, h.NATTagValue) && + !hasTag(inst.Tags, h.IgnoreTagKey, h.IgnoreTagValue) { + siblings = append(siblings, inst) + } + } + } + return siblings +} + +// --- EIP management (EventBridge-driven) --- + +func getPublicENI(inst *Instance) *ec2types.InstanceNetworkInterface { + for i := range inst.NetworkInterfaces { + if aws.ToInt32(inst.NetworkInterfaces[i].Attachment.DeviceIndex) == 0 { + return &inst.NetworkInterfaces[i] + } + } + return nil +} + +// attachEIP waits for the NAT instance to reach "running", then allocates and +// associates an EIP to the public ENI. Idempotent: no-op if ENI already has an EIP. 
+func (h *Handler) attachEIP(ctx context.Context, instanceID, az string) { + defer timed("attach_eip")() + + if !h.waitForState(ctx, instanceID, []string{"running"}, 120) { + return + } + + inst := h.getInstance(ctx, instanceID) + if inst == nil { + return + } + eni := getPublicENI(inst) + if eni == nil { + log.Printf("No public ENI on %s", instanceID) + return + } + + // Idempotent: if ENI already has an EIP, nothing to do. + if eni.Association != nil && aws.ToString(eni.Association.PublicIp) != "" { + log.Printf("ENI %s already has EIP %s", aws.ToString(eni.NetworkInterfaceId), aws.ToString(eni.Association.PublicIp)) + return + } + + eniID := aws.ToString(eni.NetworkInterfaceId) + + alloc, err := h.EC2.AllocateAddress(ctx, &ec2.AllocateAddressInput{ + Domain: ec2types.DomainTypeVpc, + TagSpecifications: []ec2types.TagSpecification{{ + ResourceType: ec2types.ResourceTypeElasticIp, + Tags: []ec2types.Tag{ + {Key: aws.String("AZ"), Value: aws.String(az)}, + {Key: aws.String(h.NATTagKey), Value: aws.String(h.NATTagValue)}, + {Key: aws.String("Name"), Value: aws.String(fmt.Sprintf("nat-eip-%s", az))}, + }, + }}, + }) + if err != nil { + log.Printf("Failed to allocate EIP: %v", err) + return + } + allocID := aws.ToString(alloc.AllocationId) + + _, err = h.EC2.AssociateAddress(ctx, &ec2.AssociateAddressInput{ + AllocationId: aws.String(allocID), + NetworkInterfaceId: aws.String(eniID), + }) + if err != nil { + log.Printf("Failed to associate EIP: %v", err) + h.EC2.ReleaseAddress(ctx, &ec2.ReleaseAddressInput{AllocationId: aws.String(allocID)}) + return + } + log.Printf("Attached EIP %s to %s", aws.ToString(alloc.PublicIp), eniID) +} + +// detachEIP waits for the NAT instance to reach "stopped", then disassociates +// and releases the EIP from the public ENI. Idempotent: no-op if no EIP. 
+func (h *Handler) detachEIP(ctx context.Context, instanceID string) { + defer timed("detach_eip")() + + if !h.waitForState(ctx, instanceID, []string{"stopped"}, 120) { + return + } + + inst := h.getInstance(ctx, instanceID) + if inst == nil { + return + } + eni := getPublicENI(inst) + if eni == nil { + return + } + eniID := aws.ToString(eni.NetworkInterfaceId) + + niResp, err := h.EC2.DescribeNetworkInterfaces(ctx, &ec2.DescribeNetworkInterfacesInput{ + NetworkInterfaceIds: []string{eniID}, + }) + if err != nil { + log.Printf("Failed to describe ENI %s: %v", eniID, err) + return + } + if len(niResp.NetworkInterfaces) == 0 { + return + } + ni := niResp.NetworkInterfaces[0] + if ni.Association == nil || aws.ToString(ni.Association.AssociationId) == "" { + return + } + + assocID := aws.ToString(ni.Association.AssociationId) + allocID := aws.ToString(ni.Association.AllocationId) + publicIP := aws.ToString(ni.Association.PublicIp) + + _, err = h.EC2.DisassociateAddress(ctx, &ec2.DisassociateAddressInput{ + AssociationId: aws.String(assocID), + }) + if err != nil { + if isErrCode(err, "InvalidAssociationID.NotFound") { + log.Printf("EIP already disassociated from %s", eniID) + } else { + log.Printf("Failed to detach EIP from %s: %v", eniID, err) + return + } + } + _, err = h.EC2.ReleaseAddress(ctx, &ec2.ReleaseAddressInput{ + AllocationId: aws.String(allocID), + }) + if err != nil { + log.Printf("Failed to release EIP %s: %v", allocID, err) + return + } + log.Printf("Released EIP %s from %s", publicIP, eniID) +} + +// --- Config version --- + +func (h *Handler) isCurrentConfig(inst *Instance) bool { + if h.ConfigVersion == "" { + return true + } + for _, t := range inst.Tags { + if aws.ToString(t.Key) == "ConfigVersion" { + return aws.ToString(t.Value) == h.ConfigVersion + } + } + return true // no tag to compare — assume current +} + +func (h *Handler) replaceNAT(ctx context.Context, inst *Instance, az, vpc string) string { + defer timed("replace_nat")() + iid := 
inst.InstanceID + var eniIDs []string + for _, eni := range inst.NetworkInterfaces { + eniIDs = append(eniIDs, aws.ToString(eni.NetworkInterfaceId)) + } + + log.Printf("Replacing outdated NAT %s in %s", iid, az) + h.EC2.TerminateInstances(ctx, &ec2.TerminateInstancesInput{ + InstanceIds: []string{iid}, + }) + + // Wait for termination using polling. + h.waitForTermination(ctx, iid) + + // Wait for ENIs to become available. + if len(eniIDs) > 0 { + for i := 0; i < 60; i++ { + niResp, err := h.EC2.DescribeNetworkInterfaces(ctx, &ec2.DescribeNetworkInterfacesInput{ + NetworkInterfaceIds: eniIDs, + }) + if err == nil { + allAvailable := true + for _, ni := range niResp.NetworkInterfaces { + if ni.Status != ec2types.NetworkInterfaceStatusAvailable { + allAvailable = false + break + } + } + if allAvailable { + break + } + } + h.sleep(2 * time.Second) + } + } + + return h.createNAT(ctx, az, vpc) +} + +func (h *Handler) waitForTermination(ctx context.Context, instanceID string) { + for i := 0; i < 100; i++ { + inst := h.getInstance(ctx, instanceID) + if inst == nil || inst.StateName == "terminated" { + return + } + h.sleep(2 * time.Second) + } + log.Printf("Timeout waiting for %s to terminate", instanceID) +} + +// --- NAT lifecycle helpers --- + +func (h *Handler) resolveAMI(ctx context.Context) string { + defer timed("resolve_ami")() + resp, err := h.EC2.DescribeImages(ctx, &ec2.DescribeImagesInput{ + Owners: []string{h.AMIOwner}, + Filters: []ec2types.Filter{ + {Name: aws.String("name"), Values: []string{h.AMIPattern}}, + {Name: aws.String("state"), Values: []string{"available"}}, + }, + }) + if err != nil { + log.Printf("AMI lookup failed, using launch template default: %v", err) + return "" + } + if len(resp.Images) == 0 { + return "" + } + + // Pick the latest by CreationDate. 
+ images := resp.Images + sort.Slice(images, func(i, j int) bool { + return aws.ToString(images[i].CreationDate) > aws.ToString(images[j].CreationDate) + }) + ami := images[0] + amiID := aws.ToString(ami.ImageId) + log.Printf("Using AMI %s (%s)", amiID, aws.ToString(ami.Name)) + return amiID +} + +func (h *Handler) resolveLT(ctx context.Context, az, vpc string) (string, int64) { + defer timed("resolve_lt")() + resp, err := h.EC2.DescribeLaunchTemplates(ctx, &ec2.DescribeLaunchTemplatesInput{ + Filters: []ec2types.Filter{ + {Name: aws.String("tag:AvailabilityZone"), Values: []string{az}}, + {Name: aws.String("tag:VpcId"), Values: []string{vpc}}, + }, + }) + if err != nil || len(resp.LaunchTemplates) == 0 { + return "", 0 + } + + ltID := aws.ToString(resp.LaunchTemplates[0].LaunchTemplateId) + + verResp, err := h.EC2.DescribeLaunchTemplateVersions(ctx, &ec2.DescribeLaunchTemplateVersionsInput{ + LaunchTemplateId: aws.String(ltID), + Versions: []string{"$Latest"}, + }) + if err != nil || len(verResp.LaunchTemplateVersions) == 0 { + return "", 0 + } + + version := aws.ToInt64(verResp.LaunchTemplateVersions[0].VersionNumber) + return ltID, version +} + +func (h *Handler) createNAT(ctx context.Context, az, vpc string) string { + defer timed("create_nat")() + + ltID, version := h.resolveLT(ctx, az, vpc) + if ltID == "" { + log.Printf("No launch template for AZ=%s VPC=%s", az, vpc) + return "" + } + + amiID := h.resolveAMI(ctx) + + input := &ec2.RunInstancesInput{ + LaunchTemplate: &ec2types.LaunchTemplateSpecification{ + LaunchTemplateId: aws.String(ltID), + Version: aws.String(fmt.Sprintf("%d", version)), + }, + MinCount: aws.Int32(1), + MaxCount: aws.Int32(1), + } + + if h.ConfigVersion != "" { + input.TagSpecifications = []ec2types.TagSpecification{{ + ResourceType: ec2types.ResourceTypeInstance, + Tags: []ec2types.Tag{ + {Key: aws.String("ConfigVersion"), Value: aws.String(h.ConfigVersion)}, + }, + }} + } + + if amiID != "" { + input.ImageId = aws.String(amiID) + } + 
+ resp, err := h.EC2.RunInstances(ctx, input) + if err != nil { + log.Printf("Failed to create NAT instance: %v", err) + return "" + } + iid := aws.ToString(resp.Instances[0].InstanceId) + log.Printf("Created NAT instance %s in %s", iid, az) + return iid +} + +func (h *Handler) startNAT(ctx context.Context, inst *Instance, az string) { + defer timed("start_nat")() + iid := inst.InstanceID + if !h.waitForState(ctx, iid, []string{"stopped"}, 90) { + return + } + _, err := h.EC2.StartInstances(ctx, &ec2.StartInstancesInput{ + InstanceIds: []string{iid}, + }) + if err != nil { + log.Printf("Failed to start NAT %s: %v", iid, err) + return + } + log.Printf("Started NAT %s", iid) +} + +func (h *Handler) stopNAT(ctx context.Context, inst *Instance) { + defer timed("stop_nat")() + iid := inst.InstanceID + _, err := h.EC2.StopInstances(ctx, &ec2.StopInstancesInput{ + InstanceIds: []string{iid}, + }) + if err != nil { + log.Printf("Failed to stop NAT %s: %v", iid, err) + return + } + log.Printf("Stopped NAT %s", iid) +} + +// --- Cleanup (destroy-time) --- + +func (h *Handler) cleanupAll(ctx context.Context) { + defer timed("cleanup_all")() + + resp, err := h.EC2.DescribeInstances(ctx, &ec2.DescribeInstancesInput{ + Filters: []ec2types.Filter{ + {Name: aws.String("tag:" + h.NATTagKey), Values: []string{h.NATTagValue}}, + {Name: aws.String("vpc-id"), Values: []string{h.TargetVPC}}, + {Name: aws.String("instance-state-name"), Values: []string{"pending", "running", "stopping", "stopped"}}, + }, + }) + if err != nil { + log.Printf("Error listing NAT instances: %v", err) + return + } + + var instanceIDs []string + for _, r := range resp.Reservations { + for _, i := range r.Instances { + instanceIDs = append(instanceIDs, aws.ToString(i.InstanceId)) + } + } + + if len(instanceIDs) > 0 { + log.Printf("Terminating NAT instances: %v", instanceIDs) + h.EC2.TerminateInstances(ctx, &ec2.TerminateInstancesInput{ + InstanceIds: instanceIDs, + }) + } + + // Release EIPs while instances are 
terminating (overlap the wait). + addrResp, err := h.EC2.DescribeAddresses(ctx, &ec2.DescribeAddressesInput{ + Filters: []ec2types.Filter{ + {Name: aws.String("tag:" + h.NATTagKey), Values: []string{h.NATTagValue}}, + }, + }) + if err == nil { + for _, addr := range addrResp.Addresses { + allocID := aws.ToString(addr.AllocationId) + if addr.AssociationId != nil { + _, err := h.EC2.DisassociateAddress(ctx, &ec2.DisassociateAddressInput{ + AssociationId: addr.AssociationId, + }) + if err != nil { + log.Printf("Failed to disassociate EIP %s: %v", allocID, err) + } + } + _, err := h.EC2.ReleaseAddress(ctx, &ec2.ReleaseAddressInput{ + AllocationId: aws.String(allocID), + }) + if err != nil { + log.Printf("Failed to release EIP %s: %v", allocID, err) + } else { + log.Printf("Released EIP %s", allocID) + } + } + } + + // Wait for instance termination. + if len(instanceIDs) > 0 { + for _, iid := range instanceIDs { + h.waitForTermination(ctx, iid) + } + log.Println("All NAT instances terminated") + } +} + +// isErrCode returns true if the error (or any wrapped error) has the given +// AWS API error code. Works with both smithy APIError and legacy awserr. +func isErrCode(err error, code string) bool { + var ae smithy.APIError + if ok := errors.As(err, &ae); ok { + return ae.ErrorCode() == code + } + // Fallback: check the error string for SDKs that don't implement APIError. 
+ return strings.Contains(err.Error(), code) +} diff --git a/cmd/lambda/ec2ops_test.go b/cmd/lambda/ec2ops_test.go new file mode 100644 index 0000000..ec8787e --- /dev/null +++ b/cmd/lambda/ec2ops_test.go @@ -0,0 +1,909 @@ +package main + +import ( + "context" + "fmt" + "sync/atomic" + "testing" + + "github.com/aws/aws-sdk-go-v2/aws" + "github.com/aws/aws-sdk-go-v2/service/ec2" + ec2types "github.com/aws/aws-sdk-go-v2/service/ec2/types" +) + +// --- classify() --- + +func TestClassify(t *testing.T) { + t.Run("instance not found", func(t *testing.T) { + mock := &mockEC2{} + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(), nil + } + h := newTestHandler(mock) + ignore, isNAT, az, vpc := h.classify(context.Background(), "i-gone") + if !ignore || isNAT || az != "" || vpc != "" { + t.Errorf("expected (true, false, '', ''), got (%v, %v, %q, %q)", ignore, isNAT, az, vpc) + } + }) + + t.Run("wrong VPC", func(t *testing.T) { + mock := &mockEC2{} + inst := makeTestInstance("i-other", "running", "vpc-other", testAZ, nil, nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(inst), nil + } + h := newTestHandler(mock) + ignore, isNAT, _, _ := h.classify(context.Background(), "i-other") + if !ignore || isNAT { + t.Errorf("expected (true, false), got (%v, %v)", ignore, isNAT) + } + }) + + t.Run("ignore tag", func(t *testing.T) { + mock := &mockEC2{} + inst := makeTestInstance("i-ign", "running", testVPC, testAZ, + []ec2types.Tag{{Key: aws.String("nat-zero:ignore"), Value: aws.String("true")}}, nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(inst), nil + } + h := 
newTestHandler(mock) + ignore, isNAT, az, vpc := h.classify(context.Background(), "i-ign") + if !ignore || isNAT || az != testAZ || vpc != testVPC { + t.Errorf("expected (true, false, %q, %q), got (%v, %v, %q, %q)", testAZ, testVPC, ignore, isNAT, az, vpc) + } + }) + + t.Run("NAT instance", func(t *testing.T) { + mock := &mockEC2{} + inst := makeTestInstance("i-nat", "running", testVPC, testAZ, + []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}}, nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(inst), nil + } + h := newTestHandler(mock) + ignore, isNAT, az, vpc := h.classify(context.Background(), "i-nat") + if ignore || !isNAT || az != testAZ || vpc != testVPC { + t.Errorf("expected (false, true, %q, %q), got (%v, %v, %q, %q)", testAZ, testVPC, ignore, isNAT, az, vpc) + } + }) + + t.Run("normal workload", func(t *testing.T) { + mock := &mockEC2{} + inst := makeTestInstance("i-work", "running", testVPC, testAZ, + []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}}, nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(inst), nil + } + h := newTestHandler(mock) + ignore, isNAT, az, vpc := h.classify(context.Background(), "i-work") + if ignore || isNAT || az != testAZ || vpc != testVPC { + t.Errorf("expected (false, false, %q, %q), got (%v, %v, %q, %q)", testAZ, testVPC, ignore, isNAT, az, vpc) + } + }) +} + +// --- waitForState() --- + +func TestWaitForState(t *testing.T) { + t.Run("already in desired state", func(t *testing.T) { + mock := &mockEC2{} + inst := makeTestInstance("i-1", "running", testVPC, testAZ, nil, nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) 
(*ec2.DescribeInstancesOutput, error) { + return describeResponse(inst), nil + } + h := newTestHandler(mock) + if !h.waitForState(context.Background(), "i-1", []string{"running"}, 10) { + t.Error("expected true") + } + }) + + t.Run("transitions to desired state", func(t *testing.T) { + mock := &mockEC2{} + var idx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + i := atomic.AddInt32(&idx, 1) + if i == 1 { + return describeResponse(makeTestInstance("i-1", "pending", testVPC, testAZ, nil, nil)), nil + } + return describeResponse(makeTestInstance("i-1", "running", testVPC, testAZ, nil, nil)), nil + } + h := newTestHandler(mock) + if !h.waitForState(context.Background(), "i-1", []string{"running"}, 10) { + t.Error("expected true") + } + }) + + t.Run("timeout", func(t *testing.T) { + mock := &mockEC2{} + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(makeTestInstance("i-1", "pending", testVPC, testAZ, nil, nil)), nil + } + h := newTestHandler(mock) + if h.waitForState(context.Background(), "i-1", []string{"running"}, 10) { + t.Error("expected false (timeout)") + } + }) + + t.Run("instance disappears", func(t *testing.T) { + mock := &mockEC2{} + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(), nil + } + h := newTestHandler(mock) + if h.waitForState(context.Background(), "i-gone", []string{"running"}, 10) { + t.Error("expected false") + } + }) +} + +// --- findNAT() --- + +func TestFindNAT(t *testing.T) { + t.Run("no NATs", func(t *testing.T) { + mock := &mockEC2{} + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns 
...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(), nil + } + h := newTestHandler(mock) + if h.findNAT(context.Background(), testAZ, testVPC) != nil { + t.Error("expected nil") + } + }) + + t.Run("single NAT", func(t *testing.T) { + mock := &mockEC2{} + nat := makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(nat), nil + } + h := newTestHandler(mock) + result := h.findNAT(context.Background(), testAZ, testVPC) + if result == nil || result.InstanceID != "i-nat1" { + t.Errorf("expected i-nat1, got %v", result) + } + }) + + t.Run("deduplicates keeps running", func(t *testing.T) { + mock := &mockEC2{} + running := makeTestInstance("i-run", "running", testVPC, testAZ, nil, nil) + stopped1 := makeTestInstance("i-stop1", "stopped", testVPC, testAZ, nil, nil) + stopped2 := makeTestInstance("i-stop2", "stopped", testVPC, testAZ, nil, nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(running, stopped1, stopped2), nil + } + h := newTestHandler(mock) + result := h.findNAT(context.Background(), testAZ, testVPC) + if result == nil || result.InstanceID != "i-run" { + t.Errorf("expected i-run, got %v", result) + } + if mock.callCount("TerminateInstances") != 2 { + t.Errorf("expected 2 TerminateInstances calls, got %d", mock.callCount("TerminateInstances")) + } + }) + + t.Run("deduplicates no running keeps first", func(t *testing.T) { + mock := &mockEC2{} + s1 := makeTestInstance("i-s1", "stopped", testVPC, testAZ, nil, nil) + s2 := makeTestInstance("i-s2", "stopped", testVPC, testAZ, nil, nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns 
...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(s1, s2), nil + } + h := newTestHandler(mock) + result := h.findNAT(context.Background(), testAZ, testVPC) + if result == nil || result.InstanceID != "i-s1" { + t.Errorf("expected i-s1, got %v", result) + } + if mock.callCount("TerminateInstances") != 1 { + t.Errorf("expected 1 TerminateInstances call, got %d", mock.callCount("TerminateInstances")) + } + }) + + t.Run("deduplication handles terminate failure", func(t *testing.T) { + mock := &mockEC2{} + running := makeTestInstance("i-run", "running", testVPC, testAZ, nil, nil) + extra := makeTestInstance("i-extra", "stopped", testVPC, testAZ, nil, nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(running, extra), nil + } + mock.TerminateInstancesFn = func(ctx context.Context, params *ec2.TerminateInstancesInput, optFns ...func(*ec2.Options)) (*ec2.TerminateInstancesOutput, error) { + return nil, fmt.Errorf("UnauthorizedOperation: Not allowed") + } + h := newTestHandler(mock) + result := h.findNAT(context.Background(), testAZ, testVPC) + if result == nil || result.InstanceID != "i-run" { + t.Errorf("expected i-run despite terminate failure, got %v", result) + } + }) +} + +// --- findSiblings() --- + +func TestFindSiblings(t *testing.T) { + t.Run("no siblings", func(t *testing.T) { + mock := &mockEC2{} + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(), nil + } + h := newTestHandler(mock) + sibs := h.findSiblings(context.Background(), testAZ, testVPC) + if len(sibs) != 0 { + t.Errorf("expected 0 siblings, got %d", len(sibs)) + } + }) + + t.Run("returns workload instances", func(t *testing.T) { + mock := &mockEC2{} + work := makeTestInstance("i-work", 
"running", testVPC, testAZ, + []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("api")}}, nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(work), nil + } + h := newTestHandler(mock) + sibs := h.findSiblings(context.Background(), testAZ, testVPC) + if len(sibs) != 1 || sibs[0].InstanceID != "i-work" { + t.Errorf("expected [i-work], got %v", sibs) + } + }) + + t.Run("excludes NAT and ignored", func(t *testing.T) { + mock := &mockEC2{} + work := makeTestInstance("i-work", "running", testVPC, testAZ, + []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("api")}}, nil) + nat := makeTestInstance("i-nat", "running", testVPC, testAZ, + []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}}, nil) + ignored := makeTestInstance("i-ign", "running", testVPC, testAZ, + []ec2types.Tag{{Key: aws.String("nat-zero:ignore"), Value: aws.String("true")}}, nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(work, nat, ignored), nil + } + h := newTestHandler(mock) + sibs := h.findSiblings(context.Background(), testAZ, testVPC) + if len(sibs) != 1 || sibs[0].InstanceID != "i-work" { + t.Errorf("expected [i-work], got %v", sibs) + } + }) +} + +// --- attachEIP() --- + +func TestAttachEIP(t *testing.T) { + t.Run("happy path", func(t *testing.T) { + mock := &mockEC2{} + eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil + } + mock.AllocateAddressFn = func(ctx context.Context, params 
*ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { + return &ec2.AllocateAddressOutput{AllocationId: aws.String("eipalloc-1"), PublicIp: aws.String("1.2.3.4")}, nil + } + mock.AssociateAddressFn = func(ctx context.Context, params *ec2.AssociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AssociateAddressOutput, error) { + return &ec2.AssociateAddressOutput{}, nil + } + h := newTestHandler(mock) + h.attachEIP(context.Background(), "i-nat1", testAZ) + if mock.callCount("AllocateAddress") != 1 { + t.Error("expected AllocateAddress") + } + if mock.callCount("AssociateAddress") != 1 { + t.Error("expected AssociateAddress") + } + }) + + t.Run("already has EIP is noop", func(t *testing.T) { + mock := &mockEC2{} + assoc := &ec2types.InstanceNetworkInterfaceAssociation{PublicIp: aws.String("5.6.7.8")} + eni := makeENI("eni-pub1", 0, "10.0.1.10", assoc) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil + } + h := newTestHandler(mock) + h.attachEIP(context.Background(), "i-nat1", testAZ) + if mock.callCount("AllocateAddress") != 0 { + t.Error("expected no AllocateAddress when ENI already has EIP") + } + }) + + t.Run("allocation fails", func(t *testing.T) { + mock := &mockEC2{} + eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil + } + mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { + return nil, 
fmt.Errorf("AddressLimitExceeded: Too many EIPs") + } + h := newTestHandler(mock) + h.attachEIP(context.Background(), "i-nat1", testAZ) + if mock.callCount("AssociateAddress") != 0 { + t.Error("expected AssociateAddress NOT to be called") + } + }) + + t.Run("association fails releases EIP", func(t *testing.T) { + mock := &mockEC2{} + eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil + } + mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { + return &ec2.AllocateAddressOutput{AllocationId: aws.String("eipalloc-1"), PublicIp: aws.String("1.2.3.4")}, nil + } + mock.AssociateAddressFn = func(ctx context.Context, params *ec2.AssociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AssociateAddressOutput, error) { + return nil, fmt.Errorf("InvalidParameterValue: Bad param") + } + h := newTestHandler(mock) + h.attachEIP(context.Background(), "i-nat1", testAZ) + if mock.callCount("ReleaseAddress") != 1 { + t.Errorf("expected 1 ReleaseAddress call, got %d", mock.callCount("ReleaseAddress")) + } + }) + + t.Run("no public ENI", func(t *testing.T) { + mock := &mockEC2{} + // Instance with no ENIs + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, nil)), nil + } + h := newTestHandler(mock) + h.attachEIP(context.Background(), "i-nat1", testAZ) + if mock.callCount("AllocateAddress") != 0 { + t.Error("expected no AllocateAddress when no public ENI") + } + }) +} + +// --- detachEIP() --- + +func 
TestDetachEIP(t *testing.T) { + t.Run("happy path", func(t *testing.T) { + mock := &mockEC2{} + eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(makeTestInstance("i-nat1", "stopped", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil + } + mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { + return &ec2.DescribeNetworkInterfacesOutput{ + NetworkInterfaces: []ec2types.NetworkInterface{{ + NetworkInterfaceId: aws.String("eni-pub1"), + Association: &ec2types.NetworkInterfaceAssociation{ + AssociationId: aws.String("eipassoc-1"), + AllocationId: aws.String("eipalloc-1"), + PublicIp: aws.String("1.2.3.4"), + }, + }}, + }, nil + } + h := newTestHandler(mock) + h.detachEIP(context.Background(), "i-nat1") + if mock.callCount("DisassociateAddress") != 1 { + t.Error("expected DisassociateAddress") + } + if mock.callCount("ReleaseAddress") != 1 { + t.Error("expected ReleaseAddress") + } + }) + + t.Run("no association is noop", func(t *testing.T) { + mock := &mockEC2{} + eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(makeTestInstance("i-nat1", "stopped", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil + } + mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { + return &ec2.DescribeNetworkInterfacesOutput{ + NetworkInterfaces: []ec2types.NetworkInterface{{ + NetworkInterfaceId: aws.String("eni-pub1"), + }}, + }, 
nil + } + h := newTestHandler(mock) + h.detachEIP(context.Background(), "i-nat1") + if mock.callCount("DisassociateAddress") != 0 { + t.Error("expected DisassociateAddress NOT to be called") + } + }) +} + +// --- createNAT() --- + +func TestCreateNAT(t *testing.T) { + setupLTAndAMI := func(mock *mockEC2) { + mock.DescribeLaunchTemplatesFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplatesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplatesOutput, error) { + return &ec2.DescribeLaunchTemplatesOutput{ + LaunchTemplates: []ec2types.LaunchTemplate{{LaunchTemplateId: aws.String("lt-123")}}, + }, nil + } + mock.DescribeLaunchTemplateVersionsFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplateVersionsInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplateVersionsOutput, error) { + return &ec2.DescribeLaunchTemplateVersionsOutput{ + LaunchTemplateVersions: []ec2types.LaunchTemplateVersion{{ + LaunchTemplateId: aws.String("lt-123"), VersionNumber: aws.Int64(1), + }}, + }, nil + } + } + + t.Run("happy path without inline EIP", func(t *testing.T) { + mock := &mockEC2{} + setupLTAndAMI(mock) + mock.DescribeImagesFn = func(ctx context.Context, params *ec2.DescribeImagesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeImagesOutput, error) { + return &ec2.DescribeImagesOutput{ + Images: []ec2types.Image{{ + ImageId: aws.String("ami-fcknat"), + Name: aws.String("fck-nat-al2023-1.0-arm64-20240101"), + CreationDate: aws.String("2024-01-01T00:00:00.000Z"), + }}, + }, nil + } + mock.RunInstancesFn = func(ctx context.Context, params *ec2.RunInstancesInput, optFns ...func(*ec2.Options)) (*ec2.RunInstancesOutput, error) { + return &ec2.RunInstancesOutput{ + Instances: []ec2types.Instance{{InstanceId: aws.String("i-new1")}}, + }, nil + } + h := newTestHandler(mock) + result := h.createNAT(context.Background(), testAZ, testVPC) + if result != "i-new1" { + t.Errorf("expected i-new1, got %s", result) + } + // No inline EIP — that's handled 
by EventBridge now + if mock.callCount("AllocateAddress") != 0 { + t.Error("expected no AllocateAddress (EIP managed via EventBridge)") + } + }) + + t.Run("no launch template", func(t *testing.T) { + mock := &mockEC2{} + mock.DescribeLaunchTemplatesFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplatesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplatesOutput, error) { + return &ec2.DescribeLaunchTemplatesOutput{LaunchTemplates: []ec2types.LaunchTemplate{}}, nil + } + h := newTestHandler(mock) + result := h.createNAT(context.Background(), testAZ, testVPC) + if result != "" { + t.Errorf("expected empty, got %s", result) + } + }) + + t.Run("AMI lookup fails uses template default", func(t *testing.T) { + mock := &mockEC2{} + setupLTAndAMI(mock) + mock.DescribeImagesFn = func(ctx context.Context, params *ec2.DescribeImagesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeImagesOutput, error) { + return nil, fmt.Errorf("InvalidParameterValue: Bad filter") + } + mock.RunInstancesFn = func(ctx context.Context, params *ec2.RunInstancesInput, optFns ...func(*ec2.Options)) (*ec2.RunInstancesOutput, error) { + if params.ImageId != nil { + t.Error("expected no ImageId when AMI lookup fails") + } + return &ec2.RunInstancesOutput{ + Instances: []ec2types.Instance{{InstanceId: aws.String("i-new2")}}, + }, nil + } + h := newTestHandler(mock) + result := h.createNAT(context.Background(), testAZ, testVPC) + if result != "i-new2" { + t.Errorf("expected i-new2, got %s", result) + } + }) + + t.Run("no images found uses template default", func(t *testing.T) { + mock := &mockEC2{} + setupLTAndAMI(mock) + mock.DescribeImagesFn = func(ctx context.Context, params *ec2.DescribeImagesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeImagesOutput, error) { + return &ec2.DescribeImagesOutput{Images: []ec2types.Image{}}, nil + } + mock.RunInstancesFn = func(ctx context.Context, params *ec2.RunInstancesInput, optFns ...func(*ec2.Options)) (*ec2.RunInstancesOutput, 
error) { + return &ec2.RunInstancesOutput{ + Instances: []ec2types.Instance{{InstanceId: aws.String("i-new3")}}, + }, nil + } + h := newTestHandler(mock) + result := h.createNAT(context.Background(), testAZ, testVPC) + if result != "i-new3" { + t.Errorf("expected i-new3, got %s", result) + } + }) + + t.Run("run instances fails", func(t *testing.T) { + mock := &mockEC2{} + setupLTAndAMI(mock) + mock.DescribeImagesFn = func(ctx context.Context, params *ec2.DescribeImagesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeImagesOutput, error) { + return &ec2.DescribeImagesOutput{Images: []ec2types.Image{}}, nil + } + mock.RunInstancesFn = func(ctx context.Context, params *ec2.RunInstancesInput, optFns ...func(*ec2.Options)) (*ec2.RunInstancesOutput, error) { + return nil, fmt.Errorf("InsufficientInstanceCapacity: No capacity") + } + h := newTestHandler(mock) + result := h.createNAT(context.Background(), testAZ, testVPC) + if result != "" { + t.Errorf("expected empty, got %s", result) + } + }) +} + +// --- startNAT() --- + +func TestStartNAT(t *testing.T) { + t.Run("happy path", func(t *testing.T) { + mock := &mockEC2{} + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(makeTestInstance("i-nat1", "stopped", testVPC, testAZ, nil, nil)), nil + } + mock.StartInstancesFn = func(ctx context.Context, params *ec2.StartInstancesInput, optFns ...func(*ec2.Options)) (*ec2.StartInstancesOutput, error) { + return &ec2.StartInstancesOutput{}, nil + } + h := newTestHandler(mock) + h.startNAT(context.Background(), &Instance{InstanceID: "i-nat1"}, testAZ) + if mock.callCount("StartInstances") != 1 { + t.Error("expected StartInstances to be called") + } + // No inline EIP — that's handled by EventBridge now + if mock.callCount("AllocateAddress") != 0 { + t.Error("expected no AllocateAddress (EIP managed via EventBridge)") + } + }) + + t.Run("wait 
timeout", func(t *testing.T) { + mock := &mockEC2{} + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(makeTestInstance("i-nat1", "stopping", testVPC, testAZ, nil, nil)), nil + } + h := newTestHandler(mock) + h.startNAT(context.Background(), &Instance{InstanceID: "i-nat1"}, testAZ) + if mock.callCount("StartInstances") != 0 { + t.Error("expected StartInstances NOT to be called after timeout") + } + }) +} + +// --- stopNAT() --- + +func TestStopNAT(t *testing.T) { + t.Run("happy path just stops", func(t *testing.T) { + mock := &mockEC2{} + h := newTestHandler(mock) + h.stopNAT(context.Background(), &Instance{InstanceID: "i-nat1"}) + if mock.callCount("StopInstances") != 1 { + t.Error("expected StopInstances to be called") + } + // No inline EIP release — that's handled by EventBridge now + if mock.callCount("DisassociateAddress") != 0 { + t.Error("expected no DisassociateAddress (EIP managed via EventBridge)") + } + }) + + t.Run("stop fails", func(t *testing.T) { + mock := &mockEC2{} + mock.StopInstancesFn = func(ctx context.Context, params *ec2.StopInstancesInput, optFns ...func(*ec2.Options)) (*ec2.StopInstancesOutput, error) { + return nil, fmt.Errorf("IncorrectInstanceState: Already stopping") + } + h := newTestHandler(mock) + h.stopNAT(context.Background(), &Instance{InstanceID: "i-nat1"}) + }) +} + +// --- cleanupAll() --- + +func TestCleanupAll(t *testing.T) { + t.Run("terminates instances and releases EIPs", func(t *testing.T) { + mock := &mockEC2{} + nat1 := makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, nil) + nat2 := makeTestInstance("i-nat2", "stopped", testVPC, testAZ, nil, nil) + var describeIdx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&describeIdx, 1) + if 
idx == 1 { + return describeResponse(nat1, nat2), nil + } + return describeResponse(makeTestInstance("i-nat1", "terminated", testVPC, testAZ, nil, nil)), nil + } + mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + return &ec2.DescribeAddressesOutput{ + Addresses: []ec2types.Address{{ + AllocationId: aws.String("eipalloc-1"), + AssociationId: aws.String("eipassoc-1"), + }}, + }, nil + } + h := newTestHandler(mock) + h.cleanupAll(context.Background()) + if mock.callCount("TerminateInstances") != 1 { + t.Errorf("expected 1 TerminateInstances call, got %d", mock.callCount("TerminateInstances")) + } + if mock.callCount("DisassociateAddress") != 1 { + t.Error("expected DisassociateAddress to be called") + } + if mock.callCount("ReleaseAddress") != 1 { + t.Error("expected ReleaseAddress to be called") + } + }) + + t.Run("no instances still cleans EIPs", func(t *testing.T) { + mock := &mockEC2{} + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(), nil + } + mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + return &ec2.DescribeAddressesOutput{ + Addresses: []ec2types.Address{{AllocationId: aws.String("eipalloc-1")}}, + }, nil + } + h := newTestHandler(mock) + h.cleanupAll(context.Background()) + if mock.callCount("ReleaseAddress") != 1 { + t.Error("expected ReleaseAddress to be called") + } + }) + + t.Run("no instances no EIPs", func(t *testing.T) { + mock := &mockEC2{} + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(), nil + } + mock.DescribeAddressesFn = func(ctx context.Context, 
params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + return &ec2.DescribeAddressesOutput{Addresses: []ec2types.Address{}}, nil + } + h := newTestHandler(mock) + h.cleanupAll(context.Background()) + if mock.callCount("TerminateInstances") != 0 { + t.Error("expected no TerminateInstances calls") + } + }) + + t.Run("EIP release failure continues", func(t *testing.T) { + mock := &mockEC2{} + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(), nil + } + mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + return &ec2.DescribeAddressesOutput{ + Addresses: []ec2types.Address{ + {AllocationId: aws.String("eipalloc-1")}, + {AllocationId: aws.String("eipalloc-2")}, + }, + }, nil + } + var releaseIdx int32 + mock.ReleaseAddressFn = func(ctx context.Context, params *ec2.ReleaseAddressInput, optFns ...func(*ec2.Options)) (*ec2.ReleaseAddressOutput, error) { + idx := atomic.AddInt32(&releaseIdx, 1) + if idx == 1 { + return nil, fmt.Errorf("InvalidAddress.NotFound: Not found") + } + return &ec2.ReleaseAddressOutput{}, nil + } + h := newTestHandler(mock) + h.cleanupAll(context.Background()) + if mock.callCount("ReleaseAddress") != 2 { + t.Errorf("expected 2 ReleaseAddress calls, got %d", mock.callCount("ReleaseAddress")) + } + }) +} + +// --- isCurrentConfig() --- + +func TestIsCurrentConfig(t *testing.T) { + t.Run("matching config", func(t *testing.T) { + h := newTestHandler(nil) + h.ConfigVersion = "abc123" + inst := &Instance{Tags: []ec2types.Tag{{Key: aws.String("ConfigVersion"), Value: aws.String("abc123")}}} + if !h.isCurrentConfig(inst) { + t.Error("expected true") + } + }) + + t.Run("mismatched config", func(t *testing.T) { + h := newTestHandler(nil) + h.ConfigVersion = "abc123" 
+ inst := &Instance{Tags: []ec2types.Tag{{Key: aws.String("ConfigVersion"), Value: aws.String("old456")}}} + if h.isCurrentConfig(inst) { + t.Error("expected false") + } + }) + + t.Run("no tag assumes current", func(t *testing.T) { + h := newTestHandler(nil) + h.ConfigVersion = "abc123" + inst := &Instance{Tags: []ec2types.Tag{{Key: aws.String("Name"), Value: aws.String("nat")}}} + if !h.isCurrentConfig(inst) { + t.Error("expected true — missing tag means nothing to compare") + } + }) + + t.Run("no tags at all assumes current", func(t *testing.T) { + h := newTestHandler(nil) + h.ConfigVersion = "abc123" + inst := &Instance{Tags: []ec2types.Tag{}} + if !h.isCurrentConfig(inst) { + t.Error("expected true — missing tag means nothing to compare") + } + }) + + t.Run("empty config version skips check", func(t *testing.T) { + h := newTestHandler(nil) + h.ConfigVersion = "" + inst := &Instance{Tags: []ec2types.Tag{}} + if !h.isCurrentConfig(inst) { + t.Error("expected true") + } + }) +} + +// --- replaceNAT() --- + +func TestReplaceNAT(t *testing.T) { + setupLTAndAMI := func(mock *mockEC2) { + mock.DescribeLaunchTemplatesFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplatesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplatesOutput, error) { + return &ec2.DescribeLaunchTemplatesOutput{ + LaunchTemplates: []ec2types.LaunchTemplate{{LaunchTemplateId: aws.String("lt-123")}}, + }, nil + } + mock.DescribeLaunchTemplateVersionsFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplateVersionsInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplateVersionsOutput, error) { + return &ec2.DescribeLaunchTemplateVersionsOutput{ + LaunchTemplateVersions: []ec2types.LaunchTemplateVersion{{ + LaunchTemplateId: aws.String("lt-123"), VersionNumber: aws.Int64(1), + }}, + }, nil + } + mock.DescribeImagesFn = func(ctx context.Context, params *ec2.DescribeImagesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeImagesOutput, error) { + return 
&ec2.DescribeImagesOutput{Images: []ec2types.Image{}}, nil + } + mock.RunInstancesFn = func(ctx context.Context, params *ec2.RunInstancesInput, optFns ...func(*ec2.Options)) (*ec2.RunInstancesOutput, error) { + return &ec2.RunInstancesOutput{ + Instances: []ec2types.Instance{{InstanceId: aws.String("i-new")}}, + }, nil + } + } + + t.Run("happy path", func(t *testing.T) { + mock := &mockEC2{} + setupLTAndAMI(mock) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(makeTestInstance("i-old", "terminated", testVPC, testAZ, nil, nil)), nil + } + mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { + return &ec2.DescribeNetworkInterfacesOutput{ + NetworkInterfaces: []ec2types.NetworkInterface{{ + NetworkInterfaceId: aws.String("eni-1"), + Status: ec2types.NetworkInterfaceStatusAvailable, + }}, + }, nil + } + h := newTestHandler(mock) + eni := makeENI("eni-1", 0, "10.0.1.10", nil) + inst := &Instance{ + InstanceID: "i-old", + StateName: "running", + NetworkInterfaces: []ec2types.InstanceNetworkInterface{eni}, + } + result := h.replaceNAT(context.Background(), inst, testAZ, testVPC) + if result != "i-new" { + t.Errorf("expected i-new, got %s", result) + } + if mock.callCount("TerminateInstances") != 1 { + t.Error("expected TerminateInstances to be called") + } + }) + + t.Run("ENI wait polls until available", func(t *testing.T) { + mock := &mockEC2{} + setupLTAndAMI(mock) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(makeTestInstance("i-old", "terminated", testVPC, testAZ, nil, nil)), nil + } + var niIdx int32 + mock.DescribeNetworkInterfacesFn = func(ctx context.Context, 
params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { + idx := atomic.AddInt32(&niIdx, 1) + if idx == 1 { + return &ec2.DescribeNetworkInterfacesOutput{ + NetworkInterfaces: []ec2types.NetworkInterface{ + {NetworkInterfaceId: aws.String("eni-1"), Status: ec2types.NetworkInterfaceStatusInUse}, + {NetworkInterfaceId: aws.String("eni-2"), Status: ec2types.NetworkInterfaceStatusInUse}, + }, + }, nil + } + return &ec2.DescribeNetworkInterfacesOutput{ + NetworkInterfaces: []ec2types.NetworkInterface{ + {NetworkInterfaceId: aws.String("eni-1"), Status: ec2types.NetworkInterfaceStatusAvailable}, + {NetworkInterfaceId: aws.String("eni-2"), Status: ec2types.NetworkInterfaceStatusAvailable}, + }, + }, nil + } + h := newTestHandler(mock) + eni1 := makeENI("eni-1", 0, "10.0.1.10", nil) + eni2 := makeENI("eni-2", 1, "10.0.1.11", nil) + inst := &Instance{ + InstanceID: "i-old", + StateName: "running", + NetworkInterfaces: []ec2types.InstanceNetworkInterface{eni1, eni2}, + } + result := h.replaceNAT(context.Background(), inst, testAZ, testVPC) + if result != "i-new" { + t.Errorf("expected i-new, got %s", result) + } + if mock.callCount("DescribeNetworkInterfaces") != 2 { + t.Errorf("expected 2 DescribeNetworkInterfaces calls, got %d", mock.callCount("DescribeNetworkInterfaces")) + } + }) + + t.Run("no ENIs skips wait", func(t *testing.T) { + mock := &mockEC2{} + setupLTAndAMI(mock) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(makeTestInstance("i-old", "terminated", testVPC, testAZ, nil, nil)), nil + } + h := newTestHandler(mock) + inst := &Instance{ + InstanceID: "i-old", + StateName: "running", + NetworkInterfaces: nil, + } + result := h.replaceNAT(context.Background(), inst, testAZ, testVPC) + if result != "i-new" { + t.Errorf("expected i-new, got %s", result) + } + if 
mock.callCount("DescribeNetworkInterfaces") != 0 { + t.Error("expected no DescribeNetworkInterfaces calls") + } + }) +} + +// --- createNAT() config tag --- + +func TestCreateNATConfigTag(t *testing.T) { + setupLTAndAMI := func(mock *mockEC2) { + mock.DescribeLaunchTemplatesFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplatesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplatesOutput, error) { + return &ec2.DescribeLaunchTemplatesOutput{ + LaunchTemplates: []ec2types.LaunchTemplate{{LaunchTemplateId: aws.String("lt-123")}}, + }, nil + } + mock.DescribeLaunchTemplateVersionsFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplateVersionsInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplateVersionsOutput, error) { + return &ec2.DescribeLaunchTemplateVersionsOutput{ + LaunchTemplateVersions: []ec2types.LaunchTemplateVersion{{ + LaunchTemplateId: aws.String("lt-123"), VersionNumber: aws.Int64(1), + }}, + }, nil + } + mock.DescribeImagesFn = func(ctx context.Context, params *ec2.DescribeImagesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeImagesOutput, error) { + return &ec2.DescribeImagesOutput{Images: []ec2types.Image{}}, nil + } + } + + t.Run("includes config version tag", func(t *testing.T) { + mock := &mockEC2{} + setupLTAndAMI(mock) + mock.RunInstancesFn = func(ctx context.Context, params *ec2.RunInstancesInput, optFns ...func(*ec2.Options)) (*ec2.RunInstancesOutput, error) { + if len(params.TagSpecifications) == 0 { + t.Error("expected TagSpecifications") + } else { + found := false + for _, tag := range params.TagSpecifications[0].Tags { + if aws.ToString(tag.Key) == "ConfigVersion" && aws.ToString(tag.Value) == "abc123" { + found = true + } + } + if !found { + t.Error("expected ConfigVersion tag") + } + } + return &ec2.RunInstancesOutput{ + Instances: []ec2types.Instance{{InstanceId: aws.String("i-tagged")}}, + }, nil + } + h := newTestHandler(mock) + h.ConfigVersion = "abc123" + 
h.createNAT(context.Background(), testAZ, testVPC) + }) + + t.Run("no tag when config version empty", func(t *testing.T) { + mock := &mockEC2{} + setupLTAndAMI(mock) + mock.RunInstancesFn = func(ctx context.Context, params *ec2.RunInstancesInput, optFns ...func(*ec2.Options)) (*ec2.RunInstancesOutput, error) { + if len(params.TagSpecifications) != 0 { + t.Error("expected no TagSpecifications") + } + return &ec2.RunInstancesOutput{ + Instances: []ec2types.Instance{{InstanceId: aws.String("i-notag")}}, + }, nil + } + h := newTestHandler(mock) + h.ConfigVersion = "" + h.createNAT(context.Background(), testAZ, testVPC) + }) +} diff --git a/cmd/lambda/go.mod b/cmd/lambda/go.mod new file mode 100644 index 0000000..4134c72 --- /dev/null +++ b/cmd/lambda/go.mod @@ -0,0 +1,24 @@ +module github.com/MachineDotDev/nat-zero/cmd/lambda + +go 1.22 + +require ( + github.com/aws/aws-lambda-go v1.47.0 + github.com/aws/aws-sdk-go-v2 v1.32.7 + github.com/aws/aws-sdk-go-v2/config v1.28.7 + github.com/aws/aws-sdk-go-v2/service/ec2 v1.198.1 +) + +require ( + github.com/aws/aws-sdk-go-v2/credentials v1.17.48 // indirect + github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.16.22 // indirect + github.com/aws/aws-sdk-go-v2/internal/configsources v1.3.26 // indirect + github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.6.26 // indirect + github.com/aws/aws-sdk-go-v2/internal/ini v1.8.1 // indirect + github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.12.1 // indirect + github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.12.7 // indirect + github.com/aws/aws-sdk-go-v2/service/sso v1.24.8 // indirect + github.com/aws/aws-sdk-go-v2/service/ssooidc v1.28.7 // indirect + github.com/aws/aws-sdk-go-v2/service/sts v1.33.3 // indirect + github.com/aws/smithy-go v1.22.1 // indirect +) diff --git a/cmd/lambda/go.sum b/cmd/lambda/go.sum new file mode 100644 index 0000000..9d78818 --- /dev/null +++ b/cmd/lambda/go.sum @@ -0,0 +1,38 @@ +github.com/aws/aws-lambda-go v1.47.0 
h1:0H8s0vumYx/YKs4sE7YM0ktwL2eWse+kfopsRI1sXVI= +github.com/aws/aws-lambda-go v1.47.0/go.mod h1:dpMpZgvWx5vuQJfBt0zqBha60q7Dd7RfgJv23DymV8A= +github.com/aws/aws-sdk-go-v2 v1.32.7 h1:ky5o35oENWi0JYWUZkB7WYvVPP+bcRF5/Iq7JWSb5Rw= +github.com/aws/aws-sdk-go-v2 v1.32.7/go.mod h1:P5WJBrYqqbWVaOxgH0X/FYYD47/nooaPOZPlQdmiN2U= +github.com/aws/aws-sdk-go-v2/config v1.28.7 h1:GduUnoTXlhkgnxTD93g1nv4tVPILbdNQOzav+Wpg7AE= +github.com/aws/aws-sdk-go-v2/config v1.28.7/go.mod h1:vZGX6GVkIE8uECSUHB6MWAUsd4ZcG2Yq/dMa4refR3M= +github.com/aws/aws-sdk-go-v2/credentials v1.17.48 h1:IYdLD1qTJ0zanRavulofmqut4afs45mOWEI+MzZtTfQ= +github.com/aws/aws-sdk-go-v2/credentials v1.17.48/go.mod h1:tOscxHN3CGmuX9idQ3+qbkzrjVIx32lqDSU1/0d/qXs= +github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.16.22 h1:kqOrpojG71DxJm/KDPO+Z/y1phm1JlC8/iT+5XRmAn8= +github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.16.22/go.mod h1:NtSFajXVVL8TA2QNngagVZmUtXciyrHOt7xgz4faS/M= +github.com/aws/aws-sdk-go-v2/internal/configsources v1.3.26 h1:I/5wmGMffY4happ8NOCuIUEWGUvvFp5NSeQcXl9RHcI= +github.com/aws/aws-sdk-go-v2/internal/configsources v1.3.26/go.mod h1:FR8f4turZtNy6baO0KJ5FJUmXH/cSkI9fOngs0yl6mA= +github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.6.26 h1:zXFLuEuMMUOvEARXFUVJdfqZ4bvvSgdGRq/ATcrQxzM= +github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.6.26/go.mod h1:3o2Wpy0bogG1kyOPrgkXA8pgIfEEv0+m19O9D5+W8y8= +github.com/aws/aws-sdk-go-v2/internal/ini v1.8.1 h1:VaRN3TlFdd6KxX1x3ILT5ynH6HvKgqdiXoTxAF4HQcQ= +github.com/aws/aws-sdk-go-v2/internal/ini v1.8.1/go.mod h1:FbtygfRFze9usAadmnGJNc8KsP346kEe+y2/oyhGAGc= +github.com/aws/aws-sdk-go-v2/service/ec2 v1.198.1 h1:YbNopxjd9baM83YEEmkaYHi+NuJt0AszeaSLqo0CVr0= +github.com/aws/aws-sdk-go-v2/service/ec2 v1.198.1/go.mod h1:mwr3iRm8u1+kkEx4ftDM2Q6Yr0XQFBKrP036ng+k5Lk= +github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.12.1 h1:iXtILhvDxB6kPvEXgsDhGaZCSC6LQET5ZHSdJozeI0Y= +github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.12.1/go.mod 
h1:9nu0fVANtYiAePIBh2/pFUSwtJ402hLnp854CNoDOeE= +github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.12.7 h1:8eUsivBQzZHqe/3FE+cqwfH+0p5Jo8PFM/QYQSmeZ+M= +github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.12.7/go.mod h1:kLPQvGUmxn/fqiCrDeohwG33bq2pQpGeY62yRO6Nrh0= +github.com/aws/aws-sdk-go-v2/service/sso v1.24.8 h1:CvuUmnXI7ebaUAhbJcDy9YQx8wHR69eZ9I7q5hszt/g= +github.com/aws/aws-sdk-go-v2/service/sso v1.24.8/go.mod h1:XDeGv1opzwm8ubxddF0cgqkZWsyOtw4lr6dxwmb6YQg= +github.com/aws/aws-sdk-go-v2/service/ssooidc v1.28.7 h1:F2rBfNAL5UyswqoeWv9zs74N/NanhK16ydHW1pahX6E= +github.com/aws/aws-sdk-go-v2/service/ssooidc v1.28.7/go.mod h1:JfyQ0g2JG8+Krq0EuZNnRwX0mU0HrwY/tG6JNfcqh4k= +github.com/aws/aws-sdk-go-v2/service/sts v1.33.3 h1:Xgv/hyNgvLda/M9l9qxXc4UFSgppnRczLxlMs5Ae/QY= +github.com/aws/aws-sdk-go-v2/service/sts v1.33.3/go.mod h1:5Gn+d+VaaRgsjewpMvGazt0WfcFO+Md4wLOuBfGR9Bc= +github.com/aws/smithy-go v1.22.1 h1:/HPHZQ0g7f4eUeK6HKglFz8uwVfZKgoI25rb/J+dnro= +github.com/aws/smithy-go v1.22.1/go.mod h1:irrKGvNn1InZwb2d7fkIRNucdfwR8R+Ts3wxYa/cJHg= +github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c= +github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= +github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM= +github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4= +github.com/stretchr/testify v1.7.2 h1:4jaiDzPyXQvSd7D0EjG45355tLlV3VOECpq10pLC+8s= +github.com/stretchr/testify v1.7.2/go.mod h1:R6va5+xMeoiuVRoj+gSkQ7d3FALtqAAGI1FQKckRals= +gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA= +gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= diff --git a/cmd/lambda/handler.go b/cmd/lambda/handler.go new file mode 100644 index 0000000..1312405 --- /dev/null +++ b/cmd/lambda/handler.go @@ -0,0 +1,139 @@ +package main + +import ( + "context" + "log" + "time" +) + +// Event is the 
Lambda input payload. +type Event struct { + Action string `json:"action,omitempty"` + InstanceID string `json:"instance_id,omitempty"` + State string `json:"state,omitempty"` +} + +// Handler holds the EC2 client and configuration for the Lambda. +type Handler struct { + EC2 EC2API + NATTagKey string + NATTagValue string + IgnoreTagKey string + IgnoreTagValue string + TargetVPC string + AMIOwner string + AMIPattern string + ConfigVersion string + + // SleepFunc can be replaced in tests to eliminate real waits. + SleepFunc func(time.Duration) +} + +// HandleRequest is the Lambda entry point. +func (h *Handler) HandleRequest(ctx context.Context, event Event) error { + defer timed("handler total")() + return h.handle(ctx, event) +} + +func (h *Handler) handle(ctx context.Context, event Event) error { + if event.Action == "cleanup" { + log.Println("Running destroy-time cleanup") + h.cleanupAll(ctx) + return nil + } + + iid, state := event.InstanceID, event.State + log.Printf("instance=%s state=%s", iid, state) + + ignore, isNAT, az, vpc := h.classify(ctx, iid) + if ignore { + return nil + } + + // NAT events → manage EIP via EventBridge + if isNAT { + if isStarting(state) { + h.attachEIP(ctx, iid, az) + } else if isStopping(state) { + h.detachEIP(ctx, iid) + } + return nil + } + + // Workload events → manage NAT lifecycle + nat := h.findNAT(ctx, az, vpc) + + if isStarting(state) { + h.ensureNAT(ctx, nat, az, vpc) + return nil + } + + if isStopping(state) || isTerminating(state) { + h.maybeStopNAT(ctx, nat, az, vpc) + } + return nil +} + +// ensureNAT ensures a NAT instance is running in the given AZ. 
+func (h *Handler) ensureNAT(ctx context.Context, nat *Instance, az, vpc string) {
+	if nat == nil || isTerminating(nat.StateName) {
+		if nat != nil {
+			log.Printf("NAT %s is terminating or terminated, creating replacement", nat.InstanceID)
+		} else {
+			log.Printf("Creating NAT in %s", az)
+		}
+		h.createNAT(ctx, az, vpc)
+		return
+	}
+	if !h.isCurrentConfig(nat) {
+		log.Printf("NAT %s has outdated config, replacing", nat.InstanceID)
+		h.replaceNAT(ctx, nat, az, vpc)
+		return
+	}
+	if isStopping(nat.StateName) {
+		log.Printf("Starting NAT %s", nat.InstanceID)
+		h.startNAT(ctx, nat, az)
+	}
+}
+
+// maybeStopNAT stops the NAT if no sibling workloads remain.
+func (h *Handler) maybeStopNAT(ctx context.Context, nat *Instance, az, vpc string) {
+	if nat == nil {
+		return
+	}
+	// Re-check siblings a few times so concurrent scale-down events can settle.
+	for attempt := 0; attempt < 3; attempt++ {
+		if len(h.findSiblings(ctx, az, vpc)) > 0 {
+			log.Printf("Siblings still running in %s, keeping NAT", az)
+			return
+		}
+		if attempt < 2 {
+			h.sleep(2 * time.Second)
+		}
+	}
+
+	if isStarting(nat.StateName) {
+		log.Printf("No siblings, stopping NAT %s", nat.InstanceID)
+		h.stopNAT(ctx, nat)
+	}
+}
+
+func (h *Handler) sleep(d time.Duration) {
+	if h.SleepFunc != nil {
+		h.SleepFunc(d)
+		return
+	}
+	time.Sleep(d)
+}
+
+func isStarting(state string) bool {
+	return state == "pending" || state == "running"
+}
+
+func isStopping(state string) bool {
+	return state == "stopping" || state == "stopped"
+}
+
+func isTerminating(state string) bool {
+	return state == "shutting-down" || state == "terminated"
+}
diff --git a/cmd/lambda/handler_test.go b/cmd/lambda/handler_test.go
new file mode 100644
index 0000000..f82ba8a
--- /dev/null
+++ b/cmd/lambda/handler_test.go
@@ -0,0 +1,704 @@
+package main
+
+import (
+	"context"
+	"sync/atomic"
+	"testing"
+
+	"github.com/aws/aws-sdk-go-v2/aws"
+	"github.com/aws/aws-sdk-go-v2/service/ec2"
+	ec2types "github.com/aws/aws-sdk-go-v2/service/ec2/types"
+)
+
+// --- Cleanup action ---
+
+func 
TestHandlerCleanup(t *testing.T) { + t.Run("cleanup action calls cleanupAll", func(t *testing.T) { + mock := &mockEC2{} + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(), nil + } + mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + return &ec2.DescribeAddressesOutput{Addresses: []ec2types.Address{}}, nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{Action: "cleanup"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("DescribeInstances") == 0 { + t.Error("expected DescribeInstances to be called during cleanup") + } + }) + + t.Run("cleanup action ignores other fields", func(t *testing.T) { + mock := &mockEC2{} + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(), nil + } + mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + return &ec2.DescribeAddressesOutput{Addresses: []ec2types.Address{}}, nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{ + Action: "cleanup", InstanceID: "i-1", State: "running", + }) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + }) +} + +// --- Ignored instances --- + +func TestHandlerIgnored(t *testing.T) { + t.Run("ignored instance returns early", func(t *testing.T) { + mock := &mockEC2{} + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(), nil + } + h := newTestHandler(mock) + err := 
h.HandleRequest(context.Background(), Event{InstanceID: "i-skip", State: "running"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("DescribeInstances") != 1 { + t.Errorf("expected 1 DescribeInstances call (classify), got %d", mock.callCount("DescribeInstances")) + } + }) +} + +// --- NAT instance events (EventBridge-driven EIP management) --- + +func TestHandlerNatEvents(t *testing.T) { + natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} + + t.Run("running NAT triggers attachEIP", func(t *testing.T) { + mock := &mockEC2{} + eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) + natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, []ec2types.InstanceNetworkInterface{eni}) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(natInst), nil + } + mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { + return &ec2.AllocateAddressOutput{AllocationId: aws.String("eipalloc-1"), PublicIp: aws.String("1.2.3.4")}, nil + } + mock.AssociateAddressFn = func(ctx context.Context, params *ec2.AssociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AssociateAddressOutput, error) { + return &ec2.AssociateAddressOutput{}, nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-nat1", State: "running"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("AllocateAddress") != 1 { + t.Error("expected AllocateAddress for NAT running event") + } + if mock.callCount("AssociateAddress") != 1 { + t.Error("expected AssociateAddress for NAT running event") + } + }) + + t.Run("pending NAT triggers attachEIP", func(t *testing.T) { + mock := &mockEC2{} + eni := makeENI("eni-pub1", 0, 
"10.0.1.10", nil) + // classify returns pending NAT, then waitForState polls until running + var describeCount int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&describeCount, 1) + if idx == 1 { + // classify + return describeResponse(makeTestInstance("i-nat1", "pending", testVPC, testAZ, natTags, nil)), nil + } + // waitForState + getInstance — return running + return describeResponse(makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, []ec2types.InstanceNetworkInterface{eni})), nil + } + mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { + return &ec2.AllocateAddressOutput{AllocationId: aws.String("eipalloc-1"), PublicIp: aws.String("1.2.3.4")}, nil + } + mock.AssociateAddressFn = func(ctx context.Context, params *ec2.AssociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AssociateAddressOutput, error) { + return &ec2.AssociateAddressOutput{}, nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-nat1", State: "pending"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("AllocateAddress") != 1 { + t.Error("expected AllocateAddress for NAT pending event") + } + }) + + t.Run("running NAT with existing EIP is noop", func(t *testing.T) { + mock := &mockEC2{} + assoc := &ec2types.InstanceNetworkInterfaceAssociation{PublicIp: aws.String("5.6.7.8")} + eni := makeENI("eni-pub1", 0, "10.0.1.10", assoc) + natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, []ec2types.InstanceNetworkInterface{eni}) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(natInst), nil + } + h := 
newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-nat1", State: "running"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("AllocateAddress") != 0 { + t.Error("expected no AllocateAddress when ENI already has EIP") + } + }) + + t.Run("stopped NAT triggers detachEIP", func(t *testing.T) { + mock := &mockEC2{} + eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) + natInst := makeTestInstance("i-nat1", "stopped", testVPC, testAZ, natTags, []ec2types.InstanceNetworkInterface{eni}) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(natInst), nil + } + mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { + return &ec2.DescribeNetworkInterfacesOutput{ + NetworkInterfaces: []ec2types.NetworkInterface{{ + NetworkInterfaceId: aws.String("eni-pub1"), + Association: &ec2types.NetworkInterfaceAssociation{ + AssociationId: aws.String("eipassoc-1"), + AllocationId: aws.String("eipalloc-1"), + PublicIp: aws.String("1.2.3.4"), + }, + }}, + }, nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-nat1", State: "stopped"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("DisassociateAddress") != 1 { + t.Error("expected DisassociateAddress for NAT stopped event") + } + if mock.callCount("ReleaseAddress") != 1 { + t.Error("expected ReleaseAddress for NAT stopped event") + } + }) + + t.Run("terminated NAT is noop", func(t *testing.T) { + mock := &mockEC2{} + natInst := makeTestInstance("i-nat1", "terminated", testVPC, testAZ, natTags, nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) 
(*ec2.DescribeInstancesOutput, error) { + return describeResponse(natInst), nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-nat1", State: "terminated"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("AllocateAddress") != 0 || mock.callCount("DisassociateAddress") != 0 { + t.Error("expected no EIP operations for terminated NAT") + } + }) +} + +// --- Workload scale-up --- + +func TestHandlerWorkloadScaleUp(t *testing.T) { + workTags := []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}} + + t.Run("no NAT creates one", func(t *testing.T) { + mock := &mockEC2{} + workInst := makeTestInstance("i-work1", "pending", testVPC, testAZ, workTags, nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + if len(params.InstanceIds) > 0 && params.InstanceIds[0] == "i-work1" { + return describeResponse(workInst), nil + } + return describeResponse(), nil + } + mock.DescribeLaunchTemplatesFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplatesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplatesOutput, error) { + return &ec2.DescribeLaunchTemplatesOutput{ + LaunchTemplates: []ec2types.LaunchTemplate{{LaunchTemplateId: aws.String("lt-123")}}, + }, nil + } + mock.DescribeLaunchTemplateVersionsFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplateVersionsInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplateVersionsOutput, error) { + return &ec2.DescribeLaunchTemplateVersionsOutput{ + LaunchTemplateVersions: []ec2types.LaunchTemplateVersion{{ + LaunchTemplateId: aws.String("lt-123"), VersionNumber: aws.Int64(1), + }}, + }, nil + } + mock.DescribeImagesFn = func(ctx context.Context, params *ec2.DescribeImagesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeImagesOutput, error) { + return &ec2.DescribeImagesOutput{Images: 
[]ec2types.Image{}}, nil + } + mock.RunInstancesFn = func(ctx context.Context, params *ec2.RunInstancesInput, optFns ...func(*ec2.Options)) (*ec2.RunInstancesOutput, error) { + return &ec2.RunInstancesOutput{ + Instances: []ec2types.Instance{{InstanceId: aws.String("i-new1")}}, + }, nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "pending"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("RunInstances") != 1 { + t.Error("expected RunInstances to be called") + } + // EIP is NOT managed inline anymore — no AllocateAddress expected + if mock.callCount("AllocateAddress") != 0 { + t.Error("expected no AllocateAddress (EIP managed via EventBridge)") + } + }) + + t.Run("stopped NAT starts it", func(t *testing.T) { + mock := &mockEC2{} + workInst := makeTestInstance("i-work1", "pending", testVPC, testAZ, workTags, nil) + natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} + natInst := makeTestInstance("i-nat1", "stopped", testVPC, testAZ, natTags, nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + if len(params.InstanceIds) > 0 && params.InstanceIds[0] == "i-work1" { + return describeResponse(workInst), nil + } + if params.Filters != nil { + return describeResponse(natInst), nil + } + // waitForState for NAT → return stopped + return describeResponse(natInst), nil + } + mock.StartInstancesFn = func(ctx context.Context, params *ec2.StartInstancesInput, optFns ...func(*ec2.Options)) (*ec2.StartInstancesOutput, error) { + return &ec2.StartInstancesOutput{}, nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "pending"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("StartInstances") != 1 { + t.Error("expected 
StartInstances to be called") + } + // EIP is NOT managed inline anymore + if mock.callCount("AllocateAddress") != 0 { + t.Error("expected no AllocateAddress (EIP managed via EventBridge)") + } + }) + + t.Run("running NAT is noop", func(t *testing.T) { + mock := &mockEC2{} + workInst := makeTestInstance("i-work1", "pending", testVPC, testAZ, workTags, nil) + natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} + natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil) + var callIdx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&callIdx, 1) + if idx == 1 { + return describeResponse(workInst), nil + } + return describeResponse(natInst), nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "pending"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("RunInstances") != 0 { + t.Error("expected RunInstances NOT to be called") + } + if mock.callCount("StartInstances") != 0 { + t.Error("expected StartInstances NOT to be called") + } + }) + + t.Run("terminated NAT creates new", func(t *testing.T) { + mock := &mockEC2{} + workInst := makeTestInstance("i-work1", "running", testVPC, testAZ, workTags, nil) + natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} + natInst := makeTestInstance("i-nat1", "terminated", testVPC, testAZ, natTags, nil) + var callIdx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&callIdx, 1) + if idx == 1 { + return describeResponse(workInst), nil + } + return describeResponse(natInst), nil + } + mock.DescribeLaunchTemplatesFn = func(ctx context.Context, params 
*ec2.DescribeLaunchTemplatesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplatesOutput, error) { + return &ec2.DescribeLaunchTemplatesOutput{ + LaunchTemplates: []ec2types.LaunchTemplate{{LaunchTemplateId: aws.String("lt-123")}}, + }, nil + } + mock.DescribeLaunchTemplateVersionsFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplateVersionsInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplateVersionsOutput, error) { + return &ec2.DescribeLaunchTemplateVersionsOutput{ + LaunchTemplateVersions: []ec2types.LaunchTemplateVersion{{ + LaunchTemplateId: aws.String("lt-123"), VersionNumber: aws.Int64(1), + }}, + }, nil + } + mock.DescribeImagesFn = func(ctx context.Context, params *ec2.DescribeImagesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeImagesOutput, error) { + return &ec2.DescribeImagesOutput{Images: []ec2types.Image{}}, nil + } + mock.RunInstancesFn = func(ctx context.Context, params *ec2.RunInstancesInput, optFns ...func(*ec2.Options)) (*ec2.RunInstancesOutput, error) { + return &ec2.RunInstancesOutput{ + Instances: []ec2types.Instance{{InstanceId: aws.String("i-new")}}, + }, nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "running"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("RunInstances") != 1 { + t.Error("expected RunInstances to be called") + } + }) + + t.Run("shutting-down NAT creates new", func(t *testing.T) { + mock := &mockEC2{} + workInst := makeTestInstance("i-work1", "pending", testVPC, testAZ, workTags, nil) + natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} + natInst := makeTestInstance("i-nat1", "shutting-down", testVPC, testAZ, natTags, nil) + var callIdx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&callIdx, 
1) + if idx == 1 { + return describeResponse(workInst), nil + } + return describeResponse(natInst), nil + } + mock.DescribeLaunchTemplatesFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplatesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplatesOutput, error) { + return &ec2.DescribeLaunchTemplatesOutput{ + LaunchTemplates: []ec2types.LaunchTemplate{{LaunchTemplateId: aws.String("lt-123")}}, + }, nil + } + mock.DescribeLaunchTemplateVersionsFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplateVersionsInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplateVersionsOutput, error) { + return &ec2.DescribeLaunchTemplateVersionsOutput{ + LaunchTemplateVersions: []ec2types.LaunchTemplateVersion{{ + LaunchTemplateId: aws.String("lt-123"), VersionNumber: aws.Int64(1), + }}, + }, nil + } + mock.DescribeImagesFn = func(ctx context.Context, params *ec2.DescribeImagesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeImagesOutput, error) { + return &ec2.DescribeImagesOutput{Images: []ec2types.Image{}}, nil + } + mock.RunInstancesFn = func(ctx context.Context, params *ec2.RunInstancesInput, optFns ...func(*ec2.Options)) (*ec2.RunInstancesOutput, error) { + return &ec2.RunInstancesOutput{ + Instances: []ec2types.Instance{{InstanceId: aws.String("i-new")}}, + }, nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "pending"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("RunInstances") != 1 { + t.Error("expected RunInstances to be called") + } + }) +} + +// --- Workload scale-down --- + +func TestHandlerWorkloadScaleDown(t *testing.T) { + workTags := []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}} + natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} + + t.Run("no NAT returns early", func(t *testing.T) { + mock := &mockEC2{} + workInst := makeTestInstance("i-work1", 
"stopping", testVPC, testAZ, workTags, nil) + var callIdx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&callIdx, 1) + if idx == 1 { + return describeResponse(workInst), nil + } + return describeResponse(), nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "stopping"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("StopInstances") != 0 { + t.Error("expected StopInstances NOT to be called") + } + }) + + t.Run("siblings exist keeps NAT", func(t *testing.T) { + mock := &mockEC2{} + workInst := makeTestInstance("i-work1", "stopping", testVPC, testAZ, workTags, nil) + natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil) + sibInst := makeTestInstance("i-sib1", "running", testVPC, testAZ, workTags, nil) + var callIdx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&callIdx, 1) + if idx == 1 { + return describeResponse(workInst), nil + } + if params.Filters != nil { + for _, f := range params.Filters { + if aws.ToString(f.Name) == "tag:nat-zero:managed" { + return describeResponse(natInst), nil + } + } + } + return describeResponse(sibInst), nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "stopping"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("StopInstances") != 0 { + t.Error("expected StopInstances NOT to be called") + } + }) + + t.Run("no siblings stops running NAT", func(t *testing.T) { + mock := &mockEC2{} + workInst := makeTestInstance("i-work1", "terminated", testVPC, testAZ, workTags, nil) + natInst := makeTestInstance("i-nat1", "running", 
testVPC, testAZ, natTags, nil) + var callIdx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&callIdx, 1) + if idx == 1 { + return describeResponse(workInst), nil + } + if params.Filters != nil { + for _, f := range params.Filters { + if aws.ToString(f.Name) == "tag:nat-zero:managed" { + return describeResponse(natInst), nil + } + } + } + return describeResponse(), nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "terminated"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("StopInstances") != 1 { + t.Errorf("expected StopInstances to be called once, got %d", mock.callCount("StopInstances")) + } + // EIP is NOT released inline anymore — detachEIP happens via EventBridge + if mock.callCount("DisassociateAddress") != 0 { + t.Error("expected no DisassociateAddress (EIP managed via EventBridge)") + } + }) + + t.Run("no siblings NAT already stopped is noop", func(t *testing.T) { + mock := &mockEC2{} + workInst := makeTestInstance("i-work1", "stopping", testVPC, testAZ, workTags, nil) + natInst := makeTestInstance("i-nat1", "stopped", testVPC, testAZ, natTags, nil) + var callIdx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&callIdx, 1) + if idx == 1 { + return describeResponse(workInst), nil + } + if params.Filters != nil { + for _, f := range params.Filters { + if aws.ToString(f.Name) == "tag:nat-zero:managed" { + return describeResponse(natInst), nil + } + } + } + return describeResponse(), nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "stopping"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } 
+ if mock.callCount("StopInstances") != 0 { + t.Error("expected StopInstances NOT to be called") + } + }) + + t.Run("siblings appear on retry", func(t *testing.T) { + mock := &mockEC2{} + workInst := makeTestInstance("i-work1", "stopping", testVPC, testAZ, workTags, nil) + natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil) + sibInst := makeTestInstance("i-sib1", "running", testVPC, testAZ, workTags, nil) + var callIdx int32 + var sibCallIdx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&callIdx, 1) + if idx == 1 { + return describeResponse(workInst), nil + } + if params.Filters != nil { + for _, f := range params.Filters { + if aws.ToString(f.Name) == "tag:nat-zero:managed" { + return describeResponse(natInst), nil + } + } + } + sibIdx := atomic.AddInt32(&sibCallIdx, 1) + if sibIdx == 1 { + return describeResponse(), nil + } + return describeResponse(sibInst), nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "stopping"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("StopInstances") != 0 { + t.Error("expected StopInstances NOT to be called") + } + }) + + t.Run("pending NAT no siblings stops", func(t *testing.T) { + mock := &mockEC2{} + workInst := makeTestInstance("i-work1", "stopped", testVPC, testAZ, workTags, nil) + natInst := makeTestInstance("i-nat1", "pending", testVPC, testAZ, natTags, nil) + var callIdx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&callIdx, 1) + if idx == 1 { + return describeResponse(workInst), nil + } + if params.Filters != nil { + for _, f := range params.Filters { + if aws.ToString(f.Name) == "tag:nat-zero:managed" { 
+ return describeResponse(natInst), nil + } + } + } + return describeResponse(), nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "stopped"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("StopInstances") != 1 { + t.Errorf("expected StopInstances to be called once, got %d", mock.callCount("StopInstances")) + } + }) +} + +// --- Config version replacement --- + +func TestHandlerConfigVersion(t *testing.T) { + workTags := []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}} + natTags := []ec2types.Tag{ + {Key: aws.String("nat-zero:managed"), Value: aws.String("true")}, + {Key: aws.String("ConfigVersion"), Value: aws.String("old456")}, + } + + t.Run("outdated config triggers replace", func(t *testing.T) { + mock := &mockEC2{} + workInst := makeTestInstance("i-work1", "running", testVPC, testAZ, workTags, nil) + natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil) + var callIdx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&callIdx, 1) + if idx == 1 { + return describeResponse(workInst), nil + } + if params.Filters != nil { + for _, f := range params.Filters { + if aws.ToString(f.Name) == "tag:nat-zero:managed" { + return describeResponse(natInst), nil + } + } + } + return describeResponse(makeTestInstance("i-nat1", "terminated", testVPC, testAZ, natTags, nil)), nil + } + mock.DescribeLaunchTemplatesFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplatesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplatesOutput, error) { + return &ec2.DescribeLaunchTemplatesOutput{ + LaunchTemplates: []ec2types.LaunchTemplate{{LaunchTemplateId: aws.String("lt-123")}}, + }, nil + } + mock.DescribeLaunchTemplateVersionsFn = func(ctx context.Context, params 
*ec2.DescribeLaunchTemplateVersionsInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplateVersionsOutput, error) { + return &ec2.DescribeLaunchTemplateVersionsOutput{ + LaunchTemplateVersions: []ec2types.LaunchTemplateVersion{{ + LaunchTemplateId: aws.String("lt-123"), VersionNumber: aws.Int64(1), + }}, + }, nil + } + mock.DescribeImagesFn = func(ctx context.Context, params *ec2.DescribeImagesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeImagesOutput, error) { + return &ec2.DescribeImagesOutput{Images: []ec2types.Image{}}, nil + } + mock.RunInstancesFn = func(ctx context.Context, params *ec2.RunInstancesInput, optFns ...func(*ec2.Options)) (*ec2.RunInstancesOutput, error) { + return &ec2.RunInstancesOutput{ + Instances: []ec2types.Instance{{InstanceId: aws.String("i-new")}}, + }, nil + } + h := newTestHandler(mock) + h.ConfigVersion = "abc123" + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "running"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("TerminateInstances") != 1 { + t.Error("expected TerminateInstances to be called (replace)") + } + if mock.callCount("RunInstances") != 1 { + t.Error("expected RunInstances to be called (create replacement)") + } + }) + + t.Run("missing config tag skips replace", func(t *testing.T) { + // When the ConfigVersion tag is absent (e.g. EC2 eventual consistency + // delay on a just-created instance, or an older NAT), there is nothing + // to compare against so isCurrentConfig returns true and no replacement + // happens. 
+ mock := &mockEC2{} + workInst := makeTestInstance("i-work1", "running", testVPC, testAZ, workTags, nil) + noVersionTags := []ec2types.Tag{ + {Key: aws.String("nat-zero:managed"), Value: aws.String("true")}, + // No ConfigVersion tag + } + natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, noVersionTags, nil) + var callIdx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&callIdx, 1) + if idx == 1 { + return describeResponse(workInst), nil + } + return describeResponse(natInst), nil + } + h := newTestHandler(mock) + h.ConfigVersion = "abc123" + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "running"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("TerminateInstances") != 0 { + t.Error("expected TerminateInstances NOT to be called when tag is missing") + } + if mock.callCount("RunInstances") != 0 { + t.Error("expected RunInstances NOT to be called when tag is missing") + } + }) + + t.Run("current config is noop", func(t *testing.T) { + mock := &mockEC2{} + workInst := makeTestInstance("i-work1", "pending", testVPC, testAZ, workTags, nil) + currentTags := []ec2types.Tag{ + {Key: aws.String("nat-zero:managed"), Value: aws.String("true")}, + {Key: aws.String("ConfigVersion"), Value: aws.String("abc123")}, + } + natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, currentTags, nil) + var callIdx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&callIdx, 1) + if idx == 1 { + return describeResponse(workInst), nil + } + return describeResponse(natInst), nil + } + h := newTestHandler(mock) + h.ConfigVersion = "abc123" + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", 
State: "pending"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("RunInstances") != 0 { + t.Error("expected RunInstances NOT to be called") + } + if mock.callCount("StartInstances") != 0 { + t.Error("expected StartInstances NOT to be called") + } + if mock.callCount("TerminateInstances") != 0 { + t.Error("expected TerminateInstances NOT to be called") + } + }) +} diff --git a/cmd/lambda/main.go b/cmd/lambda/main.go new file mode 100644 index 0000000..bfe950c --- /dev/null +++ b/cmd/lambda/main.go @@ -0,0 +1,38 @@ +package main + +import ( + "context" + "os" + + "github.com/aws/aws-lambda-go/lambda" + "github.com/aws/aws-sdk-go-v2/config" + "github.com/aws/aws-sdk-go-v2/service/ec2" +) + +func envOr(key, fallback string) string { + if v := os.Getenv(key); v != "" { + return v + } + return fallback +} + +func main() { + cfg, err := config.LoadDefaultConfig(context.Background()) + if err != nil { + panic("unable to load AWS config: " + err.Error()) + } + + h := &Handler{ + EC2: ec2.NewFromConfig(cfg), + NATTagKey: envOr("NAT_TAG_KEY", "nat-zero:managed"), + NATTagValue: envOr("NAT_TAG_VALUE", "true"), + IgnoreTagKey: envOr("IGNORE_TAG_KEY", "nat-zero:ignore"), + IgnoreTagValue: envOr("IGNORE_TAG_VALUE", "true"), + TargetVPC: os.Getenv("TARGET_VPC_ID"), + AMIOwner: envOr("AMI_OWNER_ACCOUNT", "568608671756"), + AMIPattern: envOr("AMI_NAME_PATTERN", "fck-nat-al2023-*-arm64-*"), + ConfigVersion: os.Getenv("CONFIG_VERSION"), + } + + lambda.Start(h.HandleRequest) +} diff --git a/cmd/lambda/mock_test.go b/cmd/lambda/mock_test.go new file mode 100644 index 0000000..0a20583 --- /dev/null +++ b/cmd/lambda/mock_test.go @@ -0,0 +1,229 @@ +package main + +import ( + "context" + "sync" + "time" + + "github.com/aws/aws-sdk-go-v2/aws" + "github.com/aws/aws-sdk-go-v2/service/ec2" + ec2types "github.com/aws/aws-sdk-go-v2/service/ec2/types" +) + +// mockEC2 implements EC2API with per-method function fields for test control. 
+type mockEC2 struct { + DescribeInstancesFn func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) + RunInstancesFn func(ctx context.Context, params *ec2.RunInstancesInput, optFns ...func(*ec2.Options)) (*ec2.RunInstancesOutput, error) + StartInstancesFn func(ctx context.Context, params *ec2.StartInstancesInput, optFns ...func(*ec2.Options)) (*ec2.StartInstancesOutput, error) + StopInstancesFn func(ctx context.Context, params *ec2.StopInstancesInput, optFns ...func(*ec2.Options)) (*ec2.StopInstancesOutput, error) + TerminateInstancesFn func(ctx context.Context, params *ec2.TerminateInstancesInput, optFns ...func(*ec2.Options)) (*ec2.TerminateInstancesOutput, error) + AllocateAddressFn func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) + AssociateAddressFn func(ctx context.Context, params *ec2.AssociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AssociateAddressOutput, error) + DisassociateAddressFn func(ctx context.Context, params *ec2.DisassociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.DisassociateAddressOutput, error) + ReleaseAddressFn func(ctx context.Context, params *ec2.ReleaseAddressInput, optFns ...func(*ec2.Options)) (*ec2.ReleaseAddressOutput, error) + DescribeAddressesFn func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) + DescribeNetworkInterfacesFn func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) + DescribeImagesFn func(ctx context.Context, params *ec2.DescribeImagesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeImagesOutput, error) + DescribeLaunchTemplatesFn func(ctx context.Context, params *ec2.DescribeLaunchTemplatesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplatesOutput, error) + 
DescribeLaunchTemplateVersionsFn func(ctx context.Context, params *ec2.DescribeLaunchTemplateVersionsInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplateVersionsOutput, error) + + // Call tracking (mutex-protected for concurrent access) + mu sync.Mutex + Calls []mockCall +} + +type mockCall struct { + Method string + Input interface{} +} + +func (m *mockEC2) track(method string, input interface{}) { + m.mu.Lock() + m.Calls = append(m.Calls, mockCall{Method: method, Input: input}) + m.mu.Unlock() +} + +func (m *mockEC2) callCount(method string) int { + m.mu.Lock() + defer m.mu.Unlock() + n := 0 + for _, c := range m.Calls { + if c.Method == method { + n++ + } + } + return n +} + +func (m *mockEC2) DescribeInstances(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + m.track("DescribeInstances", params) + if m.DescribeInstancesFn != nil { + return m.DescribeInstancesFn(ctx, params, optFns...) + } + return &ec2.DescribeInstancesOutput{}, nil +} + +func (m *mockEC2) RunInstances(ctx context.Context, params *ec2.RunInstancesInput, optFns ...func(*ec2.Options)) (*ec2.RunInstancesOutput, error) { + m.track("RunInstances", params) + if m.RunInstancesFn != nil { + return m.RunInstancesFn(ctx, params, optFns...) + } + return &ec2.RunInstancesOutput{}, nil +} + +func (m *mockEC2) StartInstances(ctx context.Context, params *ec2.StartInstancesInput, optFns ...func(*ec2.Options)) (*ec2.StartInstancesOutput, error) { + m.track("StartInstances", params) + if m.StartInstancesFn != nil { + return m.StartInstancesFn(ctx, params, optFns...) + } + return &ec2.StartInstancesOutput{}, nil +} + +func (m *mockEC2) StopInstances(ctx context.Context, params *ec2.StopInstancesInput, optFns ...func(*ec2.Options)) (*ec2.StopInstancesOutput, error) { + m.track("StopInstances", params) + if m.StopInstancesFn != nil { + return m.StopInstancesFn(ctx, params, optFns...) 
+ } + return &ec2.StopInstancesOutput{}, nil +} + +func (m *mockEC2) TerminateInstances(ctx context.Context, params *ec2.TerminateInstancesInput, optFns ...func(*ec2.Options)) (*ec2.TerminateInstancesOutput, error) { + m.track("TerminateInstances", params) + if m.TerminateInstancesFn != nil { + return m.TerminateInstancesFn(ctx, params, optFns...) + } + return &ec2.TerminateInstancesOutput{}, nil +} + +func (m *mockEC2) AllocateAddress(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { + m.track("AllocateAddress", params) + if m.AllocateAddressFn != nil { + return m.AllocateAddressFn(ctx, params, optFns...) + } + return &ec2.AllocateAddressOutput{}, nil +} + +func (m *mockEC2) AssociateAddress(ctx context.Context, params *ec2.AssociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AssociateAddressOutput, error) { + m.track("AssociateAddress", params) + if m.AssociateAddressFn != nil { + return m.AssociateAddressFn(ctx, params, optFns...) + } + return &ec2.AssociateAddressOutput{}, nil +} + +func (m *mockEC2) DisassociateAddress(ctx context.Context, params *ec2.DisassociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.DisassociateAddressOutput, error) { + m.track("DisassociateAddress", params) + if m.DisassociateAddressFn != nil { + return m.DisassociateAddressFn(ctx, params, optFns...) + } + return &ec2.DisassociateAddressOutput{}, nil +} + +func (m *mockEC2) ReleaseAddress(ctx context.Context, params *ec2.ReleaseAddressInput, optFns ...func(*ec2.Options)) (*ec2.ReleaseAddressOutput, error) { + m.track("ReleaseAddress", params) + if m.ReleaseAddressFn != nil { + return m.ReleaseAddressFn(ctx, params, optFns...) 
+ } + return &ec2.ReleaseAddressOutput{}, nil +} + +func (m *mockEC2) DescribeAddresses(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + m.track("DescribeAddresses", params) + if m.DescribeAddressesFn != nil { + return m.DescribeAddressesFn(ctx, params, optFns...) + } + return &ec2.DescribeAddressesOutput{}, nil +} + +func (m *mockEC2) DescribeNetworkInterfaces(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { + m.track("DescribeNetworkInterfaces", params) + if m.DescribeNetworkInterfacesFn != nil { + return m.DescribeNetworkInterfacesFn(ctx, params, optFns...) + } + return &ec2.DescribeNetworkInterfacesOutput{}, nil +} + +func (m *mockEC2) DescribeImages(ctx context.Context, params *ec2.DescribeImagesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeImagesOutput, error) { + m.track("DescribeImages", params) + if m.DescribeImagesFn != nil { + return m.DescribeImagesFn(ctx, params, optFns...) + } + return &ec2.DescribeImagesOutput{}, nil +} + +func (m *mockEC2) DescribeLaunchTemplates(ctx context.Context, params *ec2.DescribeLaunchTemplatesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplatesOutput, error) { + m.track("DescribeLaunchTemplates", params) + if m.DescribeLaunchTemplatesFn != nil { + return m.DescribeLaunchTemplatesFn(ctx, params, optFns...) + } + return &ec2.DescribeLaunchTemplatesOutput{}, nil +} + +func (m *mockEC2) DescribeLaunchTemplateVersions(ctx context.Context, params *ec2.DescribeLaunchTemplateVersionsInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplateVersionsOutput, error) { + m.track("DescribeLaunchTemplateVersions", params) + if m.DescribeLaunchTemplateVersionsFn != nil { + return m.DescribeLaunchTemplateVersionsFn(ctx, params, optFns...) 
+ } + return &ec2.DescribeLaunchTemplateVersionsOutput{}, nil +} + +// --- Test helper builders --- + +const ( + testVPC = "vpc-test123" + testAZ = "us-east-1a" +) + +func makeTestInstance(id, state, vpcID, az string, tags []ec2types.Tag, enis []ec2types.InstanceNetworkInterface) ec2types.Instance { + stateCode := map[string]int32{ + "pending": 0, "running": 16, "shutting-down": 32, + "terminated": 48, "stopping": 64, "stopped": 80, + } + return ec2types.Instance{ + InstanceId: aws.String(id), + State: &ec2types.InstanceState{ + Name: ec2types.InstanceStateName(state), + Code: aws.Int32(stateCode[state]), + }, + VpcId: aws.String(vpcID), + Placement: &ec2types.Placement{AvailabilityZone: aws.String(az)}, + Tags: tags, + NetworkInterfaces: enis, + } +} + +func makeENI(id string, deviceIndex int32, privateIP string, association *ec2types.InstanceNetworkInterfaceAssociation) ec2types.InstanceNetworkInterface { + eni := ec2types.InstanceNetworkInterface{ + NetworkInterfaceId: aws.String(id), + Attachment: &ec2types.InstanceNetworkInterfaceAttachment{DeviceIndex: aws.Int32(deviceIndex)}, + PrivateIpAddress: aws.String(privateIP), + } + if association != nil { + eni.Association = association + } + return eni +} + +func describeResponse(instances ...ec2types.Instance) *ec2.DescribeInstancesOutput { + if len(instances) == 0 { + return &ec2.DescribeInstancesOutput{Reservations: []ec2types.Reservation{}} + } + return &ec2.DescribeInstancesOutput{ + Reservations: []ec2types.Reservation{{Instances: instances}}, + } +} + +func newTestHandler(mock *mockEC2) *Handler { + return &Handler{ + EC2: mock, + NATTagKey: "nat-zero:managed", + NATTagValue: "true", + IgnoreTagKey: "nat-zero:ignore", + IgnoreTagValue: "true", + TargetVPC: testVPC, + AMIOwner: "568608671756", + AMIPattern: "fck-nat-al2023-*-arm64-*", + ConfigVersion: "", + SleepFunc: func(d time.Duration) {}, // no-op sleep + } +} diff --git a/cmd/lambda/perf.go b/cmd/lambda/perf.go new file mode 100644 index 
0000000..6747aa1 --- /dev/null +++ b/cmd/lambda/perf.go @@ -0,0 +1,15 @@ +package main + +import ( + "log" + "time" +) + +// timed returns a function that, when called, logs the elapsed time since +// timed() was called. Usage: defer timed("label")() +func timed(label string) func() { + start := time.Now() + return func() { + log.Printf("%s: %dms", label, time.Since(start).Milliseconds()) + } +} diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md new file mode 100644 index 0000000..32ae603 --- /dev/null +++ b/docs/ARCHITECTURE.md @@ -0,0 +1,283 @@ +# Architecture + +## High-Level Overview + +The nat-zero module provides event-driven, scale-to-zero NAT instances for AWS. It uses EventBridge to capture EC2 instance state changes and a Lambda function to orchestrate the NAT instance lifecycle. + +``` + DATA PLANE + ┌──────────────────────────────────────────────────────────────────┐ + │ │ + │ Private Subnet NAT Instance Public Subnet │ + │ ┌─────────────┐ ┌───────────────────┐ ┌───────────────┐ │ + │ │ Workload │ │ Linux Kernel │ │ Public ENI │ │ + │ │ Instance │───>│ iptables │───>│ (ens5) │──>│── Internet + │ │ │ │ MASQUERADE │ │ + EIP │ │ Gateway + │ └─────────────┘ └───────────────────┘ └───────────────┘ │ + │ │ Private ENI (ens6) │ + │ └──────────────────┘ │ + │ route 0.0.0.0/0 │ + └──────────────────────────────────────────────────────────────────┘ + + CONTROL PLANE + ┌──────────────────────────────────────────────────────────────────┐ + │ │ + │ ┌──────────────────┐ ┌──────────────────┐ │ + │ │ EventBridge │───>│ Lambda Function │ │ + │ │ EC2 State Change │ │ nat-zero │ │ + │ └──────────────────┘ └────────┬─────────┘ │ + │ │ │ + │ ┌──────────────┼──────────────┐ │ + │ │ │ │ │ + │ v v v │ + │ start/stop allocate/ on failure │ + │ NAT instance release EIP ┌─────────┐ │ + │ │ SQS DLQ │ │ + │ └─────────┘ │ + └──────────────────────────────────────────────────────────────────┘ +``` + +## Event Flow + +Every EC2 state change in the account fires an EventBridge 
event. The Lambda classifies each instance as: **ignore** (wrong VPC / ignore tag), **NAT** (has `nat-zero:managed=true` tag), or **workload** (everything else). + +### Scale-up: Workload starts, NAT created + +``` +1. Workload → pending + Lambda: classify → workload, starting + Action: findNAT → none → createNAT (RunInstances) + +2. NAT → pending + Lambda: classify → NAT, starting + Action: attachEIP → wait for running... (not yet, will retry on next event) + +3. NAT → running + Lambda: classify → NAT, starting + Action: attachEIP → instance running → allocate EIP → associate to public ENI + Result: NAT has internet via EIP ✓ + +4. Workload → running + Lambda: classify → workload, starting + Action: findNAT → found running NAT → no-op +``` + +### Scale-down: Workload terminates, NAT stopped + +``` +1. Workload → shutting-down + Lambda: classify → workload, terminating + Action: findNAT → found running NAT → findSiblings → none (3x retry) → stopNAT + +2. NAT → stopping + Lambda: classify → NAT, stopping + Action: detachEIP → wait for stopped... (not yet) + +3. NAT → stopped + Lambda: classify → NAT, stopping + Action: detachEIP → instance stopped → disassociate EIP → release EIP + Result: NAT idle, no EIP charge ✓ + +4. Workload → terminated + Lambda: classify → workload, terminating + Action: findNAT → found stopped NAT → NAT not in starting state → no-op +``` + +### Restart: New workload starts, stopped NAT restarted + +``` +1. New workload → pending + Lambda: classify → workload, starting + Action: findNAT → found stopped NAT → startNAT (wait stopped → StartInstances) + +2. NAT → pending + Lambda: classify → NAT, starting + Action: attachEIP → wait for running... (not yet) + +3. NAT → running + Lambda: classify → NAT, starting + Action: attachEIP → instance running → allocate EIP → associate to public ENI + Result: NAT has internet via EIP ✓ + +4. 
New workload → running + Lambda: classify → workload, starting + Action: findNAT → found running NAT → no-op +``` + +### Terraform destroy + +``` +Terraform invokes Lambda with {action: "cleanup"} +Action: find all NAT instances → terminate → release all EIPs +Result: clean state for ENI/SG destruction ✓ +``` + +### Why this is safe from races + +- **EIP attach is idempotent**: `attachEIP` checks if the ENI already has an EIP before allocating. Multiple concurrent `running` events for the same NAT are harmless. +- **EIP detach is idempotent**: `detachEIP` checks if the ENI has an association before releasing. +- **NAT dedup**: `findNAT` terminates extras if multiple NATs exist in one AZ. +- **Workload handlers never touch EIPs**: Only NAT events manage EIPs. Workload events only start/stop/create NAT instances. + +## Scale-Up Sequence + +``` + Workload EventBridge Lambda EC2 API NAT Instance + Instance (per AZ) + │ │ │ │ │ + │ state:"pending"│ │ │ │ + ├───────────────>│ │ │ │ + │ │ invoke │ │ │ + │ ├───────────────>│ │ │ + │ │ │ │ │ + │ │ │ describe_instances(id) │ + │ │ ├───────────────>│ │ + │ │ │<───────────────┤ │ + │ │ │ │ │ + │ │ │ Check: VPC matches? Not ignored? Not NAT? 
+ │ │ │ │ │ + │ │ │ describe_instances(NAT tag, AZ, VPC) + │ │ ├───────────────>│ │ + │ │ │<───────────────┤ │ + │ │ │ │ │ + │ ┌─────┴────────────────┴────────────────┴────────┐ │ + │ │ IF no NAT instance: │ │ + │ │ describe_launch_templates(AZ, VPC) │ │ + │ │ describe_images(fck-nat pattern) │ │ + │ │ run_instances(template, AMI) ──────────────>│──────>│ Created + │ │ │ │ + │ │ IF NAT stopped: │ │ + │ │ start_instances(nat_id) ───────────────────>│──────>│ Starting + │ │ │ │ + │ │ IF NAT already running: │ │ + │ │ No action needed │ │ + │ └─────┬────────────────┬────────────────┬────────┘ │ + │ │ │ │ │ + │ │ │ │ state:"running" + │ │ invoke │ │<───────────────┤ + │ ├───────────────>│ │ │ + │ │ │ allocate_address │ + │ │ │ associate_address │ + │ │ ├───────────────>│ │ + │ │ │ │──── EIP ──────>│ + │ │ │ │ NAT ready +``` + +## Scale-Down Sequence + +``` + Workload EventBridge Lambda EC2 API NAT Instance + Instance (per AZ) + │ │ │ │ │ + │state:"stopping"│ │ │ │ + ├───────────────>│ │ │ │ + │ │ invoke │ │ │ + │ ├───────────────>│ │ │ + │ │ │ │ │ + │ │ │ describe_instances(id) │ + │ │ ├───────────────>│ │ + │ │ │ Check: VPC, not ignored, not NAT + │ │ │ │ │ + │ │ ┌──────────┴──────────┐ │ │ + │ │ │ Retry loop (3x, 2s) │ │ │ + │ │ │ describe_instances │ │ │ + │ │ │ (AZ, VPC, running) ├───>│ │ + │ │ │ filter out NAT + │<───┤ │ + │ │ │ ignored instances │ │ │ + │ │ └──────────┬──────────┘ │ │ + │ │ │ │ │ + │ ┌─────┴────────────────┴────────────────┴────────┐ │ + │ │ IF no siblings remain: │ │ + │ │ stop_instances(nat_id) ─────────────────────>│──────>│ Stopping + │ │ │ │ + │ │ IF siblings still running: │ │ + │ │ Keep NAT running, no action │ │ + │ └─────┬────────────────┬────────────────┬────────┘ │ + │ │ │ │ │ + │ │ │ │ state:"stopped" + │ │ invoke │ │<───────────────┤ + │ ├───────────────>│ │ │ + │ │ │ disassociate_address │ + │ │ │ release_address │ + │ │ ├───────────────>│ │ + │ │ │ │ EIP released │ + │ │ │ │ │ + │ │ │ │ NAT stopped │ + │ │ │ │ Cost: ~$0.80/mo + │ │ 
│ │ (EBS only) │ +``` + +## Dual ENI Architecture + +Each NAT instance uses two Elastic Network Interfaces (ENIs) to separate public and private traffic. ENIs are pre-created by Terraform and attached via the launch template, so they persist across instance stop/start cycles. + +``` + Private Subnet NAT Instance Public Subnet + ┌──────────────────┐ ┌──────────────────────┐ ┌──────────────────────┐ + │ │ │ │ │ │ + │ Route Table │ │ ┌──────────────┐ │ │ │ + │ 0.0.0.0/0 ──────┼───>│ │ iptables │ │ │ │ + │ │ │ │ │ │ │ │ │ + │ v │ │ │ MASQUERADE │ │ │ │ + │ ┌────────────┐ │ │ │ on ens5 │ │ │ ┌────────────────┐ │ + │ │ Private ENI│ │ │ │ │───┼───>│ │ Public ENI │ │ + │ │ (ens6) │──┼───>│ │ FORWARD │ │ │ │ (ens5) │──┼──> Internet + │ │ │ │ │ │ ens6 → ens5 │ │ │ │ + EIP │ │ Gateway + │ │ No pub IP │ │ │ │ │ │ │ │ │ │ + │ │ src_dst=off│ │ │ │ RELATED, │ │ │ │ src_dst=off │ │ + │ └────────────┘ │ │ │ ESTABLISHED │ │ │ └────────────────┘ │ + │ │ │ └──────────────┘ │ │ │ + └──────────────────┘ └──────────────────────┘ └──────────────────────┘ +``` + +Key design decisions: +- **Pre-created ENIs**: ENIs are Terraform-managed and referenced in the launch template. They survive instance stop/start, preserving route table entries. +- **source_dest_check=false**: Required on both ENIs for NAT to work (instance forwards packets not addressed to itself). +- **EIP lifecycle**: Elastic IPs are allocated when the NAT instance reaches "running" and released when it reaches "stopped", both via EventBridge events. This avoids charges for unused EIPs. + +## Comparison with fck-nat + +This module complements fck-nat by adding scale-to-zero capability. 
+ +``` + fck-nat (Always-On) nat-zero (Scale-to-Zero) + ┌────────────────────────────┐ ┌────────────────────────────────┐ + │ │ │ │ + │ ┌──────────────────────┐ │ │ ┌────────────┐ │ + │ │ Auto Scaling Group │ │ │ │ EventBridge │ │ + │ │ min=1, max=1 │ │ │ │ EC2 state │ │ + │ └──────────┬───────────┘ │ │ │ changes │ │ + │ │ │ │ └──────┬─────┘ │ + │ v │ │ │ │ + │ ┌──────────────────────┐ │ │ v │ + │ │ NAT Instance │ │ │ ┌────────────┐ │ + │ │ Always running │ │ │ │ Lambda │ │ + │ │ │ │ │ │ Orchestr. │ │ + │ └──────────────────────┘ │ │ └──────┬─────┘ │ + │ │ │ │ │ + │ Cost: ~$7-8/mo │ │ v │ + │ (instance + EIP 24/7) │ │ ┌────────────────────┐ │ + │ Self-healing via ASG │ │ │ NAT Instance │ │ + │ No Lambda needed │ │ │ Started on demand │ │ + └────────────────────────────┘ │ │ Stopped when idle │ │ + │ └────────────────────┘ │ + │ │ + │ Cost: ~$0.80/mo (idle) │ + │ EIP released when stopped │ + │ Zero IPv4 charge when idle │ + └────────────────────────────────┘ +``` + +Costs per AZ, per month. Includes the [AWS public IPv4 charge](https://aws.amazon.com/blogs/aws/new-aws-public-ipv4-address-charge-public-ip-insights/) ($3.60/mo per public IP, effective Feb 2024). 
+ +| Aspect | fck-nat | nat-zero | +|--------|---------|-------------------| +| Architecture | ASG with min=1 | Lambda + EventBridge | +| Idle cost | ~$7-8/mo (instance + EIP 24/7) | ~$0.80/mo (EBS only, no EIP) | +| Active cost | ~$7-8/mo | ~$7-8/mo (same) | +| Public IPv4 charge | $3.60/mo always | $0 when idle (EIP released) | +| Scale-to-zero | No | Yes | +| Self-healing | ASG replaces unhealthy | Lambda creates new on demand | +| AMI | fck-nat AMI | fck-nat AMI (same) | +| Complexity | Low (ASG only) | Higher (Lambda + EventBridge) | +| Best for | Production 24/7 | Dev/staging, intermittent workloads | diff --git a/docs/EXAMPLES.md b/docs/EXAMPLES.md new file mode 100644 index 0000000..8d74007 --- /dev/null +++ b/docs/EXAMPLES.md @@ -0,0 +1,135 @@ +# Examples + +## Basic Usage + +A complete working example that creates a VPC with public and private subnets, then deploys nat-zero to provide scale-to-zero NAT for the private subnets. + +```hcl +terraform { + required_version = ">= 1.3" + + required_providers { + aws = { + source = "hashicorp/aws" + version = ">= 5.0" + } + } +} + +provider "aws" { + region = "us-east-1" +} + +data "aws_availability_zones" "available" { + state = "available" +} + +locals { + azs = slice(data.aws_availability_zones.available.names, 0, 2) +} + +module "vpc" { + source = "terraform-aws-modules/vpc/aws" + version = "~> 5.0" + + name = "nat-zero-example" + cidr = "10.0.0.0/16" + + azs = local.azs + public_subnets = ["10.0.1.0/24", "10.0.2.0/24"] + private_subnets = ["10.0.101.0/24", "10.0.102.0/24"] + + # Do NOT enable NAT gateway -- this module replaces it + enable_nat_gateway = false +} + +module "nat_zero" { + source = "github.com/MachineDotDev/nat-zero" + + name = "example-nat" + vpc_id = module.vpc.vpc_id + availability_zones = local.azs + public_subnets = module.vpc.public_subnets + private_subnets = module.vpc.private_subnets + + private_route_table_ids = module.vpc.private_route_table_ids + private_subnets_cidr_blocks = 
module.vpc.private_subnets_cidr_blocks + + # Defaults: t4g.nano, fck-nat AMI, on-demand + # Uncomment for spot instances: + # market_type = "spot" + + tags = { + Environment = "example" + ManagedBy = "terraform" + } +} + +output "lambda_function_name" { + value = module.nat_zero.lambda_function_name +} + +output "nat_security_group_ids" { + value = module.nat_zero.nat_security_group_ids +} +``` + +The full source is available at [`examples/basic/main.tf`](https://github.com/MachineDotDev/nat-zero/blob/main/examples/basic/main.tf). + +## Spot Instances + +To use spot instances (typically 60-70% cheaper than on-demand): + +```hcl +module "nat_zero" { + source = "github.com/MachineDotDev/nat-zero" + + # ... required variables ... + + market_type = "spot" +} +``` + +## Custom AMI + +To use a custom AMI instead of the default fck-nat AMI: + +```hcl +module "nat_zero" { + source = "github.com/MachineDotDev/nat-zero" + + # ... required variables ... + + use_fck_nat_ami = false + custom_ami_owner = "123456789012" + custom_ami_name_pattern = "my-nat-ami-*" +} +``` + +Or specify an AMI ID directly: + +```hcl +module "nat_zero" { + source = "github.com/MachineDotDev/nat-zero" + + # ... required variables ... + + ami_id = "ami-0123456789abcdef0" +} +``` + +## Building Lambda Locally + +For development or if you want to build from source: + +```hcl +module "nat_zero" { + source = "github.com/MachineDotDev/nat-zero" + + # ... required variables ... + + build_lambda_locally = true +} +``` + +Requires Go and `zip` installed locally. diff --git a/docs/INDEX.md b/docs/INDEX.md new file mode 100644 index 0000000..e342ec1 --- /dev/null +++ b/docs/INDEX.md @@ -0,0 +1,194 @@ +# nat-zero + +Scale-to-zero NAT instances for AWS. Uses [fck-nat](https://fck-nat.dev/) AMIs. Zero cost when idle. 
+
+```
+         CONTROL PLANE
+ ┌──────────────────────────────────────────────────┐
+ │ EventBridge ──> Lambda (NAT Orchestrator)        │
+ │                    │ start/stop instances        │
+ │                    │ allocate/release EIPs       │
+ └────────────────────┼─────────────────────────────┘
+                      │
+         ┌────────────┴────────────┐
+         v                         v
+    AZ-A (active)              AZ-B (idle)
+ ┌──────────────────┐       ┌──────────────────┐
+ │ Workloads        │       │ No workloads     │
+ │   ↓ route table  │       │ No NAT instance  │
+ │ Private ENI      │       │ No EIP           │
+ │   ↓              │       │                  │
+ │ NAT Instance     │       │ Cost: ~$0.80/mo  │
+ │   ↓              │       │ (EBS only)       │
+ │ Public ENI + EIP │       │                  │
+ │   ↓              │       └──────────────────┘
+ │ Internet Gateway │
+ └──────────────────┘
+```
+
+## How It Works
+
+An EventBridge rule captures all EC2 instance state changes. A Lambda function evaluates each event and manages the NAT instance lifecycle per AZ:
+
+- **Workload starts** in a private subnet → Lambda starts (or creates) a NAT instance in the same AZ and attaches an Elastic IP
+- **Last workload stops** in an AZ → Lambda stops the NAT instance and releases the Elastic IP
+- **NAT instance starts** → Lambda attaches an EIP to the public ENI
+- **NAT instance stops** → Lambda detaches and releases the EIP
+
+Each NAT instance uses dual ENIs (public + private) pre-created by Terraform. Traffic from private subnets routes through the private ENI, gets masqueraded via iptables, and exits through the public ENI with an Elastic IP.
+
+See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for detailed diagrams, [docs/PERFORMANCE.md](docs/PERFORMANCE.md) for timing and cost data, and [docs/TESTING.md](docs/TESTING.md) for integration test documentation.
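The lifecycle rules above reduce to a single decision function. The following Go sketch is illustrative only — type and function names are invented for the example, and the module's actual logic lives in `cmd/lambda/handler.go`:

```go
package main

import "fmt"

// Event is a trimmed view of one EC2 state-change notification after the
// Lambda has looked up the instance's tags (illustrative type, not the real one).
type Event struct {
	State string // "running", "shutting-down", "stopped", ...
	IsNAT bool   // true if the instance carries the nat-zero:managed tag
}

// decide maps an event plus the AZ's current NAT state to an EC2 action,
// mirroring the four lifecycle rules described above.
func decide(e Event, natRunning, natStopped, siblings bool) string {
	switch {
	case !e.IsNAT && e.State == "running" && natRunning:
		return "noop" // route table already points at a running NAT
	case !e.IsNAT && e.State == "running" && natStopped:
		return "StartInstances" // restart the idle NAT
	case !e.IsNAT && e.State == "running":
		return "RunInstances" // cold-create a NAT in this AZ
	case !e.IsNAT && e.State == "shutting-down" && !siblings:
		return "StopInstances" // last workload in the AZ is going away
	case e.IsNAT && e.State == "running":
		return "attachEIP" // allocate + associate an Elastic IP
	case e.IsNAT && e.State == "stopped":
		return "detachEIP" // disassociate + release the Elastic IP
	}
	return "noop"
}

func main() {
	fmt.Println(decide(Event{State: "running"}, false, false, false)) // RunInstances
	fmt.Println(decide(Event{IsNAT: true, State: "stopped"}, false, false, false)) // detachEIP
}
```

Note that EIP attach/detach is driven by the NAT's own state-change events, not by the workload event that started or stopped it — which is why the two show up as separate Lambda invocations in the performance data.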
+ +## When To Use This Module + +| Use Case | This Module | fck-nat | NAT Gateway | +|---|---|---|---| +| Dev/staging with intermittent workloads | **Best fit** | Wasteful | Very wasteful | +| Production 24/7 workloads | Overkill | **Best fit** | Simplest | +| Cost-obsessive environments | **Best fit** | Good | Expensive | +| Simplicity priority | More moving parts | **Simpler** | Simplest | + +**Use this module** when your private subnet workloads run intermittently (CI/CD, dev environments, batch jobs) and you want to pay nothing when idle. + +**Use fck-nat** when workloads run 24/7 and you want simplicity with ASG self-healing. + +**Use NAT Gateway** when you prioritize simplicity and availability over cost. + +## Usage + +```hcl +module "nat_zero" { + source = "github.com/MachineDotDev/nat-zero" + + name = "my-nat" + vpc_id = module.vpc.vpc_id + availability_zones = ["us-east-1a", "us-east-1b"] + public_subnets = module.vpc.public_subnets + private_subnets = module.vpc.private_subnets + + private_route_table_ids = module.vpc.private_route_table_ids + private_subnets_cidr_blocks = module.vpc.private_subnets_cidr_blocks + + tags = { + Environment = "dev" + } +} +``` + +See [`examples/basic/`](examples/basic/) for a complete working example. + +## Cost Estimate + +Per AZ, per month. Accounts for the [AWS public IPv4 charge](https://aws.amazon.com/blogs/aws/new-aws-public-ipv4-address-charge-public-ip-insights/) ($0.005/hr per public IP, effective Feb 2024). + +| State | This Module | fck-nat | NAT Gateway | +|-------|------------|---------|-------------| +| **Idle** (no workloads) | **~$0.80** (EBS only) | ~$7-8 (instance + EIP) | ~$36+ ($32 gw + $3.60 IP) | +| **Active** (workloads running) | ~$7-8 (instance + EBS + EIP) | ~$7-8 (same) | ~$36+ (+ $0.045/GB) | + +Key cost difference: this module **releases the EIP when idle**, avoiding the $3.60/mo public IPv4 charge. fck-nat keeps an EIP attached 24/7. 
+ +## Startup Latency + +| Scenario | Time to Connectivity | +|----------|---------------------| +| First workload in AZ (cold create) | **~15 seconds** | +| NAT already running | **Instant** | +| Restart from stopped (after idle) | **~12 seconds** | + +The first workload instance in an AZ will not have internet access for approximately 15 seconds. Design startup scripts to retry outbound connections. Subsequent instances in the same AZ get connectivity immediately since the route table already points to the running NAT. + +See [docs/PERFORMANCE.md](docs/PERFORMANCE.md) for detailed timing breakdowns and instance type benchmarks. + +## Important Notes + +- **EventBridge scope**: The EventBridge rule captures ALL EC2 state changes in the account. The Lambda filters events by VPC ID, so it only acts on instances in the target VPC. +- **EIP behavior**: An Elastic IP is allocated when a NAT instance starts and released when it stops. You are not charged for EIPs while the NAT instance is stopped. +- **fck-nat AMI**: By default, this module uses the public fck-nat AMI (`568608671756`). You can override this with `use_fck_nat_ami = false` and provide `custom_ami_owner` + `custom_ami_name_pattern`, or set `ami_id` directly. +- **Dual ENI**: Each AZ gets a pair of persistent ENIs (public + private). These survive instance stop/start cycles, preserving route table entries. +- **Dead Letter Queue**: Failed Lambda invocations are sent to an SQS DLQ for debugging. 
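Because the EventBridge rule is account-wide, the VPC check is the first gate in the handler. A minimal sketch of that filter — plain Go with invented names; the real Lambda reads these fields from a `DescribeInstances` response via the AWS SDK:

```go
package main

import "fmt"

// instanceInfo is a trimmed, illustrative view of a described instance.
type instanceInfo struct {
	VpcID string
	Tags  map[string]string
}

// relevant reports whether the Lambda should act on an event for this
// instance: it must live in the target VPC and not carry the ignore tag.
func relevant(inst instanceInfo, targetVPC, ignoreKey, ignoreValue string) bool {
	if inst.VpcID != targetVPC {
		return false // event from another VPC in the account: drop it
	}
	if inst.Tags[ignoreKey] == ignoreValue {
		return false // operator opted this instance out of NAT management
	}
	return true
}

func main() {
	inst := instanceInfo{VpcID: "vpc-aaa", Tags: map[string]string{"nat-zero:ignore": "true"}}
	fmt.Println(relevant(inst, "vpc-aaa", "nat-zero:ignore", "true")) // false
}
```

Instances tagged with the configurable `ignore_tag_key`/`ignore_tag_value` pair are skipped even inside the target VPC.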
+ +## Requirements + +| Name | Version | +|------|---------| +| terraform | >= 1.3 | +| aws | >= 5.0 | +| archive | >= 2.0 | + +## Providers + +| Name | Version | +|------|---------| +| aws | >= 5.0 | +| archive | >= 2.0 | + +## Resources + +| Name | Type | +|------|------| +| aws_cloudwatch_event_rule.ec2_state_change | resource | +| aws_cloudwatch_event_target.state_change_lambda_target | resource | +| aws_cloudwatch_log_group.nat_zero_logs | resource | +| aws_iam_instance_profile.nat_instance_profile | resource | +| aws_iam_role.lambda_iam_role | resource | +| aws_iam_role.nat_instance_role | resource | +| aws_iam_role_policy.lambda_iam_policy | resource | +| aws_iam_role_policy_attachment.lambda_basic_policy_attachment | resource | +| aws_iam_role_policy_attachment.ssm_policy_attachment | resource | +| aws_lambda_function.nat_zero | resource | +| aws_lambda_function_event_invoke_config.nat_zero_invoke_config | resource | +| aws_lambda_permission.allow_ec2_state_change_eventbridge | resource | +| aws_launch_template.nat_launch_template | resource | +| aws_network_interface.nat_private_network_interface | resource | +| aws_network_interface.nat_public_network_interface | resource | +| aws_route.nat_route | resource | +| aws_security_group.nat_security_group | resource | +| aws_sqs_queue.lambda_dlq | resource | +| archive_file.nat_zero | data source | + +## Inputs + +| Name | Description | Type | Default | Required | +|------|-------------|------|---------|:--------:| +| name | Name prefix for all resources | `string` | n/a | yes | +| vpc_id | VPC ID where NAT instances will be deployed | `string` | n/a | yes | +| availability_zones | List of AZs to deploy NAT instances in | `list(string)` | n/a | yes | +| public_subnets | Public subnet IDs (one per AZ) | `list(string)` | n/a | yes | +| private_subnets | Private subnet IDs (one per AZ) | `list(string)` | n/a | yes | +| private_route_table_ids | Route table IDs for private subnets (one per AZ) | `list(string)` | 
n/a | yes |
+| private_subnets_cidr_blocks | CIDR blocks for private subnets (one per AZ) | `list(string)` | n/a | yes |
+| tags | Additional tags for all resources | `map(string)` | `{}` | no |
+| instance_type | EC2 instance type for NAT instances | `string` | `"t4g.nano"` | no |
+| market_type | `"spot"` or `"on-demand"` | `string` | `"on-demand"` | no |
+| block_device_size | Root volume size in GB | `number` | `10` | no |
+| use_fck_nat_ami | Use the public fck-nat AMI | `bool` | `true` | no |
+| ami_id | Explicit AMI ID (overrides lookup) | `string` | `null` | no |
+| custom_ami_owner | AMI owner account when not using fck-nat | `string` | `null` | no |
+| custom_ami_name_pattern | AMI name pattern when not using fck-nat | `string` | `null` | no |
+| nat_tag_key | Tag key to identify NAT instances | `string` | `"nat-zero:managed"` | no |
+| nat_tag_value | Tag value to identify NAT instances | `string` | `"true"` | no |
+| ignore_tag_key | Tag key to mark instances the Lambda should ignore | `string` | `"nat-zero:ignore"` | no |
+| ignore_tag_value | Tag value to mark instances the Lambda should ignore | `string` | `"true"` | no |
+| log_retention_days | CloudWatch log retention in days | `number` | `14` | no |
+
+## Outputs
+
+| Name | Description |
+|------|-------------|
+| lambda_function_arn | ARN of the nat-zero Lambda function |
+| lambda_function_name | Name of the nat-zero Lambda function |
+| nat_security_group_ids | Security group IDs (one per AZ) |
+| nat_public_eni_ids | Public ENI IDs (one per AZ) |
+| nat_private_eni_ids | Private ENI IDs (one per AZ) |
+| launch_template_ids | Launch template IDs (one per AZ) |
+| eventbridge_rule_arn | ARN of the EventBridge rule |
+| dlq_arn | ARN of the dead letter queue |
+
+## Contributing
+
+Contributions are welcome! Please open an issue or submit a pull request.
+ +## License + +MIT diff --git a/docs/PERFORMANCE.md b/docs/PERFORMANCE.md new file mode 100644 index 0000000..2a757e1 --- /dev/null +++ b/docs/PERFORMANCE.md @@ -0,0 +1,155 @@ +# Performance and Cost + +Startup latency, operational timing, instance type guidance, and cost comparisons for the nat-zero module. All measurements from integration tests running in us-east-1 with `t4g.nano` instances. + +## Startup Latency + +**First workload in an AZ — NAT created from scratch: ~15 seconds to connectivity.** + +``` + 0.0 s Workload instance enters "running" state + 0.3 s EventBridge delivers workload event to Lambda + 0.4 s Lambda cold start completes (55-67 ms init) + 0.9 s Lambda classifies instance, checks for existing NAT + 2.3 s Lambda calls RunInstances — NAT instance is now "pending" + Lambda returns. EIP will be attached separately via EventBridge. + +~12 s NAT instance reaches "running" state + (fck-nat AMI boots, configures iptables, attaches ENIs) +~12.3 s EventBridge delivers NAT "running" event to Lambda +~12.5 s Lambda allocates EIP and associates to public ENI + +~15 s Workload can reach the internet via NAT +``` + +The ~10 second gap between `RunInstances` and NAT reaching "running" is spent on EC2 placement (~2-3 s), OS boot (~3-4 s), and fck-nat network configuration (~2-3 s). This is consistent across all instance types tested — the bottleneck is EC2's instance lifecycle, not CPU or memory. + +### Restart from stopped state: ~12 seconds + +When a stopped NAT is restarted (new workload arrives after previous scale-down): + +``` + 0.0 s New workload enters "running" state + 0.3 s EventBridge delivers workload event to Lambda + 0.4 s Lambda classifies, finds stopped NAT → calls StartInstances + Lambda returns. 
+ +~10 s NAT instance reaches "running" state (reboot from stopped) +~10.3 s EventBridge delivers NAT "running" event to Lambda +~10.5 s Lambda allocates EIP and associates to public ENI + +~12 s Workload can reach the internet via NAT +``` + +Restart is ~3 seconds faster than cold create because `StartInstances` is faster than `RunInstances` and skips AMI/launch template resolution. + +### NAT already running: instant + +If a NAT is already running in the AZ (e.g. second workload starts), no action is needed. The route table already points to the NAT's private ENI, so connectivity is immediate. + +### Summary table + +| Scenario | Lambda Duration | Time to NAT Running + EIP | Time to Connectivity | +|----------|-----------------|--------------------------|---------------------| +| First workload (cold create) | ~2 s | ~12 s | **~15 s** | +| NAT already running | — | — | **0 s** | +| Restart from stopped | ~0.5 s | ~10 s | **~12 s** | +| Config outdated (replace) | ~60+ s | ~12 s | **~70 s** | + +## Scale-Down Timing + +When the last workload in an AZ stops or terminates: + +``` + 0.0 s Last workload enters "shutting-down" state + 0.3 s EventBridge delivers workload event to Lambda + 0.4 s Lambda classifies, finds NAT, checks for sibling workloads + 4.5 s No siblings after 3 retries (2 s apart) → calls StopInstances + Lambda returns. + +~15 s NAT instance reaches "stopped" state +~15.3 s EventBridge delivers NAT "stopped" event to Lambda +~15.5 s Lambda disassociates and releases EIP + +~16 s EIP released, no IPv4 charge +``` + +The 3x retry with 2-second delays (~4 seconds total) is a safety margin to prevent flapping when instances are being replaced. The Lambda only checks for `pending` or `running` siblings — stopping or terminated instances don't count. + +## Lambda Execution + +The Lambda is a compiled Go binary on the `provided.al2023` runtime with 256 MB memory. 
+ +| Metric | Duration | Notes | +|--------|----------|-------| +| Cold start (Init Duration) | 55-67 ms | Go binary; no interpreter overhead | +| classify (DescribeInstances) | 100-700 ms | Single API call; varies with API latency | +| findNAT (DescribeInstances) | 65-100 ms | Filter by tag + AZ + VPC | +| resolveAMI (DescribeImages) | 60-120 ms | Sorts by creation date | +| resolveLT (DescribeLaunchTemplates) | 70-100 ms | Filter by AZ + VPC tags | +| RunInstances | 1.2-1.6 s | AWS API latency | +| attachEIP (Allocate + Associate) | 150-300 ms | Includes idempotency check | +| detachEIP (Disassociate + Release) | 100-200 ms | Includes idempotency check | +| **Scale-up handler total** | **~2 s** | classify + findNAT + createNAT | +| **Scale-down handler total** | **~5 s** | classify + findNAT + 3x findSiblings + stopNAT | +| **attachEIP handler total** | **~0.5 s** | classify + waitForState + attachEIP | +| **detachEIP handler total** | **~0.5 s** | classify + waitForState + detachEIP | + +### Comparison with Python Lambda + +The previous Python implementation used the `python3.11` runtime with 128 MB memory. + +| Metric | Python 3.11 (128 MB) | Go (256 MB) | Improvement | +|--------|----------------------|-------------|-------------| +| Cold start | 667 ms | 55-67 ms | **~90% faster** | +| Handler total (scale-up) | 2,439 ms | ~2,000 ms | **~18% faster** | +| Max memory used | 98 MB | 30 MB | **69% less** | + +## What This Means for Your Workloads + +- **First workload takes ~15 seconds to get internet.** Design startup scripts to retry outbound connections (e.g. `apt update`, `pip install`, `curl`). Most package managers already retry. +- **Subsequent workloads are instant.** Once a NAT is running in an AZ, the route table already points to it. +- **Restart after idle is ~12 seconds.** If your workloads run sporadically (CI jobs, cron tasks), expect a ~12 second delay when the first job starts after an idle period. 
+
+- **Scale-down is conservative.** The Lambda waits ~4 seconds (3 checks, 2 s apart) before stopping a NAT, preventing flapping during instance replacements.
+- **Instance type doesn't affect startup time.** The ~10 second EC2 boot time is the same for `t4g.nano` and `c7gn.medium`.
+
+## Cost
+
+Per AZ, per month. All prices are us-east-1 on-demand. Includes the [AWS public IPv4 charge](https://aws.amazon.com/blogs/aws/new-aws-public-ipv4-address-charge-public-ip-insights/) ($0.005/hr per public IP).
+
+### Idle vs active
+
+| State | nat-zero | fck-nat | NAT Gateway |
+|-------|----------|---------|-------------|
+| **Idle** (no workloads) | **~$0.80** | ~$7-8 | ~$36+ |
+| **Active** (workloads running) | ~$7-8 | ~$7-8 | ~$36+ |
+
+**Idle breakdown**: EBS volume only (~$0.80/mo for 10 GB gp3). No instance running, no EIP allocated.
+
+**Active breakdown**: t4g.nano instance ($3.07/mo) + EIP ($3.60/mo) + EBS ($0.80/mo) = ~$7.50/mo.
+
+The key difference: nat-zero **releases the EIP when idle**, saving the $3.60/mo public IPv4 charge that fck-nat and NAT Gateway pay 24/7.
+
+### Instance type options
+
+| Instance Type | vCPUs | RAM | Network | $/hour | $/month (24x7) | $/month (12hr/day) |
+|---------------|-------|-----|---------|--------|---------------|-------------------|
+| **t4g.nano** (default) | 2 | 0.5 GiB | Up to 5 Gbps | $0.0042 | $3.07 | $1.53 |
+| t4g.micro | 2 | 1 GiB | Up to 5 Gbps | $0.0084 | $6.13 | $3.07 |
+| t4g.small | 2 | 2 GiB | Up to 5 Gbps | $0.0168 | $12.26 | $6.13 |
+| c7gn.medium | 1 | 2 GiB | Up to 25 Gbps | $0.0624 | $45.55 | $22.78 |
+
+Spot pricing typically offers 60-70% savings on t4g instances. Use `market_type = "spot"` to enable.
+ +### Choosing an instance type + +**t4g.nano** (default) is right for most workloads: +- Handles typical dev/staging NAT traffic +- Burstable up to 5 Gbps with CPU credits +- $3/month on-demand, ~$1/month on spot + +**t4g.micro / t4g.small** — consider if you need sustained throughput beyond t4g.nano's baseline or workloads transfer large volumes consistently. + +**c7gn.medium** — consider if you need consistently high network throughput (up to 25 Gbps). At $45/month it's still cheaper than NAT Gateway for most data transfer patterns. + +Instance type does **not** affect startup time (~12 s regardless), only maximum sustained throughput and monthly cost. diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..30791e4 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,86 @@ +## Requirements + +| Name | Version | +|------|---------| +| [terraform](#requirement\_terraform) | >= 1.3 | +| [aws](#requirement\_aws) | >= 5.0 | +| [null](#requirement\_null) | >= 3.0 | +| [time](#requirement\_time) | >= 0.9 | + +## Providers + +| Name | Version | +|------|---------| +| [aws](#provider\_aws) | >= 5.0 | +| [null](#provider\_null) | >= 3.0 | +| [time](#provider\_time) | >= 0.9 | + +## Modules + +No modules. 
+ +## Resources + +| Name | Type | +|------|------| +| [aws_cloudwatch_event_rule.ec2_state_change](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_event_rule) | resource | +| [aws_cloudwatch_event_target.state_change_lambda_target](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_event_target) | resource | +| [aws_cloudwatch_log_group.nat_zero_logs](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_log_group) | resource | +| [aws_iam_instance_profile.nat_instance_profile](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_instance_profile) | resource | +| [aws_iam_role.lambda_iam_role](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role) | resource | +| [aws_iam_role.nat_instance_role](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role) | resource | +| [aws_iam_role_policy.lambda_iam_policy](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy) | resource | +| [aws_iam_role_policy_attachment.ssm_policy_attachment](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy_attachment) | resource | +| [aws_lambda_function.nat_zero](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_function) | resource | +| [aws_lambda_function_event_invoke_config.nat_zero_invoke_config](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_function_event_invoke_config) | resource | +| [aws_lambda_invocation.cleanup](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_invocation) | resource | +| [aws_lambda_permission.allow_ec2_state_change_eventbridge](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_permission) | resource | +| 
[aws_launch_template.nat_launch_template](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/launch_template) | resource | +| [aws_network_interface.nat_private_network_interface](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/network_interface) | resource | +| [aws_network_interface.nat_public_network_interface](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/network_interface) | resource | +| [aws_route.nat_route](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/route) | resource | +| [aws_security_group.nat_security_group](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/security_group) | resource | +| [null_resource.build_lambda](https://registry.terraform.io/providers/hashicorp/null/latest/docs/resources/resource) | resource | +| [null_resource.download_lambda](https://registry.terraform.io/providers/hashicorp/null/latest/docs/resources/resource) | resource | +| [time_sleep.lambda_ready](https://registry.terraform.io/providers/hashicorp/time/latest/docs/resources/sleep) | resource | + +## Inputs + +| Name | Description | Type | Default | Required | +|------|-------------|------|---------|:--------:| +| [ami\_id](#input\_ami\_id) | Explicit AMI ID to use (overrides AMI lookup entirely) | `string` | `null` | no | +| [availability\_zones](#input\_availability\_zones) | List of availability zones to deploy NAT instances in | `list(string)` | n/a | yes | +| [block\_device\_size](#input\_block\_device\_size) | Size in GB of the root EBS volume | `number` | `10` | no | +| [build\_lambda\_locally](#input\_build\_lambda\_locally) | Build the Lambda binary from Go source instead of downloading a pre-compiled release. Requires Go and zip installed locally. 
| `bool` | `false` | no | +| [custom\_ami\_name\_pattern](#input\_custom\_ami\_name\_pattern) | AMI name pattern when use\_fck\_nat\_ami is false | `string` | `null` | no | +| [custom\_ami\_owner](#input\_custom\_ami\_owner) | AMI owner account ID when use\_fck\_nat\_ami is false | `string` | `null` | no | +| [enable\_logging](#input\_enable\_logging) | Create a CloudWatch log group for the Lambda function | `bool` | `true` | no | +| [ignore\_tag\_key](#input\_ignore\_tag\_key) | Tag key used to mark instances the Lambda should ignore | `string` | `"nat-zero:ignore"` | no | +| [ignore\_tag\_value](#input\_ignore\_tag\_value) | Tag value used to mark instances the Lambda should ignore | `string` | `"true"` | no | +| [instance\_type](#input\_instance\_type) | Instance type for the NAT instance | `string` | `"t4g.nano"` | no | +| [lambda\_binary\_url](#input\_lambda\_binary\_url) | URL to the pre-compiled Go Lambda zip. Updated automatically by CI. | `string` | `"https://github.com/MachineDotDev/nat-zero/releases/download/nat-zero-lambda-latest/lambda.zip"` | no | +| [lambda\_memory\_size](#input\_lambda\_memory\_size) | Memory allocated to the Lambda function in MB (also scales CPU proportionally) | `number` | `256` | no | +| [log\_retention\_days](#input\_log\_retention\_days) | CloudWatch log retention in days (only used when enable\_logging is true) | `number` | `14` | no | +| [market\_type](#input\_market\_type) | Whether to use spot or on-demand instances | `string` | `"on-demand"` | no | +| [name](#input\_name) | Name prefix for all resources created by this module | `string` | n/a | yes | +| [nat\_tag\_key](#input\_nat\_tag\_key) | Tag key used to identify NAT instances | `string` | `"nat-zero:managed"` | no | +| [nat\_tag\_value](#input\_nat\_tag\_value) | Tag value used to identify NAT instances | `string` | `"true"` | no | +| [private\_route\_table\_ids](#input\_private\_route\_table\_ids) | Route table IDs for the private subnets (one per AZ) | 
`list(string)` | n/a | yes | +| [private\_subnets](#input\_private\_subnets) | Private subnet IDs (one per AZ) for NAT instance private ENIs | `list(string)` | n/a | yes | +| [private\_subnets\_cidr\_blocks](#input\_private\_subnets\_cidr\_blocks) | CIDR blocks for the private subnets (one per AZ, used in security group rules) | `list(string)` | n/a | yes | +| [public\_subnets](#input\_public\_subnets) | Public subnet IDs (one per AZ) for NAT instance public ENIs | `list(string)` | n/a | yes | +| [tags](#input\_tags) | Additional tags to apply to all resources | `map(string)` | `{}` | no | +| [use\_fck\_nat\_ami](#input\_use\_fck\_nat\_ami) | Use the public fck-nat AMI. Set to false to use a custom AMI. | `bool` | `true` | no | +| [vpc\_id](#input\_vpc\_id) | The VPC ID where NAT instances will be deployed | `string` | n/a | yes | + +## Outputs + +| Name | Description | +|------|-------------| +| [eventbridge\_rule\_arn](#output\_eventbridge\_rule\_arn) | ARN of the EventBridge rule capturing EC2 state changes | +| [lambda\_function\_arn](#output\_lambda\_function\_arn) | ARN of the nat-zero Lambda function | +| [lambda\_function\_name](#output\_lambda\_function\_name) | Name of the nat-zero Lambda function | +| [launch\_template\_ids](#output\_launch\_template\_ids) | Launch template IDs for NAT instances (one per AZ) | +| [nat\_private\_eni\_ids](#output\_nat\_private\_eni\_ids) | Private ENI IDs for NAT instances (one per AZ) | +| [nat\_public\_eni\_ids](#output\_nat\_public\_eni\_ids) | Public ENI IDs for NAT instances (one per AZ) | +| [nat\_security\_group\_ids](#output\_nat\_security\_group\_ids) | Security group IDs for NAT instances (one per AZ) | diff --git a/docs/REFERENCE.md b/docs/REFERENCE.md new file mode 100644 index 0000000..0cd14ab --- /dev/null +++ b/docs/REFERENCE.md @@ -0,0 +1,86 @@ +## Requirements + +| Name | Version | +|------|---------| +| [terraform](#requirement\_terraform) | >= 1.3 | +| [aws](#requirement\_aws) | >= 5.0 | +| 
[null](#requirement\_null) | >= 3.0 | +| [time](#requirement\_time) | >= 0.9 | + +## Providers + +| Name | Version | +|------|---------| +| [aws](#provider\_aws) | >= 5.0 | +| [null](#provider\_null) | >= 3.0 | +| [time](#provider\_time) | >= 0.9 | + +## Modules + +No modules. + +## Resources + +| Name | Type | +|------|------| +| [aws_cloudwatch_event_rule.ec2_state_change](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_event_rule) | resource | +| [aws_cloudwatch_event_target.state_change_lambda_target](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_event_target) | resource | +| [aws_cloudwatch_log_group.nat_zero_logs](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_log_group) | resource | +| [aws_iam_instance_profile.nat_instance_profile](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_instance_profile) | resource | +| [aws_iam_role.lambda_iam_role](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role) | resource | +| [aws_iam_role.nat_instance_role](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role) | resource | +| [aws_iam_role_policy.lambda_iam_policy](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy) | resource | +| [aws_iam_role_policy_attachment.ssm_policy_attachment](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy_attachment) | resource | +| [aws_lambda_function.nat_zero](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_function) | resource | +| [aws_lambda_function_event_invoke_config.nat_zero_invoke_config](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_function_event_invoke_config) | resource | +| 
[aws_lambda_invocation.cleanup](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_invocation) | resource | +| [aws_lambda_permission.allow_ec2_state_change_eventbridge](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_permission) | resource | +| [aws_launch_template.nat_launch_template](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/launch_template) | resource | +| [aws_network_interface.nat_private_network_interface](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/network_interface) | resource | +| [aws_network_interface.nat_public_network_interface](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/network_interface) | resource | +| [aws_route.nat_route](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/route) | resource | +| [aws_security_group.nat_security_group](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/security_group) | resource | +| [null_resource.build_lambda](https://registry.terraform.io/providers/hashicorp/null/latest/docs/resources/resource) | resource | +| [null_resource.download_lambda](https://registry.terraform.io/providers/hashicorp/null/latest/docs/resources/resource) | resource | +| [time_sleep.lambda_ready](https://registry.terraform.io/providers/hashicorp/time/latest/docs/resources/sleep) | resource | + +## Inputs + +| Name | Description | Type | Default | Required | +|------|-------------|------|---------|:--------:| +| [ami\_id](#input\_ami\_id) | Explicit AMI ID to use (overrides AMI lookup entirely) | `string` | `null` | no | +| [availability\_zones](#input\_availability\_zones) | List of availability zones to deploy NAT instances in | `list(string)` | n/a | yes | +| [block\_device\_size](#input\_block\_device\_size) | Size in GB of the root EBS volume | `number` | `10` | no | +| 
[build\_lambda\_locally](#input\_build\_lambda\_locally) | Build the Lambda binary from Go source instead of downloading a pre-compiled release. Requires Go and zip installed locally. | `bool` | `false` | no | +| [custom\_ami\_name\_pattern](#input\_custom\_ami\_name\_pattern) | AMI name pattern when use\_fck\_nat\_ami is false | `string` | `null` | no | +| [custom\_ami\_owner](#input\_custom\_ami\_owner) | AMI owner account ID when use\_fck\_nat\_ami is false | `string` | `null` | no | +| [enable\_logging](#input\_enable\_logging) | Create a CloudWatch log group for the Lambda function | `bool` | `true` | no | +| [ignore\_tag\_key](#input\_ignore\_tag\_key) | Tag key used to mark instances the Lambda should ignore | `string` | `"nat-zero:ignore"` | no | +| [ignore\_tag\_value](#input\_ignore\_tag\_value) | Tag value used to mark instances the Lambda should ignore | `string` | `"true"` | no | +| [instance\_type](#input\_instance\_type) | Instance type for the NAT instance | `string` | `"t4g.nano"` | no | +| [lambda\_binary\_url](#input\_lambda\_binary\_url) | URL to the pre-compiled Go Lambda zip. Updated automatically by CI. 
| `string` | `"https://github.com/MachineDotDev/nat-zero/releases/download/nat-zero-lambda-latest/lambda.zip"` | no |
+| [lambda\_memory\_size](#input\_lambda\_memory\_size) | Memory allocated to the Lambda function in MB (also scales CPU proportionally) | `number` | `256` | no |
+| [log\_retention\_days](#input\_log\_retention\_days) | CloudWatch log retention in days (only used when enable\_logging is true) | `number` | `14` | no |
+| [market\_type](#input\_market\_type) | Whether to use spot or on-demand instances | `string` | `"on-demand"` | no |
+| [name](#input\_name) | Name prefix for all resources created by this module | `string` | n/a | yes |
+| [nat\_tag\_key](#input\_nat\_tag\_key) | Tag key used to identify NAT instances | `string` | `"nat-zero:managed"` | no |
+| [nat\_tag\_value](#input\_nat\_tag\_value) | Tag value used to identify NAT instances | `string` | `"true"` | no |
+| [private\_route\_table\_ids](#input\_private\_route\_table\_ids) | Route table IDs for the private subnets (one per AZ) | `list(string)` | n/a | yes |
+| [private\_subnets](#input\_private\_subnets) | Private subnet IDs (one per AZ) for NAT instance private ENIs | `list(string)` | n/a | yes |
+| [private\_subnets\_cidr\_blocks](#input\_private\_subnets\_cidr\_blocks) | CIDR blocks for the private subnets (one per AZ, used in security group rules) | `list(string)` | n/a | yes |
+| [public\_subnets](#input\_public\_subnets) | Public subnet IDs (one per AZ) for NAT instance public ENIs | `list(string)` | n/a | yes |
+| [tags](#input\_tags) | Additional tags to apply to all resources | `map(string)` | `{}` | no |
+| [use\_fck\_nat\_ami](#input\_use\_fck\_nat\_ami) | Use the public fck-nat AMI. Set to false to use a custom AMI. 
| `bool` | `true` | no | +| [vpc\_id](#input\_vpc\_id) | The VPC ID where NAT instances will be deployed | `string` | n/a | yes | + +## Outputs + +| Name | Description | +|------|-------------| +| [eventbridge\_rule\_arn](#output\_eventbridge\_rule\_arn) | ARN of the EventBridge rule capturing EC2 state changes | +| [lambda\_function\_arn](#output\_lambda\_function\_arn) | ARN of the nat-zero Lambda function | +| [lambda\_function\_name](#output\_lambda\_function\_name) | Name of the nat-zero Lambda function | +| [launch\_template\_ids](#output\_launch\_template\_ids) | Launch template IDs for NAT instances (one per AZ) | +| [nat\_private\_eni\_ids](#output\_nat\_private\_eni\_ids) | Private ENI IDs for NAT instances (one per AZ) | +| [nat\_public\_eni\_ids](#output\_nat\_public\_eni\_ids) | Public ENI IDs for NAT instances (one per AZ) | +| [nat\_security\_group\_ids](#output\_nat\_security\_group\_ids) | Security group IDs for NAT instances (one per AZ) | diff --git a/docs/TESTING.md b/docs/TESTING.md new file mode 100644 index 0000000..790ad8b --- /dev/null +++ b/docs/TESTING.md @@ -0,0 +1,169 @@ +# Integration Tests + +The integration tests live in `tests/integration/` and use [Terratest](https://terratest.gruntwork.io/) (Go) to deploy real AWS infrastructure, exercise the Lambda, and tear it down. + +They run in CI via the `terratest` GitHub Actions job against `us-east-1`. + +## Test Fixture + +The Terraform fixture at `tests/integration/fixture/main.tf` creates: + +- A **private subnet** (`172.31.128.0/24`) in the account's default VPC +- A **route table** and association for that subnet +- The **nat_zero module** (`name = "nat-test"`) wired to the private subnet and a default public subnet + +All module resources (Lambda, EventBridge, ENIs, security groups, launch templates, IAM roles) are created inside the module. 
+
+## TestNatZero
+
+A single test that exercises the full NAT lifecycle in four phases using subtests, with one `terraform apply` / `destroy` cycle. Each phase records wall-clock timing for the [TIMING SUMMARY](#timing-summary) printed at the end of the test.
+
+### Setup
+
+1. **Create workload IAM profile** — An IAM role/profile (`nat-test-wl-tt-`) is created that allows the workload instance to call `ec2:CreateTags` on itself. This lets the user-data script tag the instance with its egress IP. The profile is deferred for deletion at the end of the test.
+
+2. **Terraform apply** — Runs `terraform init` and `terraform apply` on the fixture. This creates the private subnet, route table, and the entire nat_zero module (Lambda, EventBridge rule, ENIs, security groups, launch template, IAM roles). `terraform destroy` is deferred for cleanup.
+
+3. **Read Terraform outputs** — Captures `vpc_id`, `private_subnet_id`, and `lambda_function_name` from the Terraform state.
+
+4. **Register cleanup handlers** — Defers workload instance termination and a Lambda log dumper that prints CloudWatch logs if the test fails.
+
+### Phase 1: NATCreationAndConnectivity
+
+Verifies the scale-up path: the workload starts, the NAT comes up with an EIP, and the workload reaches the internet through the NAT.
+
+1. **Launch workload instance** — Launches a `t4g.nano` EC2 instance in the private subnet with a user-data script. The script retries `curl https://checkip.amazonaws.com` every 2 seconds until the NAT provides internet, then tags the instance with `EgressIP=<egress-ip>`.
+
+2. **Invoke Lambda** — Calls the Lambda with `{"instance_id": "<workload-id>", "state": "running"}`, bypassing EventBridge for reliability. This triggers `createNAT` (RunInstances).
+
+3. **Wait for NAT with EIP** — Polls every 2 seconds for a NAT instance that is running with an EIP on its public ENI (device index 0). The EIP is attached by a separate Lambda invocation triggered by the NAT's "running" EventBridge event.
+
+4.
**Validate NAT configuration** — Asserts:
+   - NAT has the `nat-zero:managed=true` tag
+   - NAT has dual ENIs at device index 0 (public) and 1 (private)
+   - A `0.0.0.0/0` route exists pointing to the NAT's private ENI
+
+5. **Verify workload connectivity** — Polls for the workload's `EgressIP` tag. Asserts the egress IP matches the NAT's EIP.
+
+### Phase 2: NATScaleDown
+
+Verifies the scale-down path: the workload terminates, the NAT stops, and the EIP is released.
+
+1. **Terminate workload** — Terminates the Phase 1 workload and waits for termination.
+
+2. **Invoke Lambda (scale-down)** — Calls the Lambda with `{"instance_id": "<workload-id>", "state": "terminated"}`. This triggers `maybeStopNAT` → 3x sibling check → `stopNAT` (StopInstances).
+
+3. **Wait for NAT stopped** — Polls until the NAT reaches the `stopped` state.
+
+4. **Invoke Lambda (detach EIP)** — Calls the Lambda with `{"instance_id": "<nat-id>", "state": "stopped"}` to simulate the EventBridge event. This triggers `detachEIP` → DisassociateAddress + ReleaseAddress.
+
+5. **Verify EIP released** — Polls until no EIPs tagged `nat-zero:managed=true` remain.
+
+### Phase 3: NATRestart
+
+Verifies the restart path: a new workload starts, the stopped NAT is restarted with a new EIP, and the workload gets connectivity.
+
+1. **Launch new workload** — Launches a new `t4g.nano` in the private subnet.
+
+2. **Invoke Lambda (restart)** — Calls the Lambda with `{"instance_id": "<workload-id>", "state": "running"}`. This triggers `ensureNAT` → finds stopped NAT → `startNAT` (StartInstances).
+
+3. **Wait for NAT with EIP** — Polls until the NAT is running with a new EIP (attached via EventBridge).
+
+4. **Verify connectivity** — Polls for the new workload's `EgressIP` tag and confirms internet access.
+
+### Phase 4: CleanupAction
+
+Verifies that the destroy-time cleanup action works correctly.
+
+1. **Count EIPs** — Asserts at least one NAT EIP exists before cleanup.
+
+2. **Invoke cleanup** — Calls the Lambda with `{"action": "cleanup"}`.
The Lambda terminates all NAT instances and releases all EIPs. + +3. **Verify resources cleaned** — Polls until no running NAT instances and no NAT EIPs remain. + +### Teardown (deferred, runs in LIFO order) + +1. Lambda log dump (only on failure) +2. Terminate test workload instances and wait +3. `terraform destroy` — removes all Terraform-managed resources +4. Delete workload IAM profile + +## Timing Summary + +The test prints a timing summary at the end showing wall-clock duration of each phase: + +``` +=== TIMING SUMMARY === + PHASE DURATION + ------------------------------------------------------------ + IAM profile creation 1.234s + Terraform init+apply 45.678s + Launch workload instance 0.890s + Lambda invoke (scale-up) 2.345s + Wait for NAT running with EIP 14.567s + Wait for workload egress IP 25.890s + Terminate workload instance 30.123s + Lambda invoke (scale-down) 5.456s + Wait for NAT stopped 45.678s + Lambda invoke (detach EIP) 1.234s + Wait for EIP released 2.345s + Launch workload instance (restart) 0.890s + Lambda invoke (restart) 0.567s + Wait for NAT restarted with EIP 12.345s + Wait for workload egress IP (restart) 20.123s + Lambda invoke (cleanup) 45.678s + Wait for NAT terminated 5.678s + Wait for EIPs released 1.234s + Terraform destroy 60.123s + ------------------------------------------------------------ + TOTAL 5m15.678s +=== END TIMING SUMMARY === +``` + +Key timings to watch: +- **Wait for NAT running with EIP**: How long from Lambda invocation to NAT with internet (cold create). Expect ~14 s. +- **Wait for NAT restarted with EIP**: Same metric for restart path. Expect ~12 s. +- **Lambda invoke (scale-down)**: Includes the 3x sibling retry (~4 s). Expect ~5 s. + +## TestNoOrphanedResources + +Runs after the main test. Searches for AWS resources with the `nat-test` prefix that were left behind by failed test runs. 
Checks for:
+
+- Subnet with the test CIDR (`172.31.128.0/24`)
+- ENIs, security groups, and launch templates named `nat-test-*`
+- EventBridge rules named `nat-test-*`
+- Lambda function `nat-test-nat-zero`
+- CloudWatch log group `/aws/lambda/nat-test-*`
+- IAM roles and instance profiles prefixed `nat-test`
+- EIPs tagged `nat-zero:managed=true`
+
+If any are found, the test fails and lists them for manual cleanup.
+
+## Why the Cleanup Action Matters
+
+NAT instances and EIPs are created by the Lambda at runtime, not by Terraform, so during `terraform destroy` Terraform doesn't know they exist. Without the cleanup action:
+
+1. `terraform destroy` tries to delete the ENIs
+2. The ENIs are still attached to running NAT instances
+3. Deletion fails, leaving the stack half-destroyed
+
+The `aws_lambda_invocation.cleanup` resource invokes the Lambda with `{"action": "cleanup"}` during destroy, which terminates instances and releases EIPs before Terraform tries to remove ENIs and security groups.
+
+## Config Version Replacement
+
+The Lambda tracks a `CONFIG_VERSION` hash (derived from the AMI, instance type, market type, and volume size). When a workload scales up and the existing NAT has an outdated `ConfigVersion` tag, the Lambda:
+
+1. Terminates the outdated NAT instance
+2. Waits for the ENIs to become available
+3. Creates a new NAT instance with the current config
+
+This ensures AMI or instance type changes propagate to NAT instances without manual intervention.
+
+## Running Locally
+
+```bash
+cd nat-zero/tests/integration
+go test -v -timeout 30m
+```
+
+Requires AWS credentials with permissions to create and destroy all resources in the fixture (EC2, IAM, Lambda, EventBridge, CloudWatch).
diff --git a/eventbridge.tf b/eventbridge.tf
new file mode 100644
index 0000000..2e11afa
--- /dev/null
+++ b/eventbridge.tf
@@ -0,0 +1,39 @@
+# EventBridge rule for EC2 instance state-change notifications.
+# The events are handled by the nat-zero Lambda.
+# A single rule covers all AZs.
+resource "aws_cloudwatch_event_rule" "ec2_state_change" {
+  name        = "${var.name}-ec2-state-changes"
+  description = "Capture EC2 state changes for nat-zero ${var.name}"
+
+  event_pattern = jsonencode({
+    source      = ["aws.ec2"]
+    detail-type = ["EC2 Instance State-change Notification"]
+    detail = {
+      state = ["pending", "running", "stopping", "stopped", "shutting-down", "terminated"]
+    }
+  })
+}
+
+resource "aws_cloudwatch_event_target" "state_change_lambda_target" {
+  rule      = aws_cloudwatch_event_rule.ec2_state_change.name
+  target_id = "${var.name}-ec2-state-change-lambda-target"
+  arn       = aws_lambda_function.nat_zero.arn
+
+  # Ensure EventBridge stops invoking the Lambda before the destroy-time
+  # cleanup invocation runs, preventing late invocations from recreating
+  # the CloudWatch log group after Terraform deletes it.
+  depends_on = [aws_lambda_invocation.cleanup]
+
+  input_transformer {
+    input_paths = {
+      instance_id = "$.detail.instance-id"
+      state       = "$.detail.state"
+    }
+    input_template = <<EOF
+{
+  "instance_id": <instance_id>,
+  "state": <state>
+}
+EOF
+  }
+}
diff --git a/examples/basic/main.tf b/examples/basic/main.tf
new file mode 100644
index 0000000..7910af8
--- /dev/null
+++ b/examples/basic/main.tf
@@ -0,0 +1,67 @@
+terraform {
+  required_version = ">= 1.3"
+
+  required_providers {
+    aws = {
+      source  = "hashicorp/aws"
+      version = ">= 5.0"
+    }
+  }
+}
+
+provider "aws" {
+  region = "us-east-1"
+}
+
+data "aws_availability_zones" "available" {
+  state = "available"
+}
+
+locals {
+  azs = slice(data.aws_availability_zones.available.names, 0, 2)
+}
+
+module "vpc" {
+  source  = "terraform-aws-modules/vpc/aws"
+  version = "~> 5.0"
+
+  name = "nat-zero-example"
+  cidr = "10.0.0.0/16"
+
+  azs             = local.azs
+  public_subnets  = ["10.0.1.0/24", "10.0.2.0/24"]
+  private_subnets = ["10.0.101.0/24", "10.0.102.0/24"]
+
+  # Do NOT enable NAT gateway -- this module replaces it
+  enable_nat_gateway = false
+}
+
+module "nat_zero" {
+  source = "../../"
+
+  name   = "example-nat"
+  vpc_id = module.vpc.vpc_id
+
availability_zones = local.azs + public_subnets = module.vpc.public_subnets + private_subnets = module.vpc.private_subnets + + private_route_table_ids = module.vpc.private_route_table_ids + private_subnets_cidr_blocks = module.vpc.private_subnets_cidr_blocks + + # Defaults: t4g.nano, fck-nat AMI, on-demand + # Uncomment for spot instances: + # market_type = "spot" + + tags = { + Environment = "example" + ManagedBy = "terraform" + } +} + +output "lambda_function_name" { + value = module.nat_zero.lambda_function_name +} + +output "nat_security_group_ids" { + value = module.nat_zero.nat_security_group_ids +} diff --git a/iam.tf b/iam.tf new file mode 100644 index 0000000..0365f10 --- /dev/null +++ b/iam.tf @@ -0,0 +1,117 @@ +resource "aws_iam_role" "nat_instance_role" { + name_prefix = var.name + assume_role_policy = jsonencode({ + Version = "2012-10-17" + Statement = [{ + Effect = "Allow" + Principal = { Service = "ec2.amazonaws.com" } + Action = "sts:AssumeRole" + }] + }) + tags = local.common_tags +} + +resource "aws_iam_instance_profile" "nat_instance_profile" { + name_prefix = var.name + role = aws_iam_role.nat_instance_role.name + tags = local.common_tags +} + +resource "aws_iam_role_policy_attachment" "ssm_policy_attachment" { + policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore" + role = aws_iam_role.nat_instance_role.name +} + +resource "aws_iam_role" "lambda_iam_role" { + name = "${var.name}-Lambda-IAM-Role" + assume_role_policy = jsonencode({ + Version = "2012-10-17" + Statement = [{ + Effect = "Allow" + Principal = { Service = "lambda.amazonaws.com" } + Action = "sts:AssumeRole" + }] + }) + tags = local.common_tags +} + +resource "aws_iam_role_policy" "lambda_iam_policy" { + role = aws_iam_role.lambda_iam_role.name + name_prefix = var.name + + policy = jsonencode({ + Version = "2012-10-17" + Statement = concat([ + { + Sid = "EC2ReadOnly" + Effect = "Allow" + Action = [ + "ec2:DescribeInstances", + "ec2:DescribeImages", + 
"ec2:DescribeLaunchTemplates", + "ec2:DescribeLaunchTemplateVersions", + "ec2:DescribeNetworkInterfaces", + "ec2:DescribeAddresses", + ] + Resource = "*" + }, + { + Sid = "EC2RunInstances" + Effect = "Allow" + Action = [ + "ec2:RunInstances", + "ec2:CreateTags", + ] + Resource = "*" + }, + { + Sid = "EC2ManageNatInstances" + Effect = "Allow" + Action = [ + "ec2:StartInstances", + "ec2:StopInstances", + "ec2:TerminateInstances", + ] + Resource = "*" + Condition = { + StringEquals = { + "ec2:ResourceTag/${var.nat_tag_key}" = var.nat_tag_value + } + } + }, + { + Sid = "EIPManagement" + Effect = "Allow" + Action = [ + "ec2:AllocateAddress", + "ec2:ReleaseAddress", + "ec2:AssociateAddress", + "ec2:DisassociateAddress", + ] + Resource = "*" + }, + { + Sid = "PassRoleToNatInstance" + Effect = "Allow" + Action = "iam:PassRole" + Resource = aws_iam_role.nat_instance_role.arn + }, + ], var.enable_logging ? [{ + Sid = "CloudWatchLogs" + Effect = "Allow" + Action = [ + "logs:CreateLogStream", + "logs:PutLogEvents", + ] + Resource = "${aws_cloudwatch_log_group.nat_zero_logs[0].arn}:*" + }] : []) + }) +} + +resource "aws_lambda_permission" "allow_ec2_state_change_eventbridge" { + statement_id = "AllowExecutionFromEC2StateChangeEventBridge" + action = "lambda:InvokeFunction" + function_name = aws_lambda_function.nat_zero.function_name + principal = "events.amazonaws.com" + source_arn = aws_cloudwatch_event_rule.ec2_state_change.arn +} diff --git a/lambda.tf b/lambda.tf new file mode 100644 index 0000000..e37bd35 --- /dev/null +++ b/lambda.tf @@ -0,0 +1,92 @@ +resource "aws_cloudwatch_log_group" "nat_zero_logs" { + count = var.enable_logging ? 1 : 0 + name = "/aws/lambda/${var.name}-nat-zero" + retention_in_days = var.log_retention_days + tags = local.common_tags +} + +# create_duration: waits for IAM role propagation before Lambda is created. +# destroy_duration: when logging is enabled, waits for async CloudWatch log +# delivery to settle before the log group is deleted. 
+resource "time_sleep" "lambda_ready" { + depends_on = [ + aws_cloudwatch_log_group.nat_zero_logs, + aws_iam_role_policy.lambda_iam_policy, + ] + create_duration = "10s" + destroy_duration = var.enable_logging ? "10s" : "0s" +} + +resource "null_resource" "download_lambda" { + count = var.build_lambda_locally ? 0 : 1 + + triggers = { + url = var.lambda_binary_url + } + + provisioner "local-exec" { + command = "test -f ${path.module}/.build/lambda.zip || (mkdir -p ${path.module}/.build && curl -sfL -o ${path.module}/.build/lambda.zip ${var.lambda_binary_url})" + } +} + +resource "null_resource" "build_lambda" { + count = var.build_lambda_locally ? 1 : 0 + + triggers = { + source_hash = sha256(join("", [ + for f in sort(fileset("${path.module}/cmd/lambda", "*.go")) : + filesha256("${path.module}/cmd/lambda/${f}") + ])) + } + + provisioner "local-exec" { + command = <<-EOT + cd ${path.module}/cmd/lambda && \ + GOOS=linux GOARCH=arm64 CGO_ENABLED=0 go build -tags lambda.norpc -ldflags='-s -w' -o bootstrap && \ + zip lambda.zip bootstrap && \ + mkdir -p ../../.build && \ + cp lambda.zip ../../.build/lambda.zip && \ + rm bootstrap lambda.zip + EOT + } +} + +resource "aws_lambda_function" "nat_zero" { + filename = "${path.module}/.build/lambda.zip" + function_name = "${var.name}-nat-zero" + handler = "bootstrap" + role = aws_iam_role.lambda_iam_role.arn + runtime = "provided.al2023" + source_code_hash = fileexists("${path.module}/.build/lambda.zip") ? filebase64sha256("${path.module}/.build/lambda.zip") : null + architectures = ["arm64"] + timeout = 300 + memory_size = var.lambda_memory_size + tags = local.common_tags + + environment { + variables = { + NAT_TAG_KEY = var.nat_tag_key + NAT_TAG_VALUE = var.nat_tag_value + IGNORE_TAG_KEY = var.ignore_tag_key + IGNORE_TAG_VALUE = var.ignore_tag_value + TARGET_VPC_ID = var.vpc_id + AMI_OWNER_ACCOUNT = var.use_fck_nat_ami ? "568608671756" : var.custom_ami_owner + AMI_NAME_PATTERN = var.use_fck_nat_ami ? 
"fck-nat-al2023-*-arm64-*" : var.custom_ami_name_pattern + CONFIG_VERSION = sha256(join(",", [ + var.use_fck_nat_ami ? "568608671756" : var.custom_ami_owner, + var.use_fck_nat_ami ? "fck-nat-al2023-*-arm64-*" : var.custom_ami_name_pattern, + coalesce(var.ami_id, "none"), + var.instance_type, + var.market_type, + tostring(var.block_device_size), + ])) + } + } + + depends_on = [time_sleep.lambda_ready, null_resource.download_lambda, null_resource.build_lambda] +} + +resource "aws_lambda_function_event_invoke_config" "nat_zero_invoke_config" { + function_name = aws_lambda_function.nat_zero.function_name + maximum_retry_attempts = 2 +} diff --git a/launch_template.tf b/launch_template.tf new file mode 100644 index 0000000..cbf06ac --- /dev/null +++ b/launch_template.tf @@ -0,0 +1,79 @@ +locals { + common_tags = merge( + { + Name = var.name + }, + var.tags, + ) +} + +resource "aws_launch_template" "nat_launch_template" { + count = length(var.availability_zones) + name = "${var.name}-${var.availability_zones[count.index]}-launch-template" + instance_type = var.instance_type + image_id = var.ami_id + + iam_instance_profile { + arn = aws_iam_instance_profile.nat_instance_profile.arn + } + + block_device_mappings { + device_name = "/dev/xvda" + + ebs { + volume_size = var.block_device_size + volume_type = "gp3" + iops = 3000 + throughput = 250 + encrypted = true + } + } + + dynamic "instance_market_options" { + for_each = var.market_type == "spot" ? 
[1] : []
+    content {
+      market_type = "spot"
+      spot_options {
+        spot_instance_type             = "one-time"
+        instance_interruption_behavior = "terminate"
+      }
+    }
+  }
+
+  metadata_options {
+    http_endpoint = "enabled"
+    http_tokens   = "required"
+  }
+
+  network_interfaces {
+    network_interface_id  = aws_network_interface.nat_public_network_interface[count.index].id
+    device_index          = 0
+    delete_on_termination = false
+  }
+
+  network_interfaces {
+    device_index          = 1
+    network_interface_id  = aws_network_interface.nat_private_network_interface[count.index].id
+    delete_on_termination = false
+  }
+
+  tag_specifications {
+    resource_type = "instance"
+    tags = merge(
+      local.common_tags,
+      {
+        (var.nat_tag_key) = var.nat_tag_value,
+        Name              = "${var.name}-${var.availability_zones[count.index]}-nat-instance"
+      },
+    )
+  }
+
+  description = "Launch template for NAT instance ${var.name} in ${var.availability_zones[count.index]}"
+  tags = merge(
+    {
+      AvailabilityZone = var.availability_zones[count.index],
+      VpcId            = var.vpc_id,
+    },
+    local.common_tags,
+  )
+}
diff --git a/mkdocs.yml b/mkdocs.yml
new file mode 100644
index 0000000..b008e3f
--- /dev/null
+++ b/mkdocs.yml
@@ -0,0 +1,14 @@
+site_name: nat-zero
+site_description: Scale-to-zero NAT instances for AWS
+repo_url: https://github.com/MachineDotDev/nat-zero
+theme:
+  name: material
+  palette:
+    scheme: default
+nav:
+  - Home: INDEX.md
+  - Architecture: ARCHITECTURE.md
+  - Performance: PERFORMANCE.md
+  - Terraform Reference: REFERENCE.md
+  - Testing: TESTING.md
+  - Examples: EXAMPLES.md
diff --git a/network.tf b/network.tf
new file mode 100644
index 0000000..cbd9a09
--- /dev/null
+++ b/network.tf
@@ -0,0 +1,93 @@
+# Network configuration for the NAT instances.
+# Each resource below is created once per AZ: security groups, ENIs, and
+# route table entries. EIPs are allocated at runtime by the Lambda, not here.
+resource "aws_security_group" "nat_security_group" {
+  count       = length(var.availability_zones)
+  name_prefix = "${var.name}-${var.availability_zones[count.index]}-nat-sg"
+  vpc_id      =
var.vpc_id + description = "Security group for NAT instance ${var.name}" + + # Allow all traffic from private subnets (NAT must pass all protocols) + ingress { + from_port = 0 + to_port = 0 + protocol = "-1" + cidr_blocks = [var.private_subnets_cidr_blocks[count.index]] + } + + # Allow all outbound traffic to the internet + egress { + from_port = 0 + to_port = 0 + protocol = "-1" + cidr_blocks = ["0.0.0.0/0"] + } + + tags = merge( + local.common_tags, + { + Name = "${var.name}-${var.availability_zones[count.index]}-nat-instance-sg", + AZ = var.availability_zones[count.index], + }, + ) +} + +resource "aws_network_interface" "nat_public_network_interface" { + count = length(var.availability_zones) + subnet_id = var.public_subnets[count.index] + security_groups = [aws_security_group.nat_security_group[count.index].id] + source_dest_check = false + description = "Public ENI for NAT instance ${var.name} in ${var.availability_zones[count.index]}" + tags = merge( + local.common_tags, + { + Name = "${var.name}-${var.availability_zones[count.index]}-nat-public-eni" + }, + ) + depends_on = [aws_security_group.nat_security_group] +} + +resource "aws_network_interface" "nat_private_network_interface" { + count = length(var.availability_zones) + security_groups = [aws_security_group.nat_security_group[count.index].id] + subnet_id = var.private_subnets[count.index] + source_dest_check = false + description = "Private ENI for NAT instance ${var.name} in ${var.availability_zones[count.index]}" + tags = merge( + local.common_tags, + { + Name = "${var.name}-${var.availability_zones[count.index]}-nat-private-eni" + }, + ) + depends_on = [aws_security_group.nat_security_group] +} + +resource "aws_route" "nat_route" { + count = length(var.availability_zones) + route_table_id = var.private_route_table_ids[count.index] + destination_cidr_block = "0.0.0.0/0" + network_interface_id = aws_network_interface.nat_private_network_interface[count.index].id + depends_on = 
[aws_network_interface.nat_private_network_interface] +} + +# Cleanup Lambda-created NAT instances and EIPs on terraform destroy. +# These are not Terraform-managed, so they must be removed before the +# ENIs and security groups can be destroyed. +# lifecycle_scope "CRUD" invokes on both create (harmless no-op) and destroy. +# +# Destroy ordering: the cleanup invocation runs while the Lambda function, +# IAM permissions, and log group all still exist. Terraform then destroys +# the Lambda, waits (time_sleep), and finally removes the log group and +# IAM resources. This prevents the cleanup invocation from recreating a +# log group that was already destroyed. +resource "aws_lambda_invocation" "cleanup" { + function_name = aws_lambda_function.nat_zero.function_name + input = jsonencode({ action = "cleanup" }) + lifecycle_scope = "CRUD" + + depends_on = [ + aws_network_interface.nat_public_network_interface, + aws_network_interface.nat_private_network_interface, + aws_cloudwatch_log_group.nat_zero_logs, + aws_iam_role_policy.lambda_iam_policy, + ] +} diff --git a/outputs.tf b/outputs.tf new file mode 100644 index 0000000..4fade31 --- /dev/null +++ b/outputs.tf @@ -0,0 +1,34 @@ +output "lambda_function_arn" { + description = "ARN of the nat-zero Lambda function" + value = aws_lambda_function.nat_zero.arn +} + +output "lambda_function_name" { + description = "Name of the nat-zero Lambda function" + value = aws_lambda_function.nat_zero.function_name +} + +output "nat_security_group_ids" { + description = "Security group IDs for NAT instances (one per AZ)" + value = aws_security_group.nat_security_group[*].id +} + +output "nat_public_eni_ids" { + description = "Public ENI IDs for NAT instances (one per AZ)" + value = aws_network_interface.nat_public_network_interface[*].id +} + +output "nat_private_eni_ids" { + description = "Private ENI IDs for NAT instances (one per AZ)" + value = aws_network_interface.nat_private_network_interface[*].id +} + +output 
"launch_template_ids" { + description = "Launch template IDs for NAT instances (one per AZ)" + value = aws_launch_template.nat_launch_template[*].id +} + +output "eventbridge_rule_arn" { + description = "ARN of the EventBridge rule capturing EC2 state changes" + value = aws_cloudwatch_event_rule.ec2_state_change.arn +} diff --git a/release-please-config.json b/release-please-config.json new file mode 100644 index 0000000..1f107d5 --- /dev/null +++ b/release-please-config.json @@ -0,0 +1,15 @@ +{ + "$schema": "https://raw.githubusercontent.com/googleapis/release-please/main/schemas/config.json", + "packages": { + ".": { + "release-type": "terraform-module", + "changelog-sections": [ + { "type": "feat", "section": "Features" }, + { "type": "fix", "section": "Bug Fixes" }, + { "type": "perf", "section": "Performance" }, + { "type": "docs", "section": "Documentation" }, + { "type": "chore", "section": "Miscellaneous" } + ] + } + } +} diff --git a/tests/integration/fixture/main.tf b/tests/integration/fixture/main.tf new file mode 100644 index 0000000..2bb904e --- /dev/null +++ b/tests/integration/fixture/main.tf @@ -0,0 +1,95 @@ +terraform { + required_version = ">= 1.3" + + required_providers { + aws = { + source = "hashicorp/aws" + version = ">= 5.0" + } + } +} + +provider "aws" { + region = "us-east-1" +} + +# Use the default VPC and its subnets as public subnets — no VPC creation needed. +data "aws_vpc" "default" { + default = true +} + +data "aws_subnets" "default" { + filter { + name = "vpc-id" + values = [data.aws_vpc.default.id] + } + filter { + name = "default-for-az" + values = ["true"] + } +} + +data "aws_subnet" "public" { + id = data.aws_subnets.default.ids[0] +} + +# Only create a private subnet + route table — the minimum needed. 
+resource "aws_subnet" "private" { + vpc_id = data.aws_vpc.default.id + cidr_block = "172.31.128.0/24" + availability_zone = data.aws_subnet.public.availability_zone + + tags = { + Name = "nat-zero-test-private" + } +} + +resource "aws_route_table" "private" { + vpc_id = data.aws_vpc.default.id + + tags = { + Name = "nat-zero-test-private" + } +} + +resource "aws_route_table_association" "private" { + subnet_id = aws_subnet.private.id + route_table_id = aws_route_table.private.id +} + +variable "nat_instance_type" { + type = string + default = "t4g.nano" +} + +module "nat_zero" { + source = "../../../" + + name = "nat-test" + vpc_id = data.aws_vpc.default.id + availability_zones = [data.aws_subnet.public.availability_zone] + public_subnets = [data.aws_subnet.public.id] + private_subnets = [aws_subnet.private.id] + + private_route_table_ids = [aws_route_table.private.id] + private_subnets_cidr_blocks = [aws_subnet.private.cidr_block] + + instance_type = var.nat_instance_type + market_type = "on-demand" +} + +output "vpc_id" { + value = data.aws_vpc.default.id +} + +output "private_subnet_id" { + value = aws_subnet.private.id +} + +output "lambda_function_name" { + value = module.nat_zero.lambda_function_name +} + +output "nat_security_group_ids" { + value = module.nat_zero.nat_security_group_ids +} diff --git a/tests/integration/go.mod b/tests/integration/go.mod new file mode 100644 index 0000000..24738ad --- /dev/null +++ b/tests/integration/go.mod @@ -0,0 +1,59 @@ +module github.com/MachineDotDev/nat-zero/tests/integration + +go 1.22 + +require ( + github.com/aws/aws-sdk-go v1.55.5 + github.com/gruntwork-io/terratest v0.47.2 + github.com/stretchr/testify v1.9.0 +) + +require ( + cloud.google.com/go v0.110.0 // indirect + cloud.google.com/go/compute v1.19.1 // indirect + cloud.google.com/go/compute/metadata v0.2.3 // indirect + cloud.google.com/go/iam v0.13.0 // indirect + cloud.google.com/go/storage v1.28.1 // indirect + github.com/agext/levenshtein v1.2.3 // 
indirect + github.com/apparentlymart/go-textseg/v13 v13.0.0 // indirect + github.com/bgentry/go-netrc v0.0.0-20140422174119-9fd32a8b3d3d // indirect + github.com/davecgh/go-spew v1.1.1 // indirect + github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da // indirect + github.com/golang/protobuf v1.5.3 // indirect + github.com/google/go-cmp v0.5.9 // indirect + github.com/google/uuid v1.3.0 // indirect + github.com/googleapis/enterprise-certificate-proxy v0.2.3 // indirect + github.com/googleapis/gax-go/v2 v2.7.1 // indirect + github.com/hashicorp/errwrap v1.0.0 // indirect + github.com/hashicorp/go-cleanhttp v0.5.2 // indirect + github.com/hashicorp/go-getter v1.7.6 // indirect + github.com/hashicorp/go-multierror v1.1.0 // indirect + github.com/hashicorp/go-safetemp v1.0.0 // indirect + github.com/hashicorp/go-version v1.6.0 // indirect + github.com/hashicorp/hcl/v2 v2.9.1 // indirect + github.com/hashicorp/terraform-json v0.13.0 // indirect + github.com/jinzhu/copier v0.0.0-20190924061706-b57f9002281a // indirect + github.com/jmespath/go-jmespath v0.4.0 // indirect + github.com/klauspost/compress v1.15.11 // indirect + github.com/mattn/go-zglob v0.0.2-0.20190814121620-e3c945676326 // indirect + github.com/mitchellh/go-homedir v1.1.0 // indirect + github.com/mitchellh/go-testing-interface v1.14.1 // indirect + github.com/mitchellh/go-wordwrap v1.0.1 // indirect + github.com/pmezard/go-difflib v1.0.0 // indirect + github.com/tmccombs/hcl2json v0.3.3 // indirect + github.com/ulikunitz/xz v0.5.10 // indirect + github.com/zclconf/go-cty v1.9.1 // indirect + go.opencensus.io v0.24.0 // indirect + golang.org/x/crypto v0.21.0 // indirect + golang.org/x/net v0.23.0 // indirect + golang.org/x/oauth2 v0.8.0 // indirect + golang.org/x/sys v0.18.0 // indirect + golang.org/x/text v0.14.0 // indirect + golang.org/x/xerrors v0.0.0-20220907171357-04be3eba64a2 // indirect + google.golang.org/api v0.114.0 // indirect + google.golang.org/appengine v1.6.7 // indirect + 
google.golang.org/genproto v0.0.0-20230410155749-daa745c078e1 // indirect + google.golang.org/grpc v1.56.3 // indirect + google.golang.org/protobuf v1.33.0 // indirect + gopkg.in/yaml.v3 v3.0.1 // indirect +) diff --git a/tests/integration/go.sum b/tests/integration/go.sum new file mode 100644 index 0000000..be6e91f --- /dev/null +++ b/tests/integration/go.sum @@ -0,0 +1,974 @@ +cloud.google.com/go v0.26.0/go.mod h1:aQUYkXzVsufM+DwF1aE+0xfcU+56JwCaLick0ClmMTw= +cloud.google.com/go v0.34.0/go.mod h1:aQUYkXzVsufM+DwF1aE+0xfcU+56JwCaLick0ClmMTw= +cloud.google.com/go v0.38.0/go.mod h1:990N+gfupTy94rShfmMCWGDn0LpTmnzTp2qbd1dvSRU= +cloud.google.com/go v0.44.1/go.mod h1:iSa0KzasP4Uvy3f1mN/7PiObzGgflwredwwASm/v6AU= +cloud.google.com/go v0.44.2/go.mod h1:60680Gw3Yr4ikxnPRS/oxxkBccT6SA1yMk63TGekxKY= +cloud.google.com/go v0.45.1/go.mod h1:RpBamKRgapWJb87xiFSdk4g1CME7QZg3uwTez+TSTjc= +cloud.google.com/go v0.46.3/go.mod h1:a6bKKbmY7er1mI7TEI4lsAkts/mkhTSZK8w33B4RAg0= +cloud.google.com/go v0.50.0/go.mod h1:r9sluTvynVuxRIOHXQEHMFffphuXHOMZMycpNR5e6To= +cloud.google.com/go v0.52.0/go.mod h1:pXajvRH/6o3+F9jDHZWQ5PbGhn+o8w9qiu/CffaVdO4= +cloud.google.com/go v0.53.0/go.mod h1:fp/UouUEsRkN6ryDKNW/Upv/JBKnv6WDthjR6+vze6M= +cloud.google.com/go v0.54.0/go.mod h1:1rq2OEkV3YMf6n/9ZvGWI3GWw0VoqH/1x2nd8Is/bPc= +cloud.google.com/go v0.56.0/go.mod h1:jr7tqZxxKOVYizybht9+26Z/gUq7tiRzu+ACVAMbKVk= +cloud.google.com/go v0.57.0/go.mod h1:oXiQ6Rzq3RAkkY7N6t3TcE6jE+CIBBbA36lwQ1JyzZs= +cloud.google.com/go v0.62.0/go.mod h1:jmCYTdRCQuc1PHIIJ/maLInMho30T/Y0M4hTdTShOYc= +cloud.google.com/go v0.65.0/go.mod h1:O5N8zS7uWy9vkA9vayVHs65eM1ubvY4h553ofrNHObY= +cloud.google.com/go v0.72.0/go.mod h1:M+5Vjvlc2wnp6tjzE102Dw08nGShTscUx2nZMufOKPI= +cloud.google.com/go v0.74.0/go.mod h1:VV1xSbzvo+9QJOxLDaJfTjx5e+MePCpCWwvftOeQmWk= +cloud.google.com/go v0.78.0/go.mod h1:QjdrLG0uq+YwhjoVOLsS1t7TW8fs36kLs4XO5R5ECHg= +cloud.google.com/go v0.79.0/go.mod h1:3bzgcEeQlzbuEAYu4mrWhKqWjmpprinYgKJLgKHnbb8= +cloud.google.com/go 
v0.81.0/go.mod h1:mk/AM35KwGk/Nm2YSeZbxXdrNK3KZOYHmLkOqC2V6E0= +cloud.google.com/go v0.83.0/go.mod h1:Z7MJUsANfY0pYPdw0lbnivPx4/vhy/e2FEkSkF7vAVY= +cloud.google.com/go v0.84.0/go.mod h1:RazrYuxIK6Kb7YrzzhPoLmCVzl7Sup4NrbKPg8KHSUM= +cloud.google.com/go v0.87.0/go.mod h1:TpDYlFy7vuLzZMMZ+B6iRiELaY7z/gJPaqbMx6mlWcY= +cloud.google.com/go v0.90.0/go.mod h1:kRX0mNRHe0e2rC6oNakvwQqzyDmg57xJ+SZU1eT2aDQ= +cloud.google.com/go v0.93.3/go.mod h1:8utlLll2EF5XMAV15woO4lSbWQlk8rer9aLOfLh7+YI= +cloud.google.com/go v0.94.1/go.mod h1:qAlAugsXlC+JWO+Bke5vCtc9ONxjQT3drlTTnAplMW4= +cloud.google.com/go v0.97.0/go.mod h1:GF7l59pYBVlXQIBLx3a761cZ41F9bBH3JUlihCt2Udc= +cloud.google.com/go v0.99.0/go.mod h1:w0Xx2nLzqWJPuozYQX+hFfCSI8WioryfRDzkoI/Y2ZA= +cloud.google.com/go v0.100.2/go.mod h1:4Xra9TjzAeYHrl5+oeLlzbM2k3mjVhZh4UqTZ//w99A= +cloud.google.com/go v0.102.0/go.mod h1:oWcCzKlqJ5zgHQt9YsaeTY9KzIvjyy0ArmiBUgpQ+nc= +cloud.google.com/go v0.102.1/go.mod h1:XZ77E9qnTEnrgEOvr4xzfdX5TRo7fB4T2F4O6+34hIU= +cloud.google.com/go v0.104.0/go.mod h1:OO6xxXdJyvuJPcEPBLN9BJPD+jep5G1+2U5B5gkRYtA= +cloud.google.com/go v0.110.0 h1:Zc8gqp3+a9/Eyph2KDmcGaPtbKRIoqq4YTlL4NMD0Ys= +cloud.google.com/go v0.110.0/go.mod h1:SJnCLqQ0FCFGSZMUNUf84MV3Aia54kn7pi8st7tMzaY= +cloud.google.com/go/aiplatform v1.22.0/go.mod h1:ig5Nct50bZlzV6NvKaTwmplLLddFx0YReh9WfTO5jKw= +cloud.google.com/go/aiplatform v1.24.0/go.mod h1:67UUvRBKG6GTayHKV8DBv2RtR1t93YRu5B1P3x99mYY= +cloud.google.com/go/analytics v0.11.0/go.mod h1:DjEWCu41bVbYcKyvlws9Er60YE4a//bK6mnhWvQeFNI= +cloud.google.com/go/analytics v0.12.0/go.mod h1:gkfj9h6XRf9+TS4bmuhPEShsh3hH8PAZzm/41OOhQd4= +cloud.google.com/go/area120 v0.5.0/go.mod h1:DE/n4mp+iqVyvxHN41Vf1CR602GiHQjFPusMFW6bGR4= +cloud.google.com/go/area120 v0.6.0/go.mod h1:39yFJqWVgm0UZqWTOdqkLhjoC7uFfgXRC8g/ZegeAh0= +cloud.google.com/go/artifactregistry v1.6.0/go.mod h1:IYt0oBPSAGYj/kprzsBjZ/4LnG/zOcHyFHjWPCi6SAQ= +cloud.google.com/go/artifactregistry v1.7.0/go.mod h1:mqTOFOnGZx8EtSqK/ZWcsm/4U8B77rbcLP6ruDU2Ixk= 
+cloud.google.com/go/asset v1.5.0/go.mod h1:5mfs8UvcM5wHhqtSv8J1CtxxaQq3AdBxxQi2jGW/K4o= +cloud.google.com/go/asset v1.7.0/go.mod h1:YbENsRK4+xTiL+Ofoj5Ckf+O17kJtgp3Y3nn4uzZz5s= +cloud.google.com/go/asset v1.8.0/go.mod h1:mUNGKhiqIdbr8X7KNayoYvyc4HbbFO9URsjbytpUaW0= +cloud.google.com/go/assuredworkloads v1.5.0/go.mod h1:n8HOZ6pff6re5KYfBXcFvSViQjDwxFkAkmUFffJRbbY= +cloud.google.com/go/assuredworkloads v1.6.0/go.mod h1:yo2YOk37Yc89Rsd5QMVECvjaMKymF9OP+QXWlKXUkXw= +cloud.google.com/go/assuredworkloads v1.7.0/go.mod h1:z/736/oNmtGAyU47reJgGN+KVoYoxeLBoj4XkKYscNI= +cloud.google.com/go/automl v1.5.0/go.mod h1:34EjfoFGMZ5sgJ9EoLsRtdPSNZLcfflJR39VbVNS2M0= +cloud.google.com/go/automl v1.6.0/go.mod h1:ugf8a6Fx+zP0D59WLhqgTDsQI9w07o64uf/Is3Nh5p8= +cloud.google.com/go/bigquery v1.0.1/go.mod h1:i/xbL2UlR5RvWAURpBYZTtm/cXjCha9lbfbpx4poX+o= +cloud.google.com/go/bigquery v1.3.0/go.mod h1:PjpwJnslEMmckchkHFfq+HTD2DmtT67aNFKH1/VBDHE= +cloud.google.com/go/bigquery v1.4.0/go.mod h1:S8dzgnTigyfTmLBfrtrhyYhwRxG72rYxvftPBK2Dvzc= +cloud.google.com/go/bigquery v1.5.0/go.mod h1:snEHRnqQbz117VIFhE8bmtwIDY80NLUZUMb4Nv6dBIg= +cloud.google.com/go/bigquery v1.7.0/go.mod h1://okPTzCYNXSlb24MZs83e2Do+h+VXtc4gLoIoXIAPc= +cloud.google.com/go/bigquery v1.8.0/go.mod h1:J5hqkt3O0uAFnINi6JXValWIb1v0goeZM77hZzJN/fQ= +cloud.google.com/go/bigquery v1.42.0/go.mod h1:8dRTJxhtG+vwBKzE5OseQn/hiydoQN3EedCaOdYmxRA= +cloud.google.com/go/billing v1.4.0/go.mod h1:g9IdKBEFlItS8bTtlrZdVLWSSdSyFUZKXNS02zKMOZY= +cloud.google.com/go/billing v1.5.0/go.mod h1:mztb1tBc3QekhjSgmpf/CV4LzWXLzCArwpLmP2Gm88s= +cloud.google.com/go/binaryauthorization v1.1.0/go.mod h1:xwnoWu3Y84jbuHa0zd526MJYmtnVXn0syOjaJgy4+dM= +cloud.google.com/go/binaryauthorization v1.2.0/go.mod h1:86WKkJHtRcv5ViNABtYMhhNWRrD1Vpi//uKEy7aYEfI= +cloud.google.com/go/cloudtasks v1.5.0/go.mod h1:fD92REy1x5woxkKEkLdvavGnPJGEn8Uic9nWuLzqCpY= +cloud.google.com/go/cloudtasks v1.6.0/go.mod h1:C6Io+sxuke9/KNRkbQpihnW93SWDU3uXt92nu85HkYI= +cloud.google.com/go/compute 
v0.1.0/go.mod h1:GAesmwr110a34z04OlxYkATPBEfVhkymfTBXtfbBFow= +cloud.google.com/go/compute v1.3.0/go.mod h1:cCZiE1NHEtai4wiufUhW8I8S1JKkAnhnQJWM7YD99wM= +cloud.google.com/go/compute v1.5.0/go.mod h1:9SMHyhJlzhlkJqrPAc839t2BZFTSk6Jdj6mkzQJeu0M= +cloud.google.com/go/compute v1.6.0/go.mod h1:T29tfhtVbq1wvAPo0E3+7vhgmkOYeXjhFvz/FMzPu0s= +cloud.google.com/go/compute v1.6.1/go.mod h1:g85FgpzFvNULZ+S8AYq87axRKuf2Kh7deLqV/jJ3thU= +cloud.google.com/go/compute v1.7.0/go.mod h1:435lt8av5oL9P3fv1OEzSbSUe+ybHXGMPQHHZWZxy9U= +cloud.google.com/go/compute v1.10.0/go.mod h1:ER5CLbMxl90o2jtNbGSbtfOpQKR0t15FOtRsugnLrlU= +cloud.google.com/go/compute v1.19.1 h1:am86mquDUgjGNWxiGn+5PGLbmgiWXlE/yNWpIpNvuXY= +cloud.google.com/go/compute v1.19.1/go.mod h1:6ylj3a05WF8leseCdIf77NK0g1ey+nj5IKd5/kvShxE= +cloud.google.com/go/compute/metadata v0.2.3 h1:mg4jlk7mCAj6xXp9UJ4fjI9VUI5rubuGBW5aJ7UnBMY= +cloud.google.com/go/compute/metadata v0.2.3/go.mod h1:VAV5nSsACxMJvgaAuX6Pk2AawlZn8kiOGuCv6gTkwuA= +cloud.google.com/go/containeranalysis v0.5.1/go.mod h1:1D92jd8gRR/c0fGMlymRgxWD3Qw9C1ff6/T7mLgVL8I= +cloud.google.com/go/containeranalysis v0.6.0/go.mod h1:HEJoiEIu+lEXM+k7+qLCci0h33lX3ZqoYFdmPcoO7s4= +cloud.google.com/go/datacatalog v1.3.0/go.mod h1:g9svFY6tuR+j+hrTw3J2dNcmI0dzmSiyOzm8kpLq0a0= +cloud.google.com/go/datacatalog v1.5.0/go.mod h1:M7GPLNQeLfWqeIm3iuiruhPzkt65+Bx8dAKvScX8jvs= +cloud.google.com/go/datacatalog v1.6.0/go.mod h1:+aEyF8JKg+uXcIdAmmaMUmZ3q1b/lKLtXCmXdnc0lbc= +cloud.google.com/go/dataflow v0.6.0/go.mod h1:9QwV89cGoxjjSR9/r7eFDqqjtvbKxAK2BaYU6PVk9UM= +cloud.google.com/go/dataflow v0.7.0/go.mod h1:PX526vb4ijFMesO1o202EaUmouZKBpjHsTlCtB4parQ= +cloud.google.com/go/dataform v0.3.0/go.mod h1:cj8uNliRlHpa6L3yVhDOBrUXH+BPAO1+KFMQQNSThKo= +cloud.google.com/go/dataform v0.4.0/go.mod h1:fwV6Y4Ty2yIFL89huYlEkwUPtS7YZinZbzzj5S9FzCE= +cloud.google.com/go/datalabeling v0.5.0/go.mod h1:TGcJ0G2NzcsXSE/97yWjIZO0bXj0KbVlINXMG9ud42I= +cloud.google.com/go/datalabeling v0.6.0/go.mod 
h1:WqdISuk/+WIGeMkpw/1q7bK/tFEZxsrFJOJdY2bXvTQ= +cloud.google.com/go/dataqna v0.5.0/go.mod h1:90Hyk596ft3zUQ8NkFfvICSIfHFh1Bc7C4cK3vbhkeo= +cloud.google.com/go/dataqna v0.6.0/go.mod h1:1lqNpM7rqNLVgWBJyk5NF6Uen2PHym0jtVJonplVsDA= +cloud.google.com/go/datastore v1.0.0/go.mod h1:LXYbyblFSglQ5pkeyhO+Qmw7ukd3C+pD7TKLgZqpHYE= +cloud.google.com/go/datastore v1.1.0/go.mod h1:umbIZjpQpHh4hmRpGhH4tLFup+FVzqBi1b3c64qFpCk= +cloud.google.com/go/datastream v1.2.0/go.mod h1:i/uTP8/fZwgATHS/XFu0TcNUhuA0twZxxQ3EyCUQMwo= +cloud.google.com/go/datastream v1.3.0/go.mod h1:cqlOX8xlyYF/uxhiKn6Hbv6WjwPPuI9W2M9SAXwaLLQ= +cloud.google.com/go/dialogflow v1.15.0/go.mod h1:HbHDWs33WOGJgn6rfzBW1Kv807BE3O1+xGbn59zZWI4= +cloud.google.com/go/dialogflow v1.16.1/go.mod h1:po6LlzGfK+smoSmTBnbkIZY2w8ffjz/RcGSS+sh1el0= +cloud.google.com/go/dialogflow v1.17.0/go.mod h1:YNP09C/kXA1aZdBgC/VtXX74G/TKn7XVCcVumTflA+8= +cloud.google.com/go/documentai v1.7.0/go.mod h1:lJvftZB5NRiFSX4moiye1SMxHx0Bc3x1+p9e/RfXYiU= +cloud.google.com/go/documentai v1.8.0/go.mod h1:xGHNEB7CtsnySCNrCFdCyyMz44RhFEEX2Q7UD0c5IhU= +cloud.google.com/go/domains v0.6.0/go.mod h1:T9Rz3GasrpYk6mEGHh4rymIhjlnIuB4ofT1wTxDeT4Y= +cloud.google.com/go/domains v0.7.0/go.mod h1:PtZeqS1xjnXuRPKE/88Iru/LdfoRyEHYA9nFQf4UKpg= +cloud.google.com/go/edgecontainer v0.1.0/go.mod h1:WgkZ9tp10bFxqO8BLPqv2LlfmQF1X8lZqwW4r1BTajk= +cloud.google.com/go/edgecontainer v0.2.0/go.mod h1:RTmLijy+lGpQ7BXuTDa4C4ssxyXT34NIuHIgKuP4s5w= +cloud.google.com/go/functions v1.6.0/go.mod h1:3H1UA3qiIPRWD7PeZKLvHZ9SaQhR26XIJcC0A5GbvAk= +cloud.google.com/go/functions v1.7.0/go.mod h1:+d+QBcWM+RsrgZfV9xo6KfA1GlzJfxcfZcRPEhDDfzg= +cloud.google.com/go/gaming v1.5.0/go.mod h1:ol7rGcxP/qHTRQE/RO4bxkXq+Fix0j6D4LFPzYTIrDM= +cloud.google.com/go/gaming v1.6.0/go.mod h1:YMU1GEvA39Qt3zWGyAVA9bpYz/yAhTvaQ1t2sK4KPUA= +cloud.google.com/go/gkeconnect v0.5.0/go.mod h1:c5lsNAg5EwAy7fkqX/+goqFsU1Da/jQFqArp+wGNr/o= +cloud.google.com/go/gkeconnect v0.6.0/go.mod 
h1:Mln67KyU/sHJEBY8kFZ0xTeyPtzbq9StAVvEULYK16A= +cloud.google.com/go/gkehub v0.9.0/go.mod h1:WYHN6WG8w9bXU0hqNxt8rm5uxnk8IH+lPY9J2TV7BK0= +cloud.google.com/go/gkehub v0.10.0/go.mod h1:UIPwxI0DsrpsVoWpLB0stwKCP+WFVG9+y977wO+hBH0= +cloud.google.com/go/grafeas v0.2.0/go.mod h1:KhxgtF2hb0P191HlY5besjYm6MqTSTj3LSI+M+ByZHc= +cloud.google.com/go/iam v0.3.0/go.mod h1:XzJPvDayI+9zsASAFO68Hk07u3z+f+JrT2xXNdp4bnY= +cloud.google.com/go/iam v0.5.0/go.mod h1:wPU9Vt0P4UmCux7mqtRu6jcpPAb74cP1fh50J3QpkUc= +cloud.google.com/go/iam v0.13.0 h1:+CmB+K0J/33d0zSQ9SlFWUeCCEn5XJA0ZMZ3pHE9u8k= +cloud.google.com/go/iam v0.13.0/go.mod h1:ljOg+rcNfzZ5d6f1nAUJ8ZIxOaZUVoS14bKCtaLZ/D0= +cloud.google.com/go/language v1.4.0/go.mod h1:F9dRpNFQmJbkaop6g0JhSBXCNlO90e1KWx5iDdxbWic= +cloud.google.com/go/language v1.6.0/go.mod h1:6dJ8t3B+lUYfStgls25GusK04NLh3eDLQnWM3mdEbhI= +cloud.google.com/go/lifesciences v0.5.0/go.mod h1:3oIKy8ycWGPUyZDR/8RNnTOYevhaMLqh5vLUXs9zvT8= +cloud.google.com/go/lifesciences v0.6.0/go.mod h1:ddj6tSX/7BOnhxCSd3ZcETvtNr8NZ6t/iPhY2Tyfu08= +cloud.google.com/go/longrunning v0.4.1 h1:v+yFJOfKC3yZdY6ZUI933pIYdhyhV8S3NpWrXWmg7jM= +cloud.google.com/go/longrunning v0.4.1/go.mod h1:4iWDqhBZ70CvZ6BfETbvam3T8FMvLK+eFj0E6AaRQTo= +cloud.google.com/go/mediatranslation v0.5.0/go.mod h1:jGPUhGTybqsPQn91pNXw0xVHfuJ3leR1wj37oU3y1f4= +cloud.google.com/go/mediatranslation v0.6.0/go.mod h1:hHdBCTYNigsBxshbznuIMFNe5QXEowAuNmmC7h8pu5w= +cloud.google.com/go/memcache v1.4.0/go.mod h1:rTOfiGZtJX1AaFUrOgsMHX5kAzaTQ8azHiuDoTPzNsE= +cloud.google.com/go/memcache v1.5.0/go.mod h1:dk3fCK7dVo0cUU2c36jKb4VqKPS22BTkf81Xq617aWM= +cloud.google.com/go/metastore v1.5.0/go.mod h1:2ZNrDcQwghfdtCwJ33nM0+GrBGlVuh8rakL3vdPY3XY= +cloud.google.com/go/metastore v1.6.0/go.mod h1:6cyQTls8CWXzk45G55x57DVQ9gWg7RiH65+YgPsNh9s= +cloud.google.com/go/networkconnectivity v1.4.0/go.mod h1:nOl7YL8odKyAOtzNX73/M5/mGZgqqMeryi6UPZTk/rA= +cloud.google.com/go/networkconnectivity v1.5.0/go.mod h1:3GzqJx7uhtlM3kln0+x5wyFvuVH1pIBJjhCpjzSt75o= 
+cloud.google.com/go/networksecurity v0.5.0/go.mod h1:xS6fOCoqpVC5zx15Z/MqkfDwH4+m/61A3ODiDV1xmiQ= +cloud.google.com/go/networksecurity v0.6.0/go.mod h1:Q5fjhTr9WMI5mbpRYEbiexTzROf7ZbDzvzCrNl14nyU= +cloud.google.com/go/notebooks v1.2.0/go.mod h1:9+wtppMfVPUeJ8fIWPOq1UnATHISkGXGqTkxeieQ6UY= +cloud.google.com/go/notebooks v1.3.0/go.mod h1:bFR5lj07DtCPC7YAAJ//vHskFBxA5JzYlH68kXVdk34= +cloud.google.com/go/osconfig v1.7.0/go.mod h1:oVHeCeZELfJP7XLxcBGTMBvRO+1nQ5tFG9VQTmYS2Fs= +cloud.google.com/go/osconfig v1.8.0/go.mod h1:EQqZLu5w5XA7eKizepumcvWx+m8mJUhEwiPqWiZeEdg= +cloud.google.com/go/oslogin v1.4.0/go.mod h1:YdgMXWRaElXz/lDk1Na6Fh5orF7gvmJ0FGLIs9LId4E= +cloud.google.com/go/oslogin v1.5.0/go.mod h1:D260Qj11W2qx/HVF29zBg+0fd6YCSjSqLUkY/qEenQU= +cloud.google.com/go/phishingprotection v0.5.0/go.mod h1:Y3HZknsK9bc9dMi+oE8Bim0lczMU6hrX0UpADuMefr0= +cloud.google.com/go/phishingprotection v0.6.0/go.mod h1:9Y3LBLgy0kDTcYET8ZH3bq/7qni15yVUoAxiFxnlSUA= +cloud.google.com/go/privatecatalog v0.5.0/go.mod h1:XgosMUvvPyxDjAVNDYxJ7wBW8//hLDDYmnsNcMGq1K0= +cloud.google.com/go/privatecatalog v0.6.0/go.mod h1:i/fbkZR0hLN29eEWiiwue8Pb+GforiEIBnV9yrRUOKI= +cloud.google.com/go/pubsub v1.0.1/go.mod h1:R0Gpsv3s54REJCy4fxDixWD93lHJMoZTyQ2kNxGRt3I= +cloud.google.com/go/pubsub v1.1.0/go.mod h1:EwwdRX2sKPjnvnqCa270oGRyludottCI76h+R3AArQw= +cloud.google.com/go/pubsub v1.2.0/go.mod h1:jhfEVHT8odbXTkndysNHCcx0awwzvfOlguIAii9o8iA= +cloud.google.com/go/pubsub v1.3.1/go.mod h1:i+ucay31+CNRpDW4Lu78I4xXG+O1r/MAHgjpRVR+TSU= +cloud.google.com/go/recaptchaenterprise v1.3.1/go.mod h1:OdD+q+y4XGeAlxRaMn1Y7/GveP6zmq76byL6tjPE7d4= +cloud.google.com/go/recaptchaenterprise/v2 v2.1.0/go.mod h1:w9yVqajwroDNTfGuhmOjPDN//rZGySaf6PtFVcSCa7o= +cloud.google.com/go/recaptchaenterprise/v2 v2.2.0/go.mod h1:/Zu5jisWGeERrd5HnlS3EUGb/D335f9k51B/FVil0jk= +cloud.google.com/go/recaptchaenterprise/v2 v2.3.0/go.mod h1:O9LwGCjrhGHBQET5CA7dd5NwwNQUErSgEDit1DLNTdo= +cloud.google.com/go/recommendationengine v0.5.0/go.mod 
h1:E5756pJcVFeVgaQv3WNpImkFP8a+RptV6dDLGPILjvg= +cloud.google.com/go/recommendationengine v0.6.0/go.mod h1:08mq2umu9oIqc7tDy8sx+MNJdLG0fUi3vaSVbztHgJ4= +cloud.google.com/go/recommender v1.5.0/go.mod h1:jdoeiBIVrJe9gQjwd759ecLJbxCDED4A6p+mqoqDvTg= +cloud.google.com/go/recommender v1.6.0/go.mod h1:+yETpm25mcoiECKh9DEScGzIRyDKpZ0cEhWGo+8bo+c= +cloud.google.com/go/redis v1.7.0/go.mod h1:V3x5Jq1jzUcg+UNsRvdmsfuFnit1cfe3Z/PGyq/lm4Y= +cloud.google.com/go/redis v1.8.0/go.mod h1:Fm2szCDavWzBk2cDKxrkmWBqoCiL1+Ctwq7EyqBCA/A= +cloud.google.com/go/retail v1.8.0/go.mod h1:QblKS8waDmNUhghY2TI9O3JLlFk8jybHeV4BF19FrE4= +cloud.google.com/go/retail v1.9.0/go.mod h1:g6jb6mKuCS1QKnH/dpu7isX253absFl6iE92nHwlBUY= +cloud.google.com/go/scheduler v1.4.0/go.mod h1:drcJBmxF3aqZJRhmkHQ9b3uSSpQoltBPGPxGAWROx6s= +cloud.google.com/go/scheduler v1.5.0/go.mod h1:ri073ym49NW3AfT6DZi21vLZrG07GXr5p3H1KxN5QlI= +cloud.google.com/go/secretmanager v1.6.0/go.mod h1:awVa/OXF6IiyaU1wQ34inzQNc4ISIDIrId8qE5QGgKA= +cloud.google.com/go/security v1.5.0/go.mod h1:lgxGdyOKKjHL4YG3/YwIL2zLqMFCKs0UbQwgyZmfJl4= +cloud.google.com/go/security v1.7.0/go.mod h1:mZklORHl6Bg7CNnnjLH//0UlAlaXqiG7Lb9PsPXLfD0= +cloud.google.com/go/security v1.8.0/go.mod h1:hAQOwgmaHhztFhiQ41CjDODdWP0+AE1B3sX4OFlq+GU= +cloud.google.com/go/securitycenter v1.13.0/go.mod h1:cv5qNAqjY84FCN6Y9z28WlkKXyWsgLO832YiWwkCWcU= +cloud.google.com/go/securitycenter v1.14.0/go.mod h1:gZLAhtyKv85n52XYWt6RmeBdydyxfPeTrpToDPw4Auc= +cloud.google.com/go/servicedirectory v1.4.0/go.mod h1:gH1MUaZCgtP7qQiI+F+A+OpeKF/HQWgtAddhTbhL2bs= +cloud.google.com/go/servicedirectory v1.5.0/go.mod h1:QMKFL0NUySbpZJ1UZs3oFAmdvVxhhxB6eJ/Vlp73dfg= +cloud.google.com/go/speech v1.6.0/go.mod h1:79tcr4FHCimOp56lwC01xnt/WPJZc4v3gzyT7FoBkCM= +cloud.google.com/go/speech v1.7.0/go.mod h1:KptqL+BAQIhMsj1kOP2la5DSEEerPDuOP/2mmkhHhZQ= +cloud.google.com/go/storage v1.0.0/go.mod h1:IhtSnM/ZTZV8YYJWCY8RULGVqBDmpoyjwiyrjsg+URw= +cloud.google.com/go/storage v1.5.0/go.mod 
h1:tpKbwo567HUNpVclU5sGELwQWBDZ8gh0ZeosJ0Rtdos= +cloud.google.com/go/storage v1.6.0/go.mod h1:N7U0C8pVQ/+NIKOBQyamJIeKQKkZ+mxpohlUTyfDhBk= +cloud.google.com/go/storage v1.8.0/go.mod h1:Wv1Oy7z6Yz3DshWRJFhqM/UCfaWIRTdp0RXyy7KQOVs= +cloud.google.com/go/storage v1.10.0/go.mod h1:FLPqc6j+Ki4BU591ie1oL6qBQGu2Bl/tZ9ullr3+Kg0= +cloud.google.com/go/storage v1.22.1/go.mod h1:S8N1cAStu7BOeFfE8KAQzmyyLkK8p/vmRq6kuBTW58Y= +cloud.google.com/go/storage v1.23.0/go.mod h1:vOEEDNFnciUMhBeT6hsJIn3ieU5cFRmzeLgDvXzfIXc= +cloud.google.com/go/storage v1.27.0/go.mod h1:x9DOL8TK/ygDUMieqwfhdpQryTeEkhGKMi80i/iqR2s= +cloud.google.com/go/storage v1.28.1 h1:F5QDG5ChchaAVQhINh24U99OWHURqrW8OmQcGKXcbgI= +cloud.google.com/go/storage v1.28.1/go.mod h1:Qnisd4CqDdo6BGs2AD5LLnEsmSQ80wQ5ogcBBKhU86Y= +cloud.google.com/go/talent v1.1.0/go.mod h1:Vl4pt9jiHKvOgF9KoZo6Kob9oV4lwd/ZD5Cto54zDRw= +cloud.google.com/go/talent v1.2.0/go.mod h1:MoNF9bhFQbiJ6eFD3uSsg0uBALw4n4gaCaEjBw9zo8g= +cloud.google.com/go/videointelligence v1.6.0/go.mod h1:w0DIDlVRKtwPCn/C4iwZIJdvC69yInhW0cfi+p546uU= +cloud.google.com/go/videointelligence v1.7.0/go.mod h1:k8pI/1wAhjznARtVT9U1llUaFNPh7muw8QyOUpavru4= +cloud.google.com/go/vision v1.2.0/go.mod h1:SmNwgObm5DpFBme2xpyOyasvBc1aPdjvMk2bBk0tKD0= +cloud.google.com/go/vision/v2 v2.2.0/go.mod h1:uCdV4PpN1S0jyCyq8sIM42v2Y6zOLkZs+4R9LrGYwFo= +cloud.google.com/go/vision/v2 v2.3.0/go.mod h1:UO61abBx9QRMFkNBbf1D8B1LXdS2cGiiCRx0vSpZoUo= +cloud.google.com/go/webrisk v1.4.0/go.mod h1:Hn8X6Zr+ziE2aNd8SliSDWpEnSS1u4R9+xXZmFiHmGE= +cloud.google.com/go/webrisk v1.5.0/go.mod h1:iPG6fr52Tv7sGk0H6qUFzmL3HHZev1htXuWDEEsqMTg= +cloud.google.com/go/workflows v1.6.0/go.mod h1:6t9F5h/unJz41YqfBmqSASJSXccBLtD1Vwf+KmJENM0= +cloud.google.com/go/workflows v1.7.0/go.mod h1:JhSrZuVZWuiDfKEFxU0/F1PQjmpnpcoISEXH2bcHC3M= +dmitri.shuralyov.com/gpu/mtl v0.0.0-20190408044501-666a987793e9/go.mod h1:H6x//7gZCb22OMCxBHrMx7a5I7Hp++hsVxbQ4BYO7hU= +github.com/BurntSushi/toml v0.3.1/go.mod 
h1:xHWCNGjB5oqiDr8zfno3MHue2Ht5sIBksp03qcyfWMU= +github.com/BurntSushi/xgb v0.0.0-20160522181843-27f122750802/go.mod h1:IVnqGOEym/WlBOVXweHU+Q+/VP0lqqI8lqeDx9IjBqo= +github.com/OneOfOne/xxhash v1.2.2/go.mod h1:HSdplMjZKSmBqAxg5vPj2TmRDmfkzw+cTzAElWljhcU= +github.com/agext/levenshtein v1.2.1/go.mod h1:JEDfjyjHDjOF/1e4FlBE/PkbqA9OfWu2ki2W0IB5558= +github.com/agext/levenshtein v1.2.3 h1:YB2fHEn0UJagG8T1rrWknE3ZQzWM06O8AMAatNn7lmo= +github.com/agext/levenshtein v1.2.3/go.mod h1:JEDfjyjHDjOF/1e4FlBE/PkbqA9OfWu2ki2W0IB5558= +github.com/antihax/optional v1.0.0/go.mod h1:uupD/76wgC+ih3iEmQUL+0Ugr19nfwCT1kdvxnR2qWY= +github.com/apparentlymart/go-dump v0.0.0-20180507223929-23540a00eaa3/go.mod h1:oL81AME2rN47vu18xqj1S1jPIPuN7afo62yKTNn3XMM= +github.com/apparentlymart/go-textseg v1.0.0/go.mod h1:z96Txxhf3xSFMPmb5X/1W05FF/Nj9VFpLOpjS5yuumk= +github.com/apparentlymart/go-textseg/v13 v13.0.0 h1:Y+KvPE1NYz0xl601PVImeQfFyEy6iT90AvPUL1NNfNw= +github.com/apparentlymart/go-textseg/v13 v13.0.0/go.mod h1:ZK2fH7c4NqDTLtiYLvIkEghdlcqw7yxLeM89kiTRPUo= +github.com/aws/aws-sdk-go v1.44.122/go.mod h1:y4AeaBuwd2Lk+GepC1E9v0qOiTws0MIWAX4oIKwKHZo= +github.com/aws/aws-sdk-go v1.55.5 h1:KKUZBfBoyqy5d3swXyiC7Q76ic40rYcbqH7qjh59kzU= +github.com/aws/aws-sdk-go v1.55.5/go.mod h1:eRwEWoyTWFMVYVQzKMNHWP5/RV4xIUGMQfXQHfHkpNU= +github.com/bgentry/go-netrc v0.0.0-20140422174119-9fd32a8b3d3d h1:xDfNPAt8lFiC1UJrqV3uuy861HCTo708pDMbjHHdCas= +github.com/bgentry/go-netrc v0.0.0-20140422174119-9fd32a8b3d3d/go.mod h1:6QX/PXZ00z/TKoufEY6K/a0k6AhaJrQKdFe6OfVXsa4= +github.com/census-instrumentation/opencensus-proto v0.2.1/go.mod h1:f6KPmirojxKA12rnyqOA5BBL4O983OfeGPqjHWSTneU= +github.com/cespare/xxhash v1.1.0/go.mod h1:XrSqR1VqqWfGrhpAt58auRo0WTKS1nRRg3ghfAqPWnc= +github.com/cespare/xxhash/v2 v2.1.1/go.mod h1:VGX0DQ3Q6kWi7AoAeZDth3/j3BFtOZR5XLFGgcrjCOs= +github.com/cheggaaa/pb v1.0.27/go.mod h1:pQciLPpbU0oxA0h+VJYYLxO+XeDQb5pZijXscXHm81s= +github.com/chzyer/logex v1.1.10/go.mod 
h1:+Ywpsq7O8HXn0nuIou7OrIPyXbp3wmkHB+jjWRnGsAI= +github.com/chzyer/readline v0.0.0-20180603132655-2972be24d48e/go.mod h1:nSuG5e5PlCu98SY8svDHJxuZscDgtXS6KTTbou5AhLI= +github.com/chzyer/test v0.0.0-20180213035817-a1ea475d72b1/go.mod h1:Q3SI9o4m/ZMnBNeIyt5eFwwo7qiLfzFZmjNmxjkiQlU= +github.com/client9/misspell v0.3.4/go.mod h1:qj6jICC3Q7zFZvVWo7KLAzC3yx5G7kyvSDkc90ppPyw= +github.com/cncf/udpa/go v0.0.0-20191209042840-269d4d468f6f/go.mod h1:M8M6+tZqaGXZJjfX53e64911xZQV5JYwmTeXPW+k8Sc= +github.com/cncf/udpa/go v0.0.0-20200629203442-efcf912fb354/go.mod h1:WmhPx2Nbnhtbo57+VJT5O0JRkEi1Wbu0z5j0R8u5Hbk= +github.com/cncf/udpa/go v0.0.0-20201120205902-5459f2c99403/go.mod h1:WmhPx2Nbnhtbo57+VJT5O0JRkEi1Wbu0z5j0R8u5Hbk= +github.com/cncf/udpa/go v0.0.0-20210930031921-04548b0d99d4/go.mod h1:6pvJx4me5XPnfI9Z40ddWsdw2W/uZgQLFXToKeRcDiI= +github.com/cncf/xds/go v0.0.0-20210312221358-fbca930ec8ed/go.mod h1:eXthEFrGJvWHgFFCl3hGmgk+/aYT6PnTQLykKQRLhEs= +github.com/cncf/xds/go v0.0.0-20210805033703-aa0b78936158/go.mod h1:eXthEFrGJvWHgFFCl3hGmgk+/aYT6PnTQLykKQRLhEs= +github.com/cncf/xds/go v0.0.0-20210922020428-25de7278fc84/go.mod h1:eXthEFrGJvWHgFFCl3hGmgk+/aYT6PnTQLykKQRLhEs= +github.com/cncf/xds/go v0.0.0-20211001041855-01bcc9b48dfe/go.mod h1:eXthEFrGJvWHgFFCl3hGmgk+/aYT6PnTQLykKQRLhEs= +github.com/cncf/xds/go v0.0.0-20211011173535-cb28da3451f1/go.mod h1:eXthEFrGJvWHgFFCl3hGmgk+/aYT6PnTQLykKQRLhEs= +github.com/creack/pty v1.1.9/go.mod h1:oKZEueFk5CKHvIhNR5MUki03XCEU+Q6VDXinZuGJ33E= +github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= +github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c= +github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= +github.com/envoyproxy/go-control-plane v0.9.0/go.mod h1:YTl/9mNaCwkRvm6d1a2C3ymFceY/DCBVvsKhRF0iEA4= +github.com/envoyproxy/go-control-plane v0.9.1-0.20191026205805-5f8ba28d4473/go.mod h1:YTl/9mNaCwkRvm6d1a2C3ymFceY/DCBVvsKhRF0iEA4= 
+github.com/envoyproxy/go-control-plane v0.9.4/go.mod h1:6rpuAdCZL397s3pYoYcLgu1mIlRU8Am5FuJP05cCM98= +github.com/envoyproxy/go-control-plane v0.9.7/go.mod h1:cwu0lG7PUMfa9snN8LXBig5ynNVH9qI8YYLbd1fK2po= +github.com/envoyproxy/go-control-plane v0.9.9-0.20201210154907-fd9021fe5dad/go.mod h1:cXg6YxExXjJnVBQHBLXeUAgxn2UodCpnH306RInaBQk= +github.com/envoyproxy/go-control-plane v0.9.9-0.20210217033140-668b12f5399d/go.mod h1:cXg6YxExXjJnVBQHBLXeUAgxn2UodCpnH306RInaBQk= +github.com/envoyproxy/go-control-plane v0.9.9-0.20210512163311-63b5d3c536b0/go.mod h1:hliV/p42l8fGbc6Y9bQ70uLwIvmJyVE5k4iMKlh8wCQ= +github.com/envoyproxy/go-control-plane v0.9.10-0.20210907150352-cf90f659a021/go.mod h1:AFq3mo9L8Lqqiid3OhADV3RfLJnjiw63cSpi+fDTRC0= +github.com/envoyproxy/go-control-plane v0.10.2-0.20220325020618-49ff273808a1/go.mod h1:KJwIaB5Mv44NWtYuAOFCVOjcI94vtpEz2JU/D2v6IjE= +github.com/envoyproxy/protoc-gen-validate v0.1.0/go.mod h1:iSmxcyjqTsJpI2R4NaDN7+kN2VEUnK/pcBlmesArF7c= +github.com/fatih/color v1.7.0/go.mod h1:Zm6kSWBoL9eyXnKyktHP6abPY2pDugNf5KwzbycvMj4= +github.com/ghodss/yaml v1.0.0/go.mod h1:4dBDuWmgqj2HViK6kFavaiC9ZROes6MMH2rRYeMEF04= +github.com/go-gl/glfw v0.0.0-20190409004039-e6da0acd62b1/go.mod h1:vR7hzQXu2zJy9AVAgeJqvqgH9Q5CA+iKCZ2gyEVpxRU= +github.com/go-gl/glfw/v3.3/glfw v0.0.0-20191125211704-12ad95a8df72/go.mod h1:tQ2UAYgL5IevRw8kRxooKSPJfGvJ9fJQFa0TUsXzTg8= +github.com/go-gl/glfw/v3.3/glfw v0.0.0-20200222043503-6f7a984d4dc4/go.mod h1:tQ2UAYgL5IevRw8kRxooKSPJfGvJ9fJQFa0TUsXzTg8= +github.com/go-test/deep v1.0.3/go.mod h1:wGDj63lr65AM2AQyKZd/NYHGb0R+1RLqB8NKt3aSFNA= +github.com/go-test/deep v1.0.7 h1:/VSMRlnY/JSyqxQUzQLKVMAskpY/NZKFA5j2P+0pP2M= +github.com/go-test/deep v1.0.7/go.mod h1:QV8Hv/iy04NyLBxAdO9njL0iVPN1S4d/A3NVv1V36o8= +github.com/golang/glog v0.0.0-20160126235308-23def4e6c14b/go.mod h1:SBH7ygxi8pfUlaOkMMuAQtPIUF8ecWP5IEl/CR7VP2Q= +github.com/golang/groupcache v0.0.0-20190702054246-869f871628b6/go.mod h1:cIg4eruTrX1D+g88fzRXU5OdNfaM+9IcxsU14FzY7Hc= 
+github.com/golang/groupcache v0.0.0-20191227052852-215e87163ea7/go.mod h1:cIg4eruTrX1D+g88fzRXU5OdNfaM+9IcxsU14FzY7Hc= +github.com/golang/groupcache v0.0.0-20200121045136-8c9f03a8e57e/go.mod h1:cIg4eruTrX1D+g88fzRXU5OdNfaM+9IcxsU14FzY7Hc= +github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da h1:oI5xCqsCo564l8iNU+DwB5epxmsaqB+rhGL0m5jtYqE= +github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da/go.mod h1:cIg4eruTrX1D+g88fzRXU5OdNfaM+9IcxsU14FzY7Hc= +github.com/golang/mock v1.1.1/go.mod h1:oTYuIxOrZwtPieC+H1uAHpcLFnEyAGVDL/k47Jfbm0A= +github.com/golang/mock v1.2.0/go.mod h1:oTYuIxOrZwtPieC+H1uAHpcLFnEyAGVDL/k47Jfbm0A= +github.com/golang/mock v1.3.1/go.mod h1:sBzyDLLjw3U8JLTeZvSv8jJB+tU5PVekmnlKIyFUx0Y= +github.com/golang/mock v1.4.0/go.mod h1:UOMv5ysSaYNkG+OFQykRIcU/QvvxJf3p21QfJ2Bt3cw= +github.com/golang/mock v1.4.1/go.mod h1:UOMv5ysSaYNkG+OFQykRIcU/QvvxJf3p21QfJ2Bt3cw= +github.com/golang/mock v1.4.3/go.mod h1:UOMv5ysSaYNkG+OFQykRIcU/QvvxJf3p21QfJ2Bt3cw= +github.com/golang/mock v1.4.4/go.mod h1:l3mdAwkq5BuhzHwde/uurv3sEJeZMXNpwsxVWU71h+4= +github.com/golang/mock v1.5.0/go.mod h1:CWnOUgYIOo4TcNZ0wHX3YZCqsaM1I1Jvs6v3mP3KVu8= +github.com/golang/mock v1.6.0/go.mod h1:p6yTPP+5HYm5mzsMV8JkE6ZKdX+/wYM6Hr+LicevLPs= +github.com/golang/protobuf v1.1.0/go.mod h1:6lQm79b+lXiMfvg/cZm0SGofjICqVBUtrP5yJMmIC1U= +github.com/golang/protobuf v1.2.0/go.mod h1:6lQm79b+lXiMfvg/cZm0SGofjICqVBUtrP5yJMmIC1U= +github.com/golang/protobuf v1.3.1/go.mod h1:6lQm79b+lXiMfvg/cZm0SGofjICqVBUtrP5yJMmIC1U= +github.com/golang/protobuf v1.3.2/go.mod h1:6lQm79b+lXiMfvg/cZm0SGofjICqVBUtrP5yJMmIC1U= +github.com/golang/protobuf v1.3.3/go.mod h1:vzj43D7+SQXF/4pzW/hwtAqwc6iTitCiVSaWz5lYuqw= +github.com/golang/protobuf v1.3.4/go.mod h1:vzj43D7+SQXF/4pzW/hwtAqwc6iTitCiVSaWz5lYuqw= +github.com/golang/protobuf v1.3.5/go.mod h1:6O5/vntMXwX2lRkT1hjjk0nAC1IDOTvTlVgjlRvqsdk= +github.com/golang/protobuf v1.4.0-rc.1/go.mod h1:ceaxUfeHdC40wWswd/P6IGgMaK3YpKi5j83Wpe3EHw8= +github.com/golang/protobuf 
v1.4.0-rc.1.0.20200221234624-67d41d38c208/go.mod h1:xKAWHe0F5eneWXFV3EuXVDTCmh+JuBKY0li0aMyXATA= +github.com/golang/protobuf v1.4.0-rc.2/go.mod h1:LlEzMj4AhA7rCAGe4KMBDvJI+AwstrUpVNzEA03Pprs= +github.com/golang/protobuf v1.4.0-rc.4.0.20200313231945-b860323f09d0/go.mod h1:WU3c8KckQ9AFe+yFwt9sWVRKCVIyN9cPHBJSNnbL67w= +github.com/golang/protobuf v1.4.0/go.mod h1:jodUvKwWbYaEsadDk5Fwe5c77LiNKVO9IDvqG2KuDX0= +github.com/golang/protobuf v1.4.1/go.mod h1:U8fpvMrcmy5pZrNK1lt4xCsGvpyWQ/VVv6QDs8UjoX8= +github.com/golang/protobuf v1.4.2/go.mod h1:oDoupMAO8OvCJWAcko0GGGIgR6R6ocIYbsSw735rRwI= +github.com/golang/protobuf v1.4.3/go.mod h1:oDoupMAO8OvCJWAcko0GGGIgR6R6ocIYbsSw735rRwI= +github.com/golang/protobuf v1.5.0/go.mod h1:FsONVRAS9T7sI+LIUmWTfcYkHO4aIWwzhcaSAoJOfIk= +github.com/golang/protobuf v1.5.1/go.mod h1:DopwsBzvsk0Fs44TXzsVbJyPhcCPeIwnvohx4u74HPM= +github.com/golang/protobuf v1.5.2/go.mod h1:XVQd3VNwM+JqD3oG2Ue2ip4fOMUkwXdXDdiuN0vRsmY= +github.com/golang/protobuf v1.5.3 h1:KhyjKVUg7Usr/dYsdSqoFveMYd5ko72D+zANwlG1mmg= +github.com/golang/protobuf v1.5.3/go.mod h1:XVQd3VNwM+JqD3oG2Ue2ip4fOMUkwXdXDdiuN0vRsmY= +github.com/golang/snappy v0.0.3/go.mod h1:/XxbfmMg8lxefKM7IXC3fBNl/7bRcc72aCRzEWrmP2Q= +github.com/google/btree v0.0.0-20180813153112-4030bb1f1f0c/go.mod h1:lNA+9X1NB3Zf8V7Ke586lFgjr2dZNuvo3lPJSGZ5JPQ= +github.com/google/btree v1.0.0/go.mod h1:lNA+9X1NB3Zf8V7Ke586lFgjr2dZNuvo3lPJSGZ5JPQ= +github.com/google/go-cmp v0.2.0/go.mod h1:oXzfMopK8JAjlY9xF4vHSVASa0yLyX7SntLO5aqRK0M= +github.com/google/go-cmp v0.3.0/go.mod h1:8QqcDgzrUqlUb/G2PQTWiueGozuR1884gddMywk6iLU= +github.com/google/go-cmp v0.3.1/go.mod h1:8QqcDgzrUqlUb/G2PQTWiueGozuR1884gddMywk6iLU= +github.com/google/go-cmp v0.4.0/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE= +github.com/google/go-cmp v0.4.1/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE= +github.com/google/go-cmp v0.5.0/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE= +github.com/google/go-cmp v0.5.1/go.mod 
h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE= +github.com/google/go-cmp v0.5.2/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE= +github.com/google/go-cmp v0.5.3/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE= +github.com/google/go-cmp v0.5.4/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE= +github.com/google/go-cmp v0.5.5/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE= +github.com/google/go-cmp v0.5.6/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE= +github.com/google/go-cmp v0.5.7/go.mod h1:n+brtR0CgQNWTVd5ZUFpTBC8YFBDLK/h/bpaJ8/DtOE= +github.com/google/go-cmp v0.5.8/go.mod h1:17dUlkBOakJ0+DkrSSNjCkIjxS6bF9zb3elmeNGIjoY= +github.com/google/go-cmp v0.5.9 h1:O2Tfq5qg4qc4AmwVlvv0oLiVAGB7enBSJ2x2DqQFi38= +github.com/google/go-cmp v0.5.9/go.mod h1:17dUlkBOakJ0+DkrSSNjCkIjxS6bF9zb3elmeNGIjoY= +github.com/google/martian v2.1.0+incompatible h1:/CP5g8u/VJHijgedC/Legn3BAbAaWPgecwXBIDzw5no= +github.com/google/martian v2.1.0+incompatible/go.mod h1:9I4somxYTbIHy5NJKHRl3wXiIaQGbYVAs8BPL6v8lEs= +github.com/google/martian/v3 v3.0.0/go.mod h1:y5Zk1BBys9G+gd6Jrk0W3cC1+ELVxBWuIGO+w/tUAp0= +github.com/google/martian/v3 v3.1.0/go.mod h1:y5Zk1BBys9G+gd6Jrk0W3cC1+ELVxBWuIGO+w/tUAp0= +github.com/google/martian/v3 v3.2.1/go.mod h1:oBOf6HBosgwRXnUGWUB05QECsc6uvmMiJ3+6W4l/CUk= +github.com/google/martian/v3 v3.3.2 h1:IqNFLAmvJOgVlpdEBiQbDc2EwKW77amAycfTuWKdfvw= +github.com/google/martian/v3 v3.3.2/go.mod h1:oBOf6HBosgwRXnUGWUB05QECsc6uvmMiJ3+6W4l/CUk= +github.com/google/pprof v0.0.0-20181206194817-3ea8567a2e57/go.mod h1:zfwlbNMJ+OItoe0UupaVj+oy1omPYYDuagoSzA8v9mc= +github.com/google/pprof v0.0.0-20190515194954-54271f7e092f/go.mod h1:zfwlbNMJ+OItoe0UupaVj+oy1omPYYDuagoSzA8v9mc= +github.com/google/pprof v0.0.0-20191218002539-d4f498aebedc/go.mod h1:ZgVRPoUq/hfqzAqh7sHMqb3I9Rq5C59dIz2SbBwJ4eM= +github.com/google/pprof v0.0.0-20200212024743-f11f1df84d12/go.mod h1:ZgVRPoUq/hfqzAqh7sHMqb3I9Rq5C59dIz2SbBwJ4eM= +github.com/google/pprof 
v0.0.0-20200229191704-1ebb73c60ed3/go.mod h1:ZgVRPoUq/hfqzAqh7sHMqb3I9Rq5C59dIz2SbBwJ4eM= +github.com/google/pprof v0.0.0-20200430221834-fc25d7d30c6d/go.mod h1:ZgVRPoUq/hfqzAqh7sHMqb3I9Rq5C59dIz2SbBwJ4eM= +github.com/google/pprof v0.0.0-20200708004538-1a94d8640e99/go.mod h1:ZgVRPoUq/hfqzAqh7sHMqb3I9Rq5C59dIz2SbBwJ4eM= +github.com/google/pprof v0.0.0-20201023163331-3e6fc7fc9c4c/go.mod h1:kpwsk12EmLew5upagYY7GY0pfYCcupk39gWOCRROcvE= +github.com/google/pprof v0.0.0-20201203190320-1bf35d6f28c2/go.mod h1:kpwsk12EmLew5upagYY7GY0pfYCcupk39gWOCRROcvE= +github.com/google/pprof v0.0.0-20210122040257-d980be63207e/go.mod h1:kpwsk12EmLew5upagYY7GY0pfYCcupk39gWOCRROcvE= +github.com/google/pprof v0.0.0-20210226084205-cbba55b83ad5/go.mod h1:kpwsk12EmLew5upagYY7GY0pfYCcupk39gWOCRROcvE= +github.com/google/pprof v0.0.0-20210601050228-01bbb1931b22/go.mod h1:kpwsk12EmLew5upagYY7GY0pfYCcupk39gWOCRROcvE= +github.com/google/pprof v0.0.0-20210609004039-a478d1d731e9/go.mod h1:kpwsk12EmLew5upagYY7GY0pfYCcupk39gWOCRROcvE= +github.com/google/pprof v0.0.0-20210720184732-4bb14d4b1be1/go.mod h1:kpwsk12EmLew5upagYY7GY0pfYCcupk39gWOCRROcvE= +github.com/google/renameio v0.1.0/go.mod h1:KWCgfxg9yswjAJkECMjeO8J8rahYeXnNhOm40UhjYkI= +github.com/google/uuid v1.1.2/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo= +github.com/google/uuid v1.3.0 h1:t6JiXgmwXMjEs8VusXIJk2BXHsn+wx8BZdTaoZ5fu7I= +github.com/google/uuid v1.3.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo= +github.com/googleapis/enterprise-certificate-proxy v0.0.0-20220520183353-fd19c99a87aa/go.mod h1:17drOmN3MwGY7t0e+Ei9b45FFGA3fBs3x36SsCg1hq8= +github.com/googleapis/enterprise-certificate-proxy v0.1.0/go.mod h1:17drOmN3MwGY7t0e+Ei9b45FFGA3fBs3x36SsCg1hq8= +github.com/googleapis/enterprise-certificate-proxy v0.2.0/go.mod h1:8C0jb7/mgJe/9KK8Lm7X9ctZC2t60YyIpYEI16jx0Qg= +github.com/googleapis/enterprise-certificate-proxy v0.2.3 h1:yk9/cqRKtT9wXZSsRH9aurXEpJX+U6FLtpYTdC3R06k= +github.com/googleapis/enterprise-certificate-proxy 
v0.2.3/go.mod h1:AwSRAtLfXpU5Nm3pW+v7rGDHp09LsPtGY9MduiEsR9k= +github.com/googleapis/gax-go/v2 v2.0.4/go.mod h1:0Wqv26UfaUD9n4G6kQubkQ+KchISgw+vpHVxEJEs9eg= +github.com/googleapis/gax-go/v2 v2.0.5/go.mod h1:DWXyrwAJ9X0FpwwEdw+IPEYBICEFu5mhpdKc/us6bOk= +github.com/googleapis/gax-go/v2 v2.1.0/go.mod h1:Q3nei7sK6ybPYH7twZdmQpAd1MKb7pfu6SK+H1/DsU0= +github.com/googleapis/gax-go/v2 v2.1.1/go.mod h1:hddJymUZASv3XPyGkUpKj8pPO47Rmb0eJc8R6ouapiM= +github.com/googleapis/gax-go/v2 v2.2.0/go.mod h1:as02EH8zWkzwUoLbBaFeQ+arQaj/OthfcblKl4IGNaM= +github.com/googleapis/gax-go/v2 v2.3.0/go.mod h1:b8LNqSzNabLiUpXKkY7HAR5jr6bIT99EXz9pXxye9YM= +github.com/googleapis/gax-go/v2 v2.4.0/go.mod h1:XOTVJ59hdnfJLIP/dh8n5CGryZR2LxK9wbMD5+iXC6c= +github.com/googleapis/gax-go/v2 v2.5.1/go.mod h1:h6B0KMMFNtI2ddbGJn3T3ZbwkeT6yqEF02fYlzkUCyo= +github.com/googleapis/gax-go/v2 v2.6.0/go.mod h1:1mjbznJAPHFpesgE5ucqfYEscaz5kMdcIDwU/6+DDoY= +github.com/googleapis/gax-go/v2 v2.7.1 h1:gF4c0zjUP2H/s/hEGyLA3I0fA2ZWjzYiONAD6cvPr8A= +github.com/googleapis/gax-go/v2 v2.7.1/go.mod h1:4orTrqY6hXxxaUL4LHIPl6lGo8vAE38/qKbhSAKP6QI= +github.com/googleapis/go-type-adapters v1.0.0/go.mod h1:zHW75FOG2aur7gAO2B+MLby+cLsWGBF62rFAi7WjWO4= +github.com/grpc-ecosystem/grpc-gateway v1.16.0/go.mod h1:BDjrQk3hbvj6Nolgz8mAMFbcEtjT1g+wF4CSlocrBnw= +github.com/gruntwork-io/terratest v0.47.2 h1:t6iWwsqJH7Gx0RwXleU/vjc+2c0JXRMdj3DxYXTBssQ= +github.com/gruntwork-io/terratest v0.47.2/go.mod h1:LnYX8BN5WxUMpDr8rtD39oToSL4CBERWSCusbJ0d/64= +github.com/hashicorp/errwrap v1.0.0 h1:hLrqtEDnRye3+sgx6z4qVLNuviH3MR5aQ0ykNJa/UYA= +github.com/hashicorp/errwrap v1.0.0/go.mod h1:YH+1FKiLXxHSkmPseP+kNlulaMuP3n2brvKWEqk/Jc4= +github.com/hashicorp/go-cleanhttp v0.5.2 h1:035FKYIWjmULyFRBKPs8TBQoi0x6d9G4xc9neXJWAZQ= +github.com/hashicorp/go-cleanhttp v0.5.2/go.mod h1:kO/YDlP8L1346E6Sodw+PrpBSV4/SoxCXGY6BqNFT48= +github.com/hashicorp/go-getter v1.7.6 h1:5jHuM+aH373XNtXl9TNTUH5Qd69Trve11tHIrB+6yj4= +github.com/hashicorp/go-getter v1.7.6/go.mod 
h1:W7TalhMmbPmsSMdNjD0ZskARur/9GJ17cfHTRtXV744= +github.com/hashicorp/go-multierror v1.1.0 h1:B9UzwGQJehnUY1yNrnwREHc3fGbC2xefo8g4TbElacI= +github.com/hashicorp/go-multierror v1.1.0/go.mod h1:spPvp8C1qA32ftKqdAHm4hHTbPw+vmowP0z+KUhOZdA= +github.com/hashicorp/go-safetemp v1.0.0 h1:2HR189eFNrjHQyENnQMMpCiBAsRxzbTMIgBhEyExpmo= +github.com/hashicorp/go-safetemp v1.0.0/go.mod h1:oaerMy3BhqiTbVye6QuFhFtIceqFoDHxNAB65b+Rj1I= +github.com/hashicorp/go-version v1.3.0/go.mod h1:fltr4n8CU8Ke44wwGCBoEymUuxUHl09ZGVZPK5anwXA= +github.com/hashicorp/go-version v1.6.0 h1:feTTfFNnjP967rlCxM/I9g701jU+RN74YKx2mOkIeek= +github.com/hashicorp/go-version v1.6.0/go.mod h1:fltr4n8CU8Ke44wwGCBoEymUuxUHl09ZGVZPK5anwXA= +github.com/hashicorp/golang-lru v0.5.0/go.mod h1:/m3WP610KZHVQ1SGc6re/UDhFvYD7pJ4Ao+sR/qLZy8= +github.com/hashicorp/golang-lru v0.5.1/go.mod h1:/m3WP610KZHVQ1SGc6re/UDhFvYD7pJ4Ao+sR/qLZy8= +github.com/hashicorp/hcl/v2 v2.9.1 h1:eOy4gREY0/ZQHNItlfuEZqtcQbXIxzojlP301hDpnac= +github.com/hashicorp/hcl/v2 v2.9.1/go.mod h1:FwWsfWEjyV/CMj8s/gqAuiviY72rJ1/oayI9WftqcKg= +github.com/hashicorp/terraform-json v0.13.0 h1:Li9L+lKD1FO5RVFRM1mMMIBDoUHslOniyEi5CM+FWGY= +github.com/hashicorp/terraform-json v0.13.0/go.mod h1:y5OdLBCT+rxbwnpxZs9kGL7R9ExU76+cpdY8zHwoazk= +github.com/ianlancetaylor/demangle v0.0.0-20181102032728-5e5cf60278f6/go.mod h1:aSSvb/t6k1mPoxDqO4vJh6VOCGPwU4O0C2/Eqndh1Sc= +github.com/ianlancetaylor/demangle v0.0.0-20200824232613-28f6c0f3b639/go.mod h1:aSSvb/t6k1mPoxDqO4vJh6VOCGPwU4O0C2/Eqndh1Sc= +github.com/jinzhu/copier v0.0.0-20190924061706-b57f9002281a h1:zPPuIq2jAWWPTrGt70eK/BSch+gFAGrNzecsoENgu2o= +github.com/jinzhu/copier v0.0.0-20190924061706-b57f9002281a/go.mod h1:yL958EeXv8Ylng6IfnvG4oflryUi3vgA3xPs9hmII1s= +github.com/jmespath/go-jmespath v0.4.0 h1:BEgLn5cpjn8UN1mAw4NjwDrS35OdebyEtFe+9YPoQUg= +github.com/jmespath/go-jmespath v0.4.0/go.mod h1:T8mJZnbsbmF+m6zOOFylbeCJqk5+pHWvzYPziyZiYoo= +github.com/jmespath/go-jmespath/internal/testify v1.5.1 
h1:shLQSRRSCCPj3f2gpwzGwWFoC7ycTf1rcQZHOlsJ6N8= +github.com/jmespath/go-jmespath/internal/testify v1.5.1/go.mod h1:L3OGu8Wl2/fWfCI6z80xFu9LTZmf1ZRjMHUOPmWr69U= +github.com/jstemmer/go-junit-report v0.0.0-20190106144839-af01ea7f8024/go.mod h1:6v2b51hI/fHJwM22ozAgKL4VKDeJcHhJFhtBdhmNjmU= +github.com/jstemmer/go-junit-report v0.9.1/go.mod h1:Brl9GWCQeLvo8nXZwPNNblvFj/XSXhF0NWZEnDohbsk= +github.com/kisielk/gotool v1.0.0/go.mod h1:XhKaO+MFFWcvkIS/tQcRk01m1F5IRFswLeQ+oQHNcck= +github.com/klauspost/compress v1.15.11 h1:Lcadnb3RKGin4FYM/orgq0qde+nc15E5Cbqg4B9Sx9c= +github.com/klauspost/compress v1.15.11/go.mod h1:QPwzmACJjUTFsnSHH934V6woptycfrDDJnH7hvFVbGM= +github.com/kr/pretty v0.1.0/go.mod h1:dAy3ld7l9f0ibDNOQOHHMYYIIbhfbHSm3C4ZsoJORNo= +github.com/kr/pretty v0.2.1 h1:Fmg33tUaq4/8ym9TJN1x7sLJnHVwhP33CNkpYV/7rwI= +github.com/kr/pretty v0.2.1/go.mod h1:ipq/a2n7PKx3OHsz4KJII5eveXtPO4qwEXGdVfWzfnI= +github.com/kr/pty v1.1.1/go.mod h1:pFQYn66WHrOpPYNljwOMqo10TkYh1fy3cYio2l3bCsQ= +github.com/kr/text v0.1.0/go.mod h1:4Jbv+DJW3UT/LiOwJeYQe1efqtUx/iVham/4vfdArNI= +github.com/kr/text v0.2.0 h1:5Nx0Ya0ZqY2ygV366QzturHI13Jq95ApcVaJBhpS+AY= +github.com/kr/text v0.2.0/go.mod h1:eLer722TekiGuMkidMxC/pM04lWEeraHUUmBw8l2grE= +github.com/kylelemons/godebug v0.0.0-20170820004349-d65d576e9348/go.mod h1:B69LEHPfb2qLo0BaaOLcbitczOKLWTsrBG9LczfCD4k= +github.com/kylelemons/godebug v1.1.0 h1:RPNrshWIDI6G2gRW9EHilWtl7Z6Sb1BR0xunSBf0SNc= +github.com/kylelemons/godebug v1.1.0/go.mod h1:9/0rRGxNHcop5bhtWyNeEfOS8JIWk580+fNqagV/RAw= +github.com/mattn/go-colorable v0.0.9/go.mod h1:9vuHe8Xs5qXnSaW/c/ABM9alt+Vo+STaOChaDxuIBZU= +github.com/mattn/go-isatty v0.0.4/go.mod h1:M+lRXTBqGeGNdLjl/ufCoiOlB5xdOkqRJdNxMWT7Zi4= +github.com/mattn/go-runewidth v0.0.4/go.mod h1:LwmH8dsx7+W8Uxz3IHJYH5QSwggIsqBzpuz5H//U1FU= +github.com/mattn/go-zglob v0.0.2-0.20190814121620-e3c945676326 h1:ofNAzWCcyTALn2Zv40+8XitdzCgXY6e9qvXwN9W0YXg= +github.com/mattn/go-zglob v0.0.2-0.20190814121620-e3c945676326/go.mod 
h1:9fxibJccNxU2cnpIKLRRFA7zX7qhkJIQWBb449FYHOo= +github.com/mitchellh/copystructure v1.2.0/go.mod h1:qLl+cE2AmVv+CoeAwDPye/v+N2HKCj9FbZEVFJRxO9s= +github.com/mitchellh/go-homedir v1.1.0 h1:lukF9ziXFxDFPkA1vsr5zpc1XuPDn/wFntq5mG+4E0Y= +github.com/mitchellh/go-homedir v1.1.0/go.mod h1:SfyaCUpYCn1Vlf4IUYiD9fPX4A5wJrkLzIz1N1q0pr0= +github.com/mitchellh/go-testing-interface v1.14.1 h1:jrgshOhYAUVNMAJiKbEu7EqAwgJJ2JqpQmpLJOu07cU= +github.com/mitchellh/go-testing-interface v1.14.1/go.mod h1:gfgS7OtZj6MA4U1UrDRp04twqAjfvlZyCfX3sDjEym8= +github.com/mitchellh/go-wordwrap v0.0.0-20150314170334-ad45545899c7/go.mod h1:ZXFpozHsX6DPmq2I0TCekCxypsnAUbP2oI0UX1GXzOo= +github.com/mitchellh/go-wordwrap v1.0.1 h1:TLuKupo69TCn6TQSyGxwI1EblZZEsQ0vMlAFQflz0v0= +github.com/mitchellh/go-wordwrap v1.0.1/go.mod h1:R62XHJLzvMFRBbcrT7m7WgmE1eOyTSsCt+hzestvNj0= +github.com/mitchellh/reflectwalk v1.0.2/go.mod h1:mSTlrgnPZtwu0c4WaC2kGObEpuNDbx0jmZXqmk4esnw= +github.com/pkg/errors v0.9.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0= +github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM= +github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4= +github.com/prometheus/client_model v0.0.0-20190812154241-14fe0d1b01d4/go.mod h1:xMI15A0UPsDsEKsMN9yxemIoYk6Tm2C1GtYGdfGttqA= +github.com/rogpeppe/fastuuid v1.2.0/go.mod h1:jVj6XXZzXRy/MSR5jhDC/2q6DgLz+nrA6LYCDYWNEvQ= +github.com/rogpeppe/go-internal v1.3.0/go.mod h1:M8bDsm7K2OlrFYOpmOWEs/qY81heoFRclV5y23lUDJ4= +github.com/sebdah/goldie v1.0.0/go.mod h1:jXP4hmWywNEwZzhMuv2ccnqTSFpuq8iyQhtQdkkZBH4= +github.com/sergi/go-diff v1.0.0/go.mod h1:0CfEIISq7TuYL3j771MWULgwwjU+GofnZX9QAmXWZgo= +github.com/spaolacci/murmur3 v0.0.0-20180118202830-f09979ecbc72/go.mod h1:JwIasOWyU6f++ZhiEuf87xNszmSA2myDM2Kzu9HwQUA= +github.com/spf13/pflag v1.0.2/go.mod h1:DYY7MBk1bdzusC3SYhjObp+wFpr4gzcvqqNjLnInEg4= +github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME= 
+github.com/stretchr/objx v0.4.0/go.mod h1:YvHI0jy2hoMjB+UWwv71VJQ9isScKT/TqJzVSSt89Yw= +github.com/stretchr/objx v0.5.0/go.mod h1:Yh+to48EsGEfYuaHDzXPcE3xhTkx73EhmCGUpEOglKo= +github.com/stretchr/testify v1.2.2/go.mod h1:a8OnRcib4nhh0OaRAV+Yts87kKdq0PP7pXfy6kDkUVs= +github.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI= +github.com/stretchr/testify v1.4.0/go.mod h1:j7eGeouHqKxXV5pUuKE4zz7dFj8WfuZ+81PSLYec5m4= +github.com/stretchr/testify v1.5.1/go.mod h1:5W2xD1RspED5o8YsWQXVCued0rvSQ+mT+I5cxcmMvtA= +github.com/stretchr/testify v1.6.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg= +github.com/stretchr/testify v1.7.0/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg= +github.com/stretchr/testify v1.7.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg= +github.com/stretchr/testify v1.8.0/go.mod h1:yNjHg4UonilssWZ8iaSj1OCr/vHnekPRkoO+kdMU+MU= +github.com/stretchr/testify v1.8.1/go.mod h1:w2LPCIKwWwSfY2zedu0+kehJoqGctiVI29o6fzry7u4= +github.com/stretchr/testify v1.9.0 h1:HtqpIVDClZ4nwg75+f6Lvsy/wHu+3BoSGCbBAcpTsTg= +github.com/stretchr/testify v1.9.0/go.mod h1:r2ic/lqez/lEtzL7wO/rwa5dbSLXVDPFyf8C91i36aY= +github.com/tmccombs/hcl2json v0.3.3 h1:+DLNYqpWE0CsOQiEZu+OZm5ZBImake3wtITYxQ8uLFQ= +github.com/tmccombs/hcl2json v0.3.3/go.mod h1:Y2chtz2x9bAeRTvSibVRVgbLJhLJXKlUeIvjeVdnm4w= +github.com/ulikunitz/xz v0.5.10 h1:t92gobL9l3HE202wg3rlk19F6X+JOxl9BBrCCMYEYd8= +github.com/ulikunitz/xz v0.5.10/go.mod h1:nbz6k7qbPmH4IRqmfOplQw/tblSgqTqBwxkY0oWt/14= +github.com/vmihailenco/msgpack v3.3.3+incompatible/go.mod h1:fy3FlTQTDXWkZ7Bh6AcGMlsjHatGryHQYUTf1ShIgkk= +github.com/vmihailenco/msgpack/v4 v4.3.12/go.mod h1:gborTTJjAo/GWTqqRjrLCn9pgNN+NXzzngzBKDPIqw4= +github.com/vmihailenco/tagparser v0.1.1/go.mod h1:OeAg3pn3UbLjkWt+rN9oFYB6u/cQgqMEUPoW2WPyhdI= +github.com/yuin/goldmark v1.1.25/go.mod h1:3hX8gzYuyVAZsxl0MRgGTJEmQBFcNTphYh9decYSb74= +github.com/yuin/goldmark v1.1.27/go.mod h1:3hX8gzYuyVAZsxl0MRgGTJEmQBFcNTphYh9decYSb74= 
+github.com/yuin/goldmark v1.1.32/go.mod h1:3hX8gzYuyVAZsxl0MRgGTJEmQBFcNTphYh9decYSb74= +github.com/yuin/goldmark v1.2.1/go.mod h1:3hX8gzYuyVAZsxl0MRgGTJEmQBFcNTphYh9decYSb74= +github.com/yuin/goldmark v1.3.5/go.mod h1:mwnBkeHKe2W/ZEtQ+71ViKU8L12m81fl3OWwC1Zlc8k= +github.com/yuin/goldmark v1.4.13/go.mod h1:6yULJ656Px+3vBD8DxQVa3kxgyrAnzto9xy5taEt/CY= +github.com/zclconf/go-cty v1.2.0/go.mod h1:hOPWgoHbaTUnI5k4D2ld+GRpFJSCe6bCM7m1q/N4PQ8= +github.com/zclconf/go-cty v1.8.0/go.mod h1:vVKLxnk3puL4qRAv72AO+W99LUD4da90g3uUAzyuvAk= +github.com/zclconf/go-cty v1.8.1/go.mod h1:vVKLxnk3puL4qRAv72AO+W99LUD4da90g3uUAzyuvAk= +github.com/zclconf/go-cty v1.9.1 h1:viqrgQwFl5UpSxc046qblj78wZXVDFnSOufaOTER+cc= +github.com/zclconf/go-cty v1.9.1/go.mod h1:vVKLxnk3puL4qRAv72AO+W99LUD4da90g3uUAzyuvAk= +github.com/zclconf/go-cty-debug v0.0.0-20191215020915-b22d67c1ba0b/go.mod h1:ZRKQfBXbGkpdV6QMzT3rU1kSTAnfu1dO8dPKjYprgj8= +go.opencensus.io v0.21.0/go.mod h1:mSImk1erAIZhrmZN+AvHh14ztQfjbGwt4TtuofqLduU= +go.opencensus.io v0.22.0/go.mod h1:+kGneAE2xo2IficOXnaByMWTGM9T73dGwxeWcUqIpI8= +go.opencensus.io v0.22.2/go.mod h1:yxeiOL68Rb0Xd1ddK5vPZ/oVn4vY4Ynel7k9FzqtOIw= +go.opencensus.io v0.22.3/go.mod h1:yxeiOL68Rb0Xd1ddK5vPZ/oVn4vY4Ynel7k9FzqtOIw= +go.opencensus.io v0.22.4/go.mod h1:yxeiOL68Rb0Xd1ddK5vPZ/oVn4vY4Ynel7k9FzqtOIw= +go.opencensus.io v0.22.5/go.mod h1:5pWMHQbX5EPX2/62yrJeAkowc+lfs/XD7Uxpq3pI6kk= +go.opencensus.io v0.23.0/go.mod h1:XItmlyltB5F7CS4xOC1DcqMoFqwtC6OG2xF7mCv7P7E= +go.opencensus.io v0.24.0 h1:y73uSU6J157QMP2kn2r30vwW1A2W2WFwSCGnAVxeaD0= +go.opencensus.io v0.24.0/go.mod h1:vNK8G9p7aAivkbmorf4v+7Hgx+Zs0yY+0fOtgBfjQKo= +go.opentelemetry.io/proto/otlp v0.7.0/go.mod h1:PqfVotwruBrMGOCsRd/89rSnXhoiJIqeYNgFYFoEGnI= +golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2/go.mod h1:djNgcEr1/C05ACkg1iLfiJU5Ep61QUkGW8qpdssI0+w= +golang.org/x/crypto v0.0.0-20190426145343-a29dc8fdc734/go.mod h1:yigFU9vqHzYiE8UmvKecakEJjdnWj3jj499lnFckfCI= +golang.org/x/crypto 
v0.0.0-20190510104115-cbcb75029529/go.mod h1:yigFU9vqHzYiE8UmvKecakEJjdnWj3jj499lnFckfCI= +golang.org/x/crypto v0.0.0-20190605123033-f99c8df09eb5/go.mod h1:yigFU9vqHzYiE8UmvKecakEJjdnWj3jj499lnFckfCI= +golang.org/x/crypto v0.0.0-20191011191535-87dc89f01550/go.mod h1:yigFU9vqHzYiE8UmvKecakEJjdnWj3jj499lnFckfCI= +golang.org/x/crypto v0.0.0-20200622213623-75b288015ac9/go.mod h1:LzIPMQfyMNhhGPhUkYOs5KpL4U8rLKemX1yGLhDgUto= +golang.org/x/crypto v0.0.0-20210921155107-089bfa567519/go.mod h1:GvvjBRRGRdwPK5ydBHafDWAxML/pGHZbMvKqRZ5+Abc= +golang.org/x/crypto v0.21.0 h1:X31++rzVUdKhX5sWmSOFZxx8UW/ldWx55cbf08iNAMA= +golang.org/x/crypto v0.21.0/go.mod h1:0BP7YvVV9gBbVKyeTG0Gyn+gZm94bibOW5BjDEYAOMs= +golang.org/x/exp v0.0.0-20190121172915-509febef88a4/go.mod h1:CJ0aWSM057203Lf6IL+f9T1iT9GByDxfZKAQTCR3kQA= +golang.org/x/exp v0.0.0-20190306152737-a1d7652674e8/go.mod h1:CJ0aWSM057203Lf6IL+f9T1iT9GByDxfZKAQTCR3kQA= +golang.org/x/exp v0.0.0-20190510132918-efd6b22b2522/go.mod h1:ZjyILWgesfNpC6sMxTJOJm9Kp84zZh5NQWvqDGG3Qr8= +golang.org/x/exp v0.0.0-20190829153037-c13cbed26979/go.mod h1:86+5VVa7VpoJ4kLfm080zCjGlMRFzhUhsZKEZO7MGek= +golang.org/x/exp v0.0.0-20191030013958-a1ab85dbe136/go.mod h1:JXzH8nQsPlswgeRAPE3MuO9GYsAcnJvJ4vnMwN/5qkY= +golang.org/x/exp v0.0.0-20191129062945-2f5052295587/go.mod h1:2RIsYlXP63K8oxa1u096TMicItID8zy7Y6sNkU49FU4= +golang.org/x/exp v0.0.0-20191227195350-da58074b4299/go.mod h1:2RIsYlXP63K8oxa1u096TMicItID8zy7Y6sNkU49FU4= +golang.org/x/exp v0.0.0-20200119233911-0405dc783f0a/go.mod h1:2RIsYlXP63K8oxa1u096TMicItID8zy7Y6sNkU49FU4= +golang.org/x/exp v0.0.0-20200207192155-f17229e696bd/go.mod h1:J/WKrq2StrnmMY6+EHIKF9dgMWnmCNThgcyBT1FY9mM= +golang.org/x/exp v0.0.0-20200224162631-6cc2880d07d6/go.mod h1:3jZMyOhIsHpP37uCMkUooju7aAi5cS1Q23tOzKc+0MU= +golang.org/x/image v0.0.0-20190227222117-0694c2d4d067/go.mod h1:kZ7UVZpmo3dzQBMxlp+ypCbDeSB+sBbTgSJuh5dn5js= +golang.org/x/image v0.0.0-20190802002840-cff245a6509b/go.mod h1:FeLwcggjj3mMvU+oOTbSwawSJRM1uh48EjtB4UJZlP0= 
+golang.org/x/lint v0.0.0-20181026193005-c67002cb31c3/go.mod h1:UVdnD1Gm6xHRNCYTkRU2/jEulfH38KcIWyp/GAMgvoE= +golang.org/x/lint v0.0.0-20190227174305-5b3e6a55c961/go.mod h1:wehouNa3lNwaWXcvxsM5YxQ5yQlVC4a0KAMCusXpPoU= +golang.org/x/lint v0.0.0-20190301231843-5614ed5bae6f/go.mod h1:UVdnD1Gm6xHRNCYTkRU2/jEulfH38KcIWyp/GAMgvoE= +golang.org/x/lint v0.0.0-20190313153728-d0100b6bd8b3/go.mod h1:6SW0HCj/g11FgYtHlgUYUwCkIfeOF89ocIRzGO/8vkc= +golang.org/x/lint v0.0.0-20190409202823-959b441ac422/go.mod h1:6SW0HCj/g11FgYtHlgUYUwCkIfeOF89ocIRzGO/8vkc= +golang.org/x/lint v0.0.0-20190909230951-414d861bb4ac/go.mod h1:6SW0HCj/g11FgYtHlgUYUwCkIfeOF89ocIRzGO/8vkc= +golang.org/x/lint v0.0.0-20190930215403-16217165b5de/go.mod h1:6SW0HCj/g11FgYtHlgUYUwCkIfeOF89ocIRzGO/8vkc= +golang.org/x/lint v0.0.0-20191125180803-fdd1cda4f05f/go.mod h1:5qLYkcX4OjUUV8bRuDixDT3tpyyb+LUpUlRWLxfhWrs= +golang.org/x/lint v0.0.0-20200130185559-910be7a94367/go.mod h1:3xt1FjdF8hUf6vQPIChWIBhFzV8gjjsPE/fR3IyQdNY= +golang.org/x/lint v0.0.0-20200302205851-738671d3881b/go.mod h1:3xt1FjdF8hUf6vQPIChWIBhFzV8gjjsPE/fR3IyQdNY= +golang.org/x/lint v0.0.0-20201208152925-83fdc39ff7b5/go.mod h1:3xt1FjdF8hUf6vQPIChWIBhFzV8gjjsPE/fR3IyQdNY= +golang.org/x/lint v0.0.0-20210508222113-6edffad5e616/go.mod h1:3xt1FjdF8hUf6vQPIChWIBhFzV8gjjsPE/fR3IyQdNY= +golang.org/x/mobile v0.0.0-20190312151609-d3739f865fa6/go.mod h1:z+o9i4GpDbdi3rU15maQ/Ox0txvL9dWGYEHz965HBQE= +golang.org/x/mobile v0.0.0-20190719004257-d2bd2a29d028/go.mod h1:E/iHnbuqvinMTCcRqshq8CkpyQDoeVncDDYHnLhea+o= +golang.org/x/mod v0.0.0-20190513183733-4bf6d317e70e/go.mod h1:mXi4GBBbnImb6dmsKGUJ2LatrhH/nqhxcFungHvyanc= +golang.org/x/mod v0.1.0/go.mod h1:0QHyrYULN0/3qlju5TqG8bIK38QM8yzMo5ekMj3DlcY= +golang.org/x/mod v0.1.1-0.20191105210325-c90efee705ee/go.mod h1:QqPTAvyqsEbceGzBzNggFXnrqF1CaUcvgkdR5Ot7KZg= +golang.org/x/mod v0.1.1-0.20191107180719-034126e5016b/go.mod h1:QqPTAvyqsEbceGzBzNggFXnrqF1CaUcvgkdR5Ot7KZg= +golang.org/x/mod v0.2.0/go.mod 
h1:s0Qsj1ACt9ePp/hMypM3fl4fZqREWJwdYDEqhRiZZUA= +golang.org/x/mod v0.3.0/go.mod h1:s0Qsj1ACt9ePp/hMypM3fl4fZqREWJwdYDEqhRiZZUA= +golang.org/x/mod v0.4.0/go.mod h1:s0Qsj1ACt9ePp/hMypM3fl4fZqREWJwdYDEqhRiZZUA= +golang.org/x/mod v0.4.1/go.mod h1:s0Qsj1ACt9ePp/hMypM3fl4fZqREWJwdYDEqhRiZZUA= +golang.org/x/mod v0.4.2/go.mod h1:s0Qsj1ACt9ePp/hMypM3fl4fZqREWJwdYDEqhRiZZUA= +golang.org/x/mod v0.6.0-dev.0.20220419223038-86c51ed26bb4/go.mod h1:jJ57K6gSWd91VN4djpZkiMVwK6gcyfeH4XE8wZrZaV4= +golang.org/x/net v0.0.0-20180724234803-3673e40ba225/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4= +golang.org/x/net v0.0.0-20180811021610-c39426892332/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4= +golang.org/x/net v0.0.0-20180826012351-8a410e7b638d/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4= +golang.org/x/net v0.0.0-20190108225652-1e06a53dbb7e/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4= +golang.org/x/net v0.0.0-20190213061140-3a22650c66bd/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4= +golang.org/x/net v0.0.0-20190311183353-d8887717615a/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg= +golang.org/x/net v0.0.0-20190404232315-eb5bcb51f2a3/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg= +golang.org/x/net v0.0.0-20190501004415-9ce7a6920f09/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg= +golang.org/x/net v0.0.0-20190503192946-f4e77d36d62c/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg= +golang.org/x/net v0.0.0-20190603091049-60506f45cf65/go.mod h1:HSz+uSET+XFnRR8LxR5pz3Of3rY3CfYBVs4xY44aLks= +golang.org/x/net v0.0.0-20190620200207-3b0461eec859/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s= +golang.org/x/net v0.0.0-20190628185345-da137c7871d7/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s= +golang.org/x/net v0.0.0-20190724013045-ca1201d0de80/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s= +golang.org/x/net v0.0.0-20191209160850-c0dbc17a3553/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s= 
+golang.org/x/net v0.0.0-20200114155413-6afb5195e5aa/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s= +golang.org/x/net v0.0.0-20200202094626-16171245cfb2/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s= +golang.org/x/net v0.0.0-20200222125558-5a598a2470a0/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s= +golang.org/x/net v0.0.0-20200226121028-0de0cce0169b/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s= +golang.org/x/net v0.0.0-20200301022130-244492dfa37a/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s= +golang.org/x/net v0.0.0-20200324143707-d3edc9973b7e/go.mod h1:qpuaurCH72eLCgpAm/N6yyVIVM9cpaDIP3A8BGJEC5A= +golang.org/x/net v0.0.0-20200501053045-e0ff5e5a1de5/go.mod h1:qpuaurCH72eLCgpAm/N6yyVIVM9cpaDIP3A8BGJEC5A= +golang.org/x/net v0.0.0-20200506145744-7e3656a0809f/go.mod h1:qpuaurCH72eLCgpAm/N6yyVIVM9cpaDIP3A8BGJEC5A= +golang.org/x/net v0.0.0-20200513185701-a91f0712d120/go.mod h1:qpuaurCH72eLCgpAm/N6yyVIVM9cpaDIP3A8BGJEC5A= +golang.org/x/net v0.0.0-20200520182314-0ba52f642ac2/go.mod h1:qpuaurCH72eLCgpAm/N6yyVIVM9cpaDIP3A8BGJEC5A= +golang.org/x/net v0.0.0-20200625001655-4c5254603344/go.mod h1:/O7V0waA8r7cgGh81Ro3o1hOxt32SMVPicZroKQ2sZA= +golang.org/x/net v0.0.0-20200707034311-ab3426394381/go.mod h1:/O7V0waA8r7cgGh81Ro3o1hOxt32SMVPicZroKQ2sZA= +golang.org/x/net v0.0.0-20200822124328-c89045814202/go.mod h1:/O7V0waA8r7cgGh81Ro3o1hOxt32SMVPicZroKQ2sZA= +golang.org/x/net v0.0.0-20201021035429-f5854403a974/go.mod h1:sp8m0HH+o8qH0wwXwYZr8TS3Oi6o0r6Gce1SSxlDquU= +golang.org/x/net v0.0.0-20201031054903-ff519b6c9102/go.mod h1:sp8m0HH+o8qH0wwXwYZr8TS3Oi6o0r6Gce1SSxlDquU= +golang.org/x/net v0.0.0-20201110031124-69a78807bb2b/go.mod h1:sp8m0HH+o8qH0wwXwYZr8TS3Oi6o0r6Gce1SSxlDquU= +golang.org/x/net v0.0.0-20201209123823-ac852fbbde11/go.mod h1:m0MpNAwzfU5UDzcl9v0D8zg8gWTRqZa9RBIspLL5mdg= +golang.org/x/net v0.0.0-20210119194325-5f4716e94777/go.mod h1:m0MpNAwzfU5UDzcl9v0D8zg8gWTRqZa9RBIspLL5mdg= +golang.org/x/net 
v0.0.0-20210226172049-e18ecbb05110/go.mod h1:m0MpNAwzfU5UDzcl9v0D8zg8gWTRqZa9RBIspLL5mdg= +golang.org/x/net v0.0.0-20210316092652-d523dce5a7f4/go.mod h1:RBQZq4jEuRlivfhVLdyRGr576XBO4/greRjx4P4O3yc= +golang.org/x/net v0.0.0-20210405180319-a5a99cb37ef4/go.mod h1:p54w0d4576C0XHj96bSt6lcn1PtDYWL6XObtHCRCNQM= +golang.org/x/net v0.0.0-20210503060351-7fd8e65b6420/go.mod h1:9nx3DQGgdP8bBQD5qxJ1jj9UTztislL4KSBs9R2vV5Y= +golang.org/x/net v0.0.0-20220127200216-cd36cc0744dd/go.mod h1:CfG3xpIq0wQ8r1q4Su4UZFWDARRcnwPjda9FqA0JpMk= +golang.org/x/net v0.0.0-20220225172249-27dd8689420f/go.mod h1:CfG3xpIq0wQ8r1q4Su4UZFWDARRcnwPjda9FqA0JpMk= +golang.org/x/net v0.0.0-20220325170049-de3da57026de/go.mod h1:CfG3xpIq0wQ8r1q4Su4UZFWDARRcnwPjda9FqA0JpMk= +golang.org/x/net v0.0.0-20220412020605-290c469a71a5/go.mod h1:CfG3xpIq0wQ8r1q4Su4UZFWDARRcnwPjda9FqA0JpMk= +golang.org/x/net v0.0.0-20220425223048-2871e0cb64e4/go.mod h1:CfG3xpIq0wQ8r1q4Su4UZFWDARRcnwPjda9FqA0JpMk= +golang.org/x/net v0.0.0-20220607020251-c690dde0001d/go.mod h1:XRhObCWvk6IyKnWLug+ECip1KBveYUHfp+8e9klMJ9c= +golang.org/x/net v0.0.0-20220617184016-355a448f1bc9/go.mod h1:XRhObCWvk6IyKnWLug+ECip1KBveYUHfp+8e9klMJ9c= +golang.org/x/net v0.0.0-20220624214902-1bab6f366d9e/go.mod h1:XRhObCWvk6IyKnWLug+ECip1KBveYUHfp+8e9klMJ9c= +golang.org/x/net v0.0.0-20220722155237-a158d28d115b/go.mod h1:XRhObCWvk6IyKnWLug+ECip1KBveYUHfp+8e9klMJ9c= +golang.org/x/net v0.0.0-20220909164309-bea034e7d591/go.mod h1:YDH+HFinaLZZlnHAfSS6ZXJJ9M9t4Dl22yv3iI2vPwk= +golang.org/x/net v0.0.0-20221014081412-f15817d10f9b/go.mod h1:YDH+HFinaLZZlnHAfSS6ZXJJ9M9t4Dl22yv3iI2vPwk= +golang.org/x/net v0.1.0/go.mod h1:Cx3nUiGt4eDBEyega/BKRp+/AlGL8hYe7U9odMt2Cco= +golang.org/x/net v0.23.0 h1:7EYJ93RZ9vYSZAIb2x3lnuvqO5zneoD6IvWjuhfxjTs= +golang.org/x/net v0.23.0/go.mod h1:JKghWKKOSdJwpW2GEx0Ja7fmaKnMsbu+MWVZTokSYmg= +golang.org/x/oauth2 v0.0.0-20180821212333-d2e6202438be/go.mod h1:N/0e6XlmueqKjAGxoOufVs8QHGRruUQn6yWY3a++T0U= +golang.org/x/oauth2 
v0.0.0-20190226205417-e64efc72b421/go.mod h1:gOpvHmFTYa4IltrdGE7lF6nIHvwfUNPOp7c8zoXwtLw= +golang.org/x/oauth2 v0.0.0-20190604053449-0f29369cfe45/go.mod h1:gOpvHmFTYa4IltrdGE7lF6nIHvwfUNPOp7c8zoXwtLw= +golang.org/x/oauth2 v0.0.0-20191202225959-858c2ad4c8b6/go.mod h1:gOpvHmFTYa4IltrdGE7lF6nIHvwfUNPOp7c8zoXwtLw= +golang.org/x/oauth2 v0.0.0-20200107190931-bf48bf16ab8d/go.mod h1:gOpvHmFTYa4IltrdGE7lF6nIHvwfUNPOp7c8zoXwtLw= +golang.org/x/oauth2 v0.0.0-20200902213428-5d25da1a8d43/go.mod h1:KelEdhl1UZF7XfJ4dDtk6s++YSgaE7mD/BuKKDLBl4A= +golang.org/x/oauth2 v0.0.0-20201109201403-9fd604954f58/go.mod h1:KelEdhl1UZF7XfJ4dDtk6s++YSgaE7mD/BuKKDLBl4A= +golang.org/x/oauth2 v0.0.0-20201208152858-08078c50e5b5/go.mod h1:KelEdhl1UZF7XfJ4dDtk6s++YSgaE7mD/BuKKDLBl4A= +golang.org/x/oauth2 v0.0.0-20210218202405-ba52d332ba99/go.mod h1:KelEdhl1UZF7XfJ4dDtk6s++YSgaE7mD/BuKKDLBl4A= +golang.org/x/oauth2 v0.0.0-20210220000619-9bb904979d93/go.mod h1:KelEdhl1UZF7XfJ4dDtk6s++YSgaE7mD/BuKKDLBl4A= +golang.org/x/oauth2 v0.0.0-20210313182246-cd4f82c27b84/go.mod h1:KelEdhl1UZF7XfJ4dDtk6s++YSgaE7mD/BuKKDLBl4A= +golang.org/x/oauth2 v0.0.0-20210514164344-f6687ab2804c/go.mod h1:KelEdhl1UZF7XfJ4dDtk6s++YSgaE7mD/BuKKDLBl4A= +golang.org/x/oauth2 v0.0.0-20210628180205-a41e5a781914/go.mod h1:KelEdhl1UZF7XfJ4dDtk6s++YSgaE7mD/BuKKDLBl4A= +golang.org/x/oauth2 v0.0.0-20210805134026-6f1e6394065a/go.mod h1:KelEdhl1UZF7XfJ4dDtk6s++YSgaE7mD/BuKKDLBl4A= +golang.org/x/oauth2 v0.0.0-20210819190943-2bc19b11175f/go.mod h1:KelEdhl1UZF7XfJ4dDtk6s++YSgaE7mD/BuKKDLBl4A= +golang.org/x/oauth2 v0.0.0-20211104180415-d3ed0bb246c8/go.mod h1:KelEdhl1UZF7XfJ4dDtk6s++YSgaE7mD/BuKKDLBl4A= +golang.org/x/oauth2 v0.0.0-20220223155221-ee480838109b/go.mod h1:DAh4E804XQdzx2j+YRIaUnCqCV2RuMz24cGBJ5QYIrc= +golang.org/x/oauth2 v0.0.0-20220309155454-6242fa91716a/go.mod h1:DAh4E804XQdzx2j+YRIaUnCqCV2RuMz24cGBJ5QYIrc= +golang.org/x/oauth2 v0.0.0-20220411215720-9780585627b5/go.mod h1:DAh4E804XQdzx2j+YRIaUnCqCV2RuMz24cGBJ5QYIrc= +golang.org/x/oauth2 
v0.0.0-20220608161450-d0670ef3b1eb/go.mod h1:jaDAt6Dkxork7LmZnYtzbRWj0W47D86a3TGe0YHBvmE= +golang.org/x/oauth2 v0.0.0-20220622183110-fd043fe589d2/go.mod h1:jaDAt6Dkxork7LmZnYtzbRWj0W47D86a3TGe0YHBvmE= +golang.org/x/oauth2 v0.0.0-20220822191816-0ebed06d0094/go.mod h1:h4gKUeWbJ4rQPri7E0u6Gs4e9Ri2zaLxzw5DI5XGrYg= +golang.org/x/oauth2 v0.0.0-20220909003341-f21342109be1/go.mod h1:h4gKUeWbJ4rQPri7E0u6Gs4e9Ri2zaLxzw5DI5XGrYg= +golang.org/x/oauth2 v0.0.0-20221014153046-6fdb5e3db783/go.mod h1:h4gKUeWbJ4rQPri7E0u6Gs4e9Ri2zaLxzw5DI5XGrYg= +golang.org/x/oauth2 v0.1.0/go.mod h1:G9FE4dLTsbXUu90h/Pf85g4w1D+SSAgR+q46nJZ8M4A= +golang.org/x/oauth2 v0.8.0 h1:6dkIjl3j3LtZ/O3sTgZTMsLKSftL/B8Zgq4huOIIUu8= +golang.org/x/oauth2 v0.8.0/go.mod h1:yr7u4HXZRm1R1kBWqr/xKNqewf0plRYoB7sla+BCIXE= +golang.org/x/sync v0.0.0-20180314180146-1d60e4601c6f/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= +golang.org/x/sync v0.0.0-20181108010431-42b317875d0f/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= +golang.org/x/sync v0.0.0-20181221193216-37e7f081c4d4/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= +golang.org/x/sync v0.0.0-20190227155943-e225da77a7e6/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= +golang.org/x/sync v0.0.0-20190423024810-112230192c58/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= +golang.org/x/sync v0.0.0-20190911185100-cd5d95a43a6e/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= +golang.org/x/sync v0.0.0-20200317015054-43a5402ce75a/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= +golang.org/x/sync v0.0.0-20200625203802-6e8e738ad208/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= +golang.org/x/sync v0.0.0-20201020160332-67f06af15bc9/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= +golang.org/x/sync v0.0.0-20201207232520-09787c993a3a/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= +golang.org/x/sync v0.0.0-20210220032951-036812b2e83c/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= +golang.org/x/sync 
v0.0.0-20220601150217-0de741cfad7f/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= +golang.org/x/sync v0.0.0-20220722155255-886fb9371eb4/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= +golang.org/x/sync v0.0.0-20220929204114-8fcdb60fdcc0/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= +golang.org/x/sys v0.0.0-20180830151530-49385e6e1522/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY= +golang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY= +golang.org/x/sys v0.0.0-20190312061237-fead79001313/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20190412213103-97732733099d/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20190502145724-3ef323f4f1fd/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20190502175342-a43fa875dd82/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20190507160741-ecd444e8653b/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20190606165138-5da285871e9c/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20190624142023-c5567b49c5d0/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20190726091711-fc99dfbffb4e/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20191001151750-bb3f8db39f24/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20191204072324-ce4227a45e2e/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20191228213918-04cbcbbfeed8/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20200113162924-86b910548bc1/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20200122134326-e047566fdf82/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20200202164722-d101bd2416d5/go.mod 
h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20200212091648-12a6c2dcc1e4/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20200223170610-d5e6a3e2c0ae/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20200302150141-5c8b2ff67527/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20200323222414-85ca7c5b95cd/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20200331124033-c3d80250170d/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20200501052902-10377860bb8e/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20200511232937-7e40ca221e25/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20200515095857-1151b9dac4a9/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20200523222454-059865788121/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20200803210538-64077c9b5642/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20200905004654-be1d3432aa8f/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20200930185726-fdedc70b468f/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20201119102817-f84b799fce68/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20201201145000-ef89a241ccb3/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20210104204734-6f8348627aad/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20210119212857-b64e53b001e4/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20210220050731-9a76102bfb43/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20210305230114-8fe3ee5dd75b/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= 
+golang.org/x/sys v0.0.0-20210315160823-c6e025ad8005/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20210320140829-1e4c9ba3b0c4/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20210330210617-4fbd30eecc44/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20210423082822-04245dca01da/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20210510120138-977fb7262007/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.0.0-20210514084401-e8d321eab015/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.0.0-20210603125802-9665404d3644/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.0.0-20210615035016-665e8c7367d1/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.0.0-20210616094352-59db8d763f22/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.0.0-20210630005230-0f9fa26af87c/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.0.0-20210806184541-e5e7981a1069/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.0.0-20210823070655-63515b42dcdf/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.0.0-20210908233432-aa78b53d3365/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.0.0-20211124211545-fe61309f8881/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.0.0-20211210111614-af8b64212486/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.0.0-20211216021012-1d35b9e2eb4e/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.0.0-20220128215802-99c3d69c2c27/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.0.0-20220209214540-3681064d5158/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys 
v0.0.0-20220227234510-4e6760a101f9/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.0.0-20220328115105-d36c6a25d886/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.0.0-20220412211240-33da011f77ad/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.0.0-20220502124256-b6088ccd6cba/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.0.0-20220503163025-988cb79eb6c6/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.0.0-20220520151302-bc2c85ada10a/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.0.0-20220610221304-9f5ed59c137d/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.0.0-20220615213510-4f61da869c0c/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.0.0-20220624220833-87e55d714810/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.0.0-20220722155257-8c9f86f7a55f/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.0.0-20220728004956-3c1f35247d10/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.1.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= +golang.org/x/sys v0.18.0 h1:DBdB3niSjOA/O0blCZBqDefyWNYveAYMNF1Wum0DYQ4= +golang.org/x/sys v0.18.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA= +golang.org/x/term v0.0.0-20201126162022-7de9c90e9dd1/go.mod h1:bj7SfCRtBDWHUb9snDiAeCFNEtKQo2Wmx5Cou7ajbmo= +golang.org/x/term v0.0.0-20210927222741-03fcf44c2211/go.mod h1:jbD1KX2456YbFQfuXm/mYQcufACuNUgVhRMnK/tPxf8= +golang.org/x/term v0.1.0/go.mod h1:jbD1KX2456YbFQfuXm/mYQcufACuNUgVhRMnK/tPxf8= +golang.org/x/term v0.18.0 h1:FcHjZXDMxI8mM3nwhX9HlKop4C0YQvCVCdwYl2wOtE8= +golang.org/x/term v0.18.0/go.mod h1:ILwASektA3OnRv7amZ1xhE/KTR+u50pbXfZ03+6Nx58= +golang.org/x/text v0.0.0-20170915032832-14c0d48ead0c/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ= +golang.org/x/text 
v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ= +golang.org/x/text v0.3.1-0.20180807135948-17ff2d5776d2/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ= +golang.org/x/text v0.3.2/go.mod h1:bEr9sfX3Q8Zfm5fL9x+3itogRgK3+ptLWKqgva+5dAk= +golang.org/x/text v0.3.3/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ= +golang.org/x/text v0.3.4/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ= +golang.org/x/text v0.3.5/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ= +golang.org/x/text v0.3.6/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ= +golang.org/x/text v0.3.7/go.mod h1:u+2+/6zg+i71rQMx5EYifcz6MCKuco9NR6JIITiCfzQ= +golang.org/x/text v0.4.0/go.mod h1:mrYo+phRRbMaCq/xk9113O4dZlRixOauAjOtrjsXDZ8= +golang.org/x/text v0.14.0 h1:ScX5w1eTa3QqT8oi6+ziP7dTV1S2+ALU0bI+0zXKWiQ= +golang.org/x/text v0.14.0/go.mod h1:18ZOQIKpY8NJVqYksKHtTdi31H5itFRjB5/qKTNYzSU= +golang.org/x/time v0.0.0-20181108054448-85acf8d2951c/go.mod h1:tRJNPiyCQ0inRvYxbN9jk5I+vvW/OXSQhTDSoE431IQ= +golang.org/x/time v0.0.0-20190308202827-9d24e82272b4/go.mod h1:tRJNPiyCQ0inRvYxbN9jk5I+vvW/OXSQhTDSoE431IQ= +golang.org/x/time v0.0.0-20191024005414-555d28b269f0/go.mod h1:tRJNPiyCQ0inRvYxbN9jk5I+vvW/OXSQhTDSoE431IQ= +golang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ= +golang.org/x/tools v0.0.0-20190114222345-bf090417da8b/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ= +golang.org/x/tools v0.0.0-20190226205152-f727befe758c/go.mod h1:9Yl7xja0Znq3iFh3HoIrodX9oNMXvdceNzlUR8zjMvY= +golang.org/x/tools v0.0.0-20190311212946-11955173bddd/go.mod h1:LCzVGOaR6xXOjkQ3onu1FJEFr0SW1gC7cKk1uF8kGRs= +golang.org/x/tools v0.0.0-20190312151545-0bb0c0a6e846/go.mod h1:LCzVGOaR6xXOjkQ3onu1FJEFr0SW1gC7cKk1uF8kGRs= +golang.org/x/tools v0.0.0-20190312170243-e65039ee4138/go.mod h1:LCzVGOaR6xXOjkQ3onu1FJEFr0SW1gC7cKk1uF8kGRs= +golang.org/x/tools v0.0.0-20190425150028-36563e24a262/go.mod h1:RgjU9mgBXZiqYHBnxXauZ1Gv1EHHAz9KjViQ78xBX0Q= 
+golang.org/x/tools v0.0.0-20190506145303-2d16b83fe98c/go.mod h1:RgjU9mgBXZiqYHBnxXauZ1Gv1EHHAz9KjViQ78xBX0Q= +golang.org/x/tools v0.0.0-20190524140312-2c0ae7006135/go.mod h1:RgjU9mgBXZiqYHBnxXauZ1Gv1EHHAz9KjViQ78xBX0Q= +golang.org/x/tools v0.0.0-20190606124116-d0a3d012864b/go.mod h1:/rFqwRUd4F7ZHNgwSSTFct+R/Kf4OFW1sUzUTQQTgfc= +golang.org/x/tools v0.0.0-20190621195816-6e04913cbbac/go.mod h1:/rFqwRUd4F7ZHNgwSSTFct+R/Kf4OFW1sUzUTQQTgfc= +golang.org/x/tools v0.0.0-20190628153133-6cdbf07be9d0/go.mod h1:/rFqwRUd4F7ZHNgwSSTFct+R/Kf4OFW1sUzUTQQTgfc= +golang.org/x/tools v0.0.0-20190816200558-6889da9d5479/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo= +golang.org/x/tools v0.0.0-20190911174233-4f2ddba30aff/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo= +golang.org/x/tools v0.0.0-20191012152004-8de300cfc20a/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo= +golang.org/x/tools v0.0.0-20191113191852-77e3bb0ad9e7/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo= +golang.org/x/tools v0.0.0-20191115202509-3a792d9c32b2/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo= +golang.org/x/tools v0.0.0-20191119224855-298f0cb1881e/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo= +golang.org/x/tools v0.0.0-20191125144606-a911d9008d1f/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo= +golang.org/x/tools v0.0.0-20191130070609-6e064ea0cf2d/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo= +golang.org/x/tools v0.0.0-20191216173652-a0e659d51361/go.mod h1:TB2adYChydJhpapKDTa4BR/hXlZSLoq2Wpct/0txZ28= +golang.org/x/tools v0.0.0-20191227053925-7b8e75db28f4/go.mod h1:TB2adYChydJhpapKDTa4BR/hXlZSLoq2Wpct/0txZ28= +golang.org/x/tools v0.0.0-20200117161641-43d50277825c/go.mod h1:TB2adYChydJhpapKDTa4BR/hXlZSLoq2Wpct/0txZ28= +golang.org/x/tools v0.0.0-20200122220014-bf1340f18c4a/go.mod h1:TB2adYChydJhpapKDTa4BR/hXlZSLoq2Wpct/0txZ28= +golang.org/x/tools v0.0.0-20200130002326-2f3ba24bd6e7/go.mod h1:TB2adYChydJhpapKDTa4BR/hXlZSLoq2Wpct/0txZ28= +golang.org/x/tools 
v0.0.0-20200204074204-1cc6d1ef6c74/go.mod h1:TB2adYChydJhpapKDTa4BR/hXlZSLoq2Wpct/0txZ28= +golang.org/x/tools v0.0.0-20200207183749-b753a1ba74fa/go.mod h1:TB2adYChydJhpapKDTa4BR/hXlZSLoq2Wpct/0txZ28= +golang.org/x/tools v0.0.0-20200212150539-ea181f53ac56/go.mod h1:TB2adYChydJhpapKDTa4BR/hXlZSLoq2Wpct/0txZ28= +golang.org/x/tools v0.0.0-20200224181240-023911ca70b2/go.mod h1:TB2adYChydJhpapKDTa4BR/hXlZSLoq2Wpct/0txZ28= +golang.org/x/tools v0.0.0-20200227222343-706bc42d1f0d/go.mod h1:TB2adYChydJhpapKDTa4BR/hXlZSLoq2Wpct/0txZ28= +golang.org/x/tools v0.0.0-20200304193943-95d2e580d8eb/go.mod h1:o4KQGtdN14AW+yjsvvwRTJJuXz8XRtIHtEnmAXLyFUw= +golang.org/x/tools v0.0.0-20200312045724-11d5b4c81c7d/go.mod h1:o4KQGtdN14AW+yjsvvwRTJJuXz8XRtIHtEnmAXLyFUw= +golang.org/x/tools v0.0.0-20200331025713-a30bf2db82d4/go.mod h1:Sl4aGygMT6LrqrWclx+PTx3U+LnKx/seiNR+3G19Ar8= +golang.org/x/tools v0.0.0-20200501065659-ab2804fb9c9d/go.mod h1:EkVYQZoAsY45+roYkvgYkIh4xh/qjgUK9TdY2XT94GE= +golang.org/x/tools v0.0.0-20200512131952-2bc93b1c0c88/go.mod h1:EkVYQZoAsY45+roYkvgYkIh4xh/qjgUK9TdY2XT94GE= +golang.org/x/tools v0.0.0-20200515010526-7d3b6ebf133d/go.mod h1:EkVYQZoAsY45+roYkvgYkIh4xh/qjgUK9TdY2XT94GE= +golang.org/x/tools v0.0.0-20200618134242-20370b0cb4b2/go.mod h1:EkVYQZoAsY45+roYkvgYkIh4xh/qjgUK9TdY2XT94GE= +golang.org/x/tools v0.0.0-20200729194436-6467de6f59a7/go.mod h1:njjCfa9FT2d7l9Bc6FUM5FLjQPp3cFF28FI3qnDFljA= +golang.org/x/tools v0.0.0-20200804011535-6c149bb5ef0d/go.mod h1:njjCfa9FT2d7l9Bc6FUM5FLjQPp3cFF28FI3qnDFljA= +golang.org/x/tools v0.0.0-20200825202427-b303f430e36d/go.mod h1:njjCfa9FT2d7l9Bc6FUM5FLjQPp3cFF28FI3qnDFljA= +golang.org/x/tools v0.0.0-20200904185747-39188db58858/go.mod h1:Cj7w3i3Rnn0Xh82ur9kSqwfTHTeVxaDqrfMjpcNT6bE= +golang.org/x/tools v0.0.0-20201110124207-079ba7bd75cd/go.mod h1:emZCQorbCU4vsT4fOWvOPXz4eW1wZW4PmDk9uLelYpA= +golang.org/x/tools v0.0.0-20201201161351-ac6f37ff4c2a/go.mod h1:emZCQorbCU4vsT4fOWvOPXz4eW1wZW4PmDk9uLelYpA= +golang.org/x/tools 
v0.0.0-20201208233053-a543418bbed2/go.mod h1:emZCQorbCU4vsT4fOWvOPXz4eW1wZW4PmDk9uLelYpA= +golang.org/x/tools v0.0.0-20210105154028-b0ab187a4818/go.mod h1:emZCQorbCU4vsT4fOWvOPXz4eW1wZW4PmDk9uLelYpA= +golang.org/x/tools v0.1.0/go.mod h1:xkSsbof2nBLbhDlRMhhhyNLN/zl3eTqcnHD5viDpcZ0= +golang.org/x/tools v0.1.1/go.mod h1:o0xws9oXOQQZyjljx8fwUC0k7L1pTE6eaCbjGeHmOkk= +golang.org/x/tools v0.1.2/go.mod h1:o0xws9oXOQQZyjljx8fwUC0k7L1pTE6eaCbjGeHmOkk= +golang.org/x/tools v0.1.3/go.mod h1:o0xws9oXOQQZyjljx8fwUC0k7L1pTE6eaCbjGeHmOkk= +golang.org/x/tools v0.1.4/go.mod h1:o0xws9oXOQQZyjljx8fwUC0k7L1pTE6eaCbjGeHmOkk= +golang.org/x/tools v0.1.5/go.mod h1:o0xws9oXOQQZyjljx8fwUC0k7L1pTE6eaCbjGeHmOkk= +golang.org/x/tools v0.1.12/go.mod h1:hNGJHUnrk76NpqgfD5Aqm5Crs+Hm0VOH/i9J2+nxYbc= +golang.org/x/xerrors v0.0.0-20190717185122-a985d3407aa7/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0= +golang.org/x/xerrors v0.0.0-20191011141410-1b5146add898/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0= +golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0= +golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0= +golang.org/x/xerrors v0.0.0-20220411194840-2f41105eb62f/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0= +golang.org/x/xerrors v0.0.0-20220517211312-f3a8303e98df/go.mod h1:K8+ghG5WaK9qNqU5K3HdILfMLy1f3aNYFI/wnl100a8= +golang.org/x/xerrors v0.0.0-20220609144429-65e65417b02f/go.mod h1:K8+ghG5WaK9qNqU5K3HdILfMLy1f3aNYFI/wnl100a8= +golang.org/x/xerrors v0.0.0-20220907171357-04be3eba64a2 h1:H2TDz8ibqkAF6YGhCdN3jS9O0/s90v0rJh3X/OLHEUk= +golang.org/x/xerrors v0.0.0-20220907171357-04be3eba64a2/go.mod h1:K8+ghG5WaK9qNqU5K3HdILfMLy1f3aNYFI/wnl100a8= +google.golang.org/api v0.4.0/go.mod h1:8k5glujaEP+g9n7WNsDg8QP6cUVNI86fCNMcbazEtwE= +google.golang.org/api v0.7.0/go.mod h1:WtwebWUNSVBH/HAw79HIFXZNqEvBhG+Ra+ax0hx3E3M= +google.golang.org/api v0.8.0/go.mod 
h1:o4eAsZoiT+ibD93RtjEohWalFOjRDx6CVaqeizhEnKg= +google.golang.org/api v0.9.0/go.mod h1:o4eAsZoiT+ibD93RtjEohWalFOjRDx6CVaqeizhEnKg= +google.golang.org/api v0.13.0/go.mod h1:iLdEw5Ide6rF15KTC1Kkl0iskquN2gFfn9o9XIsbkAI= +google.golang.org/api v0.14.0/go.mod h1:iLdEw5Ide6rF15KTC1Kkl0iskquN2gFfn9o9XIsbkAI= +google.golang.org/api v0.15.0/go.mod h1:iLdEw5Ide6rF15KTC1Kkl0iskquN2gFfn9o9XIsbkAI= +google.golang.org/api v0.17.0/go.mod h1:BwFmGc8tA3vsd7r/7kR8DY7iEEGSU04BFxCo5jP/sfE= +google.golang.org/api v0.18.0/go.mod h1:BwFmGc8tA3vsd7r/7kR8DY7iEEGSU04BFxCo5jP/sfE= +google.golang.org/api v0.19.0/go.mod h1:BwFmGc8tA3vsd7r/7kR8DY7iEEGSU04BFxCo5jP/sfE= +google.golang.org/api v0.20.0/go.mod h1:BwFmGc8tA3vsd7r/7kR8DY7iEEGSU04BFxCo5jP/sfE= +google.golang.org/api v0.22.0/go.mod h1:BwFmGc8tA3vsd7r/7kR8DY7iEEGSU04BFxCo5jP/sfE= +google.golang.org/api v0.24.0/go.mod h1:lIXQywCXRcnZPGlsd8NbLnOjtAoL6em04bJ9+z0MncE= +google.golang.org/api v0.28.0/go.mod h1:lIXQywCXRcnZPGlsd8NbLnOjtAoL6em04bJ9+z0MncE= +google.golang.org/api v0.29.0/go.mod h1:Lcubydp8VUV7KeIHD9z2Bys/sm/vGKnG1UHuDBSrHWM= +google.golang.org/api v0.30.0/go.mod h1:QGmEvQ87FHZNiUVJkT14jQNYJ4ZJjdRF23ZXz5138Fc= +google.golang.org/api v0.35.0/go.mod h1:/XrVsuzM0rZmrsbjJutiuftIzeuTQcEeaYcSk/mQ1dg= +google.golang.org/api v0.36.0/go.mod h1:+z5ficQTmoYpPn8LCUNVpK5I7hwkpjbcgqA7I34qYtE= +google.golang.org/api v0.40.0/go.mod h1:fYKFpnQN0DsDSKRVRcQSDQNtqWPfM9i+zNPxepjRCQ8= +google.golang.org/api v0.41.0/go.mod h1:RkxM5lITDfTzmyKFPt+wGrCJbVfniCr2ool8kTBzRTU= +google.golang.org/api v0.43.0/go.mod h1:nQsDGjRXMo4lvh5hP0TKqF244gqhGcr/YSIykhUk/94= +google.golang.org/api v0.47.0/go.mod h1:Wbvgpq1HddcWVtzsVLyfLp8lDg6AA241LmgIL59tHXo= +google.golang.org/api v0.48.0/go.mod h1:71Pr1vy+TAZRPkPs/xlCf5SsU8WjuAWv1Pfjbtukyy4= +google.golang.org/api v0.50.0/go.mod h1:4bNT5pAuq5ji4SRZm+5QIkjny9JAyVD/3gaSihNefaw= +google.golang.org/api v0.51.0/go.mod h1:t4HdrdoNgyN5cbEfm7Lum0lcLDLiise1F8qDKX00sOU= +google.golang.org/api v0.54.0/go.mod 
h1:7C4bFFOvVDGXjfDTAsgGwDgAxRDeQ4X8NvUedIt6z3k= +google.golang.org/api v0.55.0/go.mod h1:38yMfeP1kfjsl8isn0tliTjIb1rJXcQi4UXlbqivdVE= +google.golang.org/api v0.56.0/go.mod h1:38yMfeP1kfjsl8isn0tliTjIb1rJXcQi4UXlbqivdVE= +google.golang.org/api v0.57.0/go.mod h1:dVPlbZyBo2/OjBpmvNdpn2GRm6rPy75jyU7bmhdrMgI= +google.golang.org/api v0.61.0/go.mod h1:xQRti5UdCmoCEqFxcz93fTl338AVqDgyaDRuOZ3hg9I= +google.golang.org/api v0.63.0/go.mod h1:gs4ij2ffTRXwuzzgJl/56BdwJaA194ijkfn++9tDuPo= +google.golang.org/api v0.67.0/go.mod h1:ShHKP8E60yPsKNw/w8w+VYaj9H6buA5UqDp8dhbQZ6g= +google.golang.org/api v0.70.0/go.mod h1:Bs4ZM2HGifEvXwd50TtW70ovgJffJYw2oRCOFU/SkfA= +google.golang.org/api v0.71.0/go.mod h1:4PyU6e6JogV1f9eA4voyrTY2batOLdgZ5qZ5HOCc4j8= +google.golang.org/api v0.74.0/go.mod h1:ZpfMZOVRMywNyvJFeqL9HRWBgAuRfSjJFpe9QtRRyDs= +google.golang.org/api v0.75.0/go.mod h1:pU9QmyHLnzlpar1Mjt4IbapUCy8J+6HD6GeELN69ljA= +google.golang.org/api v0.77.0/go.mod h1:pU9QmyHLnzlpar1Mjt4IbapUCy8J+6HD6GeELN69ljA= +google.golang.org/api v0.78.0/go.mod h1:1Sg78yoMLOhlQTeF+ARBoytAcH1NNyyl390YMy6rKmw= +google.golang.org/api v0.80.0/go.mod h1:xY3nI94gbvBrE0J6NHXhxOmW97HG7Khjkku6AFB3Hyg= +google.golang.org/api v0.84.0/go.mod h1:NTsGnUFJMYROtiquksZHBWtHfeMC7iYthki7Eq3pa8o= +google.golang.org/api v0.85.0/go.mod h1:AqZf8Ep9uZ2pyTvgL+x0D3Zt0eoT9b5E8fmzfu6FO2g= +google.golang.org/api v0.90.0/go.mod h1:+Sem1dnrKlrXMR/X0bPnMWyluQe4RsNoYfmNLhOIkzw= +google.golang.org/api v0.93.0/go.mod h1:+Sem1dnrKlrXMR/X0bPnMWyluQe4RsNoYfmNLhOIkzw= +google.golang.org/api v0.95.0/go.mod h1:eADj+UBuxkh5zlrSntJghuNeg8HwQ1w5lTKkuqaETEI= +google.golang.org/api v0.96.0/go.mod h1:w7wJQLTM+wvQpNf5JyEcBoxK0RH7EDrh/L4qfsuJ13s= +google.golang.org/api v0.97.0/go.mod h1:w7wJQLTM+wvQpNf5JyEcBoxK0RH7EDrh/L4qfsuJ13s= +google.golang.org/api v0.98.0/go.mod h1:w7wJQLTM+wvQpNf5JyEcBoxK0RH7EDrh/L4qfsuJ13s= +google.golang.org/api v0.100.0/go.mod h1:ZE3Z2+ZOr87Rx7dqFsdRQkRBk36kDtp/h+QpHbB7a70= +google.golang.org/api v0.114.0 
h1:1xQPji6cO2E2vLiI+C/XiFAnsn1WV3mjaEwGLhi3grE= +google.golang.org/api v0.114.0/go.mod h1:ifYI2ZsFK6/uGddGfAD5BMxlnkBqCmqHSDUVi45N5Yg= +google.golang.org/appengine v1.1.0/go.mod h1:EbEs0AVv82hx2wNQdGPgUI5lhzA/G0D9YwlJXL52JkM= +google.golang.org/appengine v1.4.0/go.mod h1:xpcJRLb0r/rnEns0DIKYYv+WjYCduHsrkT7/EB5XEv4= +google.golang.org/appengine v1.5.0/go.mod h1:xpcJRLb0r/rnEns0DIKYYv+WjYCduHsrkT7/EB5XEv4= +google.golang.org/appengine v1.6.1/go.mod h1:i06prIuMbXzDqacNJfV5OdTW448YApPu5ww/cMBSeb0= +google.golang.org/appengine v1.6.5/go.mod h1:8WjMMxjGQR8xUklV/ARdw2HLXBOI7O7uCIDZVag1xfc= +google.golang.org/appengine v1.6.6/go.mod h1:8WjMMxjGQR8xUklV/ARdw2HLXBOI7O7uCIDZVag1xfc= +google.golang.org/appengine v1.6.7 h1:FZR1q0exgwxzPzp/aF+VccGrSfxfPpkBqjIIEq3ru6c= +google.golang.org/appengine v1.6.7/go.mod h1:8WjMMxjGQR8xUklV/ARdw2HLXBOI7O7uCIDZVag1xfc= +google.golang.org/genproto v0.0.0-20180817151627-c66870c02cf8/go.mod h1:JiN7NxoALGmiZfu7CAH4rXhgtRTLTxftemlI0sWmxmc= +google.golang.org/genproto v0.0.0-20190307195333-5fe7a883aa19/go.mod h1:VzzqZJRnGkLBvHegQrXjBqPurQTc5/KpmUdxsrq26oE= +google.golang.org/genproto v0.0.0-20190418145605-e7d98fc518a7/go.mod h1:VzzqZJRnGkLBvHegQrXjBqPurQTc5/KpmUdxsrq26oE= +google.golang.org/genproto v0.0.0-20190425155659-357c62f0e4bb/go.mod h1:VzzqZJRnGkLBvHegQrXjBqPurQTc5/KpmUdxsrq26oE= +google.golang.org/genproto v0.0.0-20190502173448-54afdca5d873/go.mod h1:VzzqZJRnGkLBvHegQrXjBqPurQTc5/KpmUdxsrq26oE= +google.golang.org/genproto v0.0.0-20190801165951-fa694d86fc64/go.mod h1:DMBHOl98Agz4BDEuKkezgsaosCRResVns1a3J2ZsMNc= +google.golang.org/genproto v0.0.0-20190819201941-24fa4b261c55/go.mod h1:DMBHOl98Agz4BDEuKkezgsaosCRResVns1a3J2ZsMNc= +google.golang.org/genproto v0.0.0-20190911173649-1774047e7e51/go.mod h1:IbNlFCBrqXvoKpeg0TB2l7cyZUmoaFKYIwrEpbDKLA8= +google.golang.org/genproto v0.0.0-20191108220845-16a3f7862a1a/go.mod h1:n3cpQtvxv34hfy77yVDNjmbRyujviMdxYliBSkLhpCc= +google.golang.org/genproto v0.0.0-20191115194625-c23dd37a84c9/go.mod 
h1:n3cpQtvxv34hfy77yVDNjmbRyujviMdxYliBSkLhpCc= +google.golang.org/genproto v0.0.0-20191216164720-4f79533eabd1/go.mod h1:n3cpQtvxv34hfy77yVDNjmbRyujviMdxYliBSkLhpCc= +google.golang.org/genproto v0.0.0-20191230161307-f3c370f40bfb/go.mod h1:n3cpQtvxv34hfy77yVDNjmbRyujviMdxYliBSkLhpCc= +google.golang.org/genproto v0.0.0-20200115191322-ca5a22157cba/go.mod h1:n3cpQtvxv34hfy77yVDNjmbRyujviMdxYliBSkLhpCc= +google.golang.org/genproto v0.0.0-20200122232147-0452cf42e150/go.mod h1:n3cpQtvxv34hfy77yVDNjmbRyujviMdxYliBSkLhpCc= +google.golang.org/genproto v0.0.0-20200204135345-fa8e72b47b90/go.mod h1:GmwEX6Z4W5gMy59cAlVYjN9JhxgbQH6Gn+gFDQe2lzA= +google.golang.org/genproto v0.0.0-20200212174721-66ed5ce911ce/go.mod h1:55QSHmfGQM9UVYDPBsyGGes0y52j32PQ3BqQfXhyH3c= +google.golang.org/genproto v0.0.0-20200224152610-e50cd9704f63/go.mod h1:55QSHmfGQM9UVYDPBsyGGes0y52j32PQ3BqQfXhyH3c= +google.golang.org/genproto v0.0.0-20200228133532-8c2c7df3a383/go.mod h1:55QSHmfGQM9UVYDPBsyGGes0y52j32PQ3BqQfXhyH3c= +google.golang.org/genproto v0.0.0-20200305110556-506484158171/go.mod h1:55QSHmfGQM9UVYDPBsyGGes0y52j32PQ3BqQfXhyH3c= +google.golang.org/genproto v0.0.0-20200312145019-da6875a35672/go.mod h1:55QSHmfGQM9UVYDPBsyGGes0y52j32PQ3BqQfXhyH3c= +google.golang.org/genproto v0.0.0-20200331122359-1ee6d9798940/go.mod h1:55QSHmfGQM9UVYDPBsyGGes0y52j32PQ3BqQfXhyH3c= +google.golang.org/genproto v0.0.0-20200430143042-b979b6f78d84/go.mod h1:55QSHmfGQM9UVYDPBsyGGes0y52j32PQ3BqQfXhyH3c= +google.golang.org/genproto v0.0.0-20200511104702-f5ebc3bea380/go.mod h1:55QSHmfGQM9UVYDPBsyGGes0y52j32PQ3BqQfXhyH3c= +google.golang.org/genproto v0.0.0-20200513103714-09dca8ec2884/go.mod h1:55QSHmfGQM9UVYDPBsyGGes0y52j32PQ3BqQfXhyH3c= +google.golang.org/genproto v0.0.0-20200515170657-fc4c6c6a6587/go.mod h1:YsZOwe1myG/8QRHRsmBRE1LrgQY60beZKjly0O1fX9U= +google.golang.org/genproto v0.0.0-20200526211855-cb27e3aa2013/go.mod h1:NbSheEEYHJ7i3ixzK3sjbqSGDJWnxyFXZblF3eUsNvo= +google.golang.org/genproto 
v0.0.0-20200618031413-b414f8b61790/go.mod h1:jDfRM7FcilCzHH/e9qn6dsT145K34l5v+OpcnNgKAAA= +google.golang.org/genproto v0.0.0-20200729003335-053ba62fc06f/go.mod h1:FWY/as6DDZQgahTzZj3fqbO1CbirC29ZNUFHwi0/+no= +google.golang.org/genproto v0.0.0-20200804131852-c06518451d9c/go.mod h1:FWY/as6DDZQgahTzZj3fqbO1CbirC29ZNUFHwi0/+no= +google.golang.org/genproto v0.0.0-20200825200019-8632dd797987/go.mod h1:FWY/as6DDZQgahTzZj3fqbO1CbirC29ZNUFHwi0/+no= +google.golang.org/genproto v0.0.0-20200904004341-0bd0a958aa1d/go.mod h1:FWY/as6DDZQgahTzZj3fqbO1CbirC29ZNUFHwi0/+no= +google.golang.org/genproto v0.0.0-20201109203340-2640f1f9cdfb/go.mod h1:FWY/as6DDZQgahTzZj3fqbO1CbirC29ZNUFHwi0/+no= +google.golang.org/genproto v0.0.0-20201201144952-b05cb90ed32e/go.mod h1:FWY/as6DDZQgahTzZj3fqbO1CbirC29ZNUFHwi0/+no= +google.golang.org/genproto v0.0.0-20201210142538-e3217bee35cc/go.mod h1:FWY/as6DDZQgahTzZj3fqbO1CbirC29ZNUFHwi0/+no= +google.golang.org/genproto v0.0.0-20201214200347-8c77b98c765d/go.mod h1:FWY/as6DDZQgahTzZj3fqbO1CbirC29ZNUFHwi0/+no= +google.golang.org/genproto v0.0.0-20210222152913-aa3ee6e6a81c/go.mod h1:FWY/as6DDZQgahTzZj3fqbO1CbirC29ZNUFHwi0/+no= +google.golang.org/genproto v0.0.0-20210303154014-9728d6b83eeb/go.mod h1:FWY/as6DDZQgahTzZj3fqbO1CbirC29ZNUFHwi0/+no= +google.golang.org/genproto v0.0.0-20210310155132-4ce2db91004e/go.mod h1:FWY/as6DDZQgahTzZj3fqbO1CbirC29ZNUFHwi0/+no= +google.golang.org/genproto v0.0.0-20210319143718-93e7006c17a6/go.mod h1:FWY/as6DDZQgahTzZj3fqbO1CbirC29ZNUFHwi0/+no= +google.golang.org/genproto v0.0.0-20210329143202-679c6ae281ee/go.mod h1:9lPAdzaEmUacj36I+k7YKbEc5CXzPIeORRgDAUOu28A= +google.golang.org/genproto v0.0.0-20210402141018-6c239bbf2bb1/go.mod h1:9lPAdzaEmUacj36I+k7YKbEc5CXzPIeORRgDAUOu28A= +google.golang.org/genproto v0.0.0-20210513213006-bf773b8c8384/go.mod h1:P3QM42oQyzQSnHPnZ/vqoCdDmzH28fzWByN9asMeM8A= +google.golang.org/genproto v0.0.0-20210602131652-f16073e35f0c/go.mod h1:UODoCrxHCcBojKKwX1terBiRUaqAsFqJiF615XL43r0= 
+google.golang.org/genproto v0.0.0-20210604141403-392c879c8b08/go.mod h1:UODoCrxHCcBojKKwX1terBiRUaqAsFqJiF615XL43r0= +google.golang.org/genproto v0.0.0-20210608205507-b6d2f5bf0d7d/go.mod h1:UODoCrxHCcBojKKwX1terBiRUaqAsFqJiF615XL43r0= +google.golang.org/genproto v0.0.0-20210624195500-8bfb893ecb84/go.mod h1:SzzZ/N+nwJDaO1kznhnlzqS8ocJICar6hYhVyhi++24= +google.golang.org/genproto v0.0.0-20210713002101-d411969a0d9a/go.mod h1:AxrInvYm1dci+enl5hChSFPOmmUF1+uAa/UsgNRWd7k= +google.golang.org/genproto v0.0.0-20210716133855-ce7ef5c701ea/go.mod h1:AxrInvYm1dci+enl5hChSFPOmmUF1+uAa/UsgNRWd7k= +google.golang.org/genproto v0.0.0-20210728212813-7823e685a01f/go.mod h1:ob2IJxKrgPT52GcgX759i1sleT07tiKowYBGbczaW48= +google.golang.org/genproto v0.0.0-20210805201207-89edb61ffb67/go.mod h1:ob2IJxKrgPT52GcgX759i1sleT07tiKowYBGbczaW48= +google.golang.org/genproto v0.0.0-20210813162853-db860fec028c/go.mod h1:cFeNkxwySK631ADgubI+/XFU/xp8FD5KIVV4rj8UC5w= +google.golang.org/genproto v0.0.0-20210821163610-241b8fcbd6c8/go.mod h1:eFjDcFEctNawg4eG61bRv87N7iHBWyVhJu7u1kqDUXY= +google.golang.org/genproto v0.0.0-20210828152312-66f60bf46e71/go.mod h1:eFjDcFEctNawg4eG61bRv87N7iHBWyVhJu7u1kqDUXY= +google.golang.org/genproto v0.0.0-20210831024726-fe130286e0e2/go.mod h1:eFjDcFEctNawg4eG61bRv87N7iHBWyVhJu7u1kqDUXY= +google.golang.org/genproto v0.0.0-20210903162649-d08c68adba83/go.mod h1:eFjDcFEctNawg4eG61bRv87N7iHBWyVhJu7u1kqDUXY= +google.golang.org/genproto v0.0.0-20210909211513-a8c4777a87af/go.mod h1:eFjDcFEctNawg4eG61bRv87N7iHBWyVhJu7u1kqDUXY= +google.golang.org/genproto v0.0.0-20210924002016-3dee208752a0/go.mod h1:5CzLGKJ67TSI2B9POpiiyGha0AjJvZIUgRMt1dSmuhc= +google.golang.org/genproto v0.0.0-20211118181313-81c1377c94b1/go.mod h1:5CzLGKJ67TSI2B9POpiiyGha0AjJvZIUgRMt1dSmuhc= +google.golang.org/genproto v0.0.0-20211206160659-862468c7d6e0/go.mod h1:5CzLGKJ67TSI2B9POpiiyGha0AjJvZIUgRMt1dSmuhc= +google.golang.org/genproto v0.0.0-20211208223120-3a66f561d7aa/go.mod 
h1:5CzLGKJ67TSI2B9POpiiyGha0AjJvZIUgRMt1dSmuhc= +google.golang.org/genproto v0.0.0-20211221195035-429b39de9b1c/go.mod h1:5CzLGKJ67TSI2B9POpiiyGha0AjJvZIUgRMt1dSmuhc= +google.golang.org/genproto v0.0.0-20220126215142-9970aeb2e350/go.mod h1:5CzLGKJ67TSI2B9POpiiyGha0AjJvZIUgRMt1dSmuhc= +google.golang.org/genproto v0.0.0-20220207164111-0872dc986b00/go.mod h1:5CzLGKJ67TSI2B9POpiiyGha0AjJvZIUgRMt1dSmuhc= +google.golang.org/genproto v0.0.0-20220218161850-94dd64e39d7c/go.mod h1:kGP+zUP2Ddo0ayMi4YuN7C3WZyJvGLZRh8Z5wnAqvEI= +google.golang.org/genproto v0.0.0-20220222213610-43724f9ea8cf/go.mod h1:kGP+zUP2Ddo0ayMi4YuN7C3WZyJvGLZRh8Z5wnAqvEI= +google.golang.org/genproto v0.0.0-20220304144024-325a89244dc8/go.mod h1:kGP+zUP2Ddo0ayMi4YuN7C3WZyJvGLZRh8Z5wnAqvEI= +google.golang.org/genproto v0.0.0-20220310185008-1973136f34c6/go.mod h1:kGP+zUP2Ddo0ayMi4YuN7C3WZyJvGLZRh8Z5wnAqvEI= +google.golang.org/genproto v0.0.0-20220324131243-acbaeb5b85eb/go.mod h1:hAL49I2IFola2sVEjAn7MEwsja0xp51I0tlGAf9hz4E= +google.golang.org/genproto v0.0.0-20220407144326-9054f6ed7bac/go.mod h1:8w6bsBMX6yCPbAVTeqQHvzxW0EIFigd5lZyahWgyfDo= +google.golang.org/genproto v0.0.0-20220413183235-5e96e2839df9/go.mod h1:8w6bsBMX6yCPbAVTeqQHvzxW0EIFigd5lZyahWgyfDo= +google.golang.org/genproto v0.0.0-20220414192740-2d67ff6cf2b4/go.mod h1:8w6bsBMX6yCPbAVTeqQHvzxW0EIFigd5lZyahWgyfDo= +google.golang.org/genproto v0.0.0-20220421151946-72621c1f0bd3/go.mod h1:8w6bsBMX6yCPbAVTeqQHvzxW0EIFigd5lZyahWgyfDo= +google.golang.org/genproto v0.0.0-20220429170224-98d788798c3e/go.mod h1:8w6bsBMX6yCPbAVTeqQHvzxW0EIFigd5lZyahWgyfDo= +google.golang.org/genproto v0.0.0-20220502173005-c8bf987b8c21/go.mod h1:RAyBrSAP7Fh3Nc84ghnVLDPuV51xc9agzmm4Ph6i0Q4= +google.golang.org/genproto v0.0.0-20220505152158-f39f71e6c8f3/go.mod h1:RAyBrSAP7Fh3Nc84ghnVLDPuV51xc9agzmm4Ph6i0Q4= +google.golang.org/genproto v0.0.0-20220518221133-4f43b3371335/go.mod h1:RAyBrSAP7Fh3Nc84ghnVLDPuV51xc9agzmm4Ph6i0Q4= +google.golang.org/genproto 
v0.0.0-20220523171625-347a074981d8/go.mod h1:RAyBrSAP7Fh3Nc84ghnVLDPuV51xc9agzmm4Ph6i0Q4= +google.golang.org/genproto v0.0.0-20220608133413-ed9918b62aac/go.mod h1:KEWEmljWE5zPzLBa/oHl6DaEt9LmfH6WtH1OHIvleBA= +google.golang.org/genproto v0.0.0-20220616135557-88e70c0c3a90/go.mod h1:KEWEmljWE5zPzLBa/oHl6DaEt9LmfH6WtH1OHIvleBA= +google.golang.org/genproto v0.0.0-20220617124728-180714bec0ad/go.mod h1:KEWEmljWE5zPzLBa/oHl6DaEt9LmfH6WtH1OHIvleBA= +google.golang.org/genproto v0.0.0-20220624142145-8cd45d7dbd1f/go.mod h1:KEWEmljWE5zPzLBa/oHl6DaEt9LmfH6WtH1OHIvleBA= +google.golang.org/genproto v0.0.0-20220628213854-d9e0b6570c03/go.mod h1:KEWEmljWE5zPzLBa/oHl6DaEt9LmfH6WtH1OHIvleBA= +google.golang.org/genproto v0.0.0-20220722212130-b98a9ff5e252/go.mod h1:GkXuJDJ6aQ7lnJcRF+SJVgFdQhypqgl3LB1C9vabdRE= +google.golang.org/genproto v0.0.0-20220801145646-83ce21fca29f/go.mod h1:iHe1svFLAZg9VWz891+QbRMwUv9O/1Ww+/mngYeThbc= +google.golang.org/genproto v0.0.0-20220815135757-37a418bb8959/go.mod h1:dbqgFATTzChvnt+ujMdZwITVAJHFtfyN1qUhDqEiIlk= +google.golang.org/genproto v0.0.0-20220817144833-d7fd3f11b9b1/go.mod h1:dbqgFATTzChvnt+ujMdZwITVAJHFtfyN1qUhDqEiIlk= +google.golang.org/genproto v0.0.0-20220822174746-9e6da59bd2fc/go.mod h1:dbqgFATTzChvnt+ujMdZwITVAJHFtfyN1qUhDqEiIlk= +google.golang.org/genproto v0.0.0-20220829144015-23454907ede3/go.mod h1:dbqgFATTzChvnt+ujMdZwITVAJHFtfyN1qUhDqEiIlk= +google.golang.org/genproto v0.0.0-20220829175752-36a9c930ecbf/go.mod h1:dbqgFATTzChvnt+ujMdZwITVAJHFtfyN1qUhDqEiIlk= +google.golang.org/genproto v0.0.0-20220913154956-18f8339a66a5/go.mod h1:0Nb8Qy+Sk5eDzHnzlStwW3itdNaWoZA5XeSG+R3JHSo= +google.golang.org/genproto v0.0.0-20220914142337-ca0e39ece12f/go.mod h1:0Nb8Qy+Sk5eDzHnzlStwW3itdNaWoZA5XeSG+R3JHSo= +google.golang.org/genproto v0.0.0-20220915135415-7fd63a7952de/go.mod h1:0Nb8Qy+Sk5eDzHnzlStwW3itdNaWoZA5XeSG+R3JHSo= +google.golang.org/genproto v0.0.0-20220916172020-2692e8806bfa/go.mod h1:0Nb8Qy+Sk5eDzHnzlStwW3itdNaWoZA5XeSG+R3JHSo= 
+google.golang.org/genproto v0.0.0-20220919141832-68c03719ef51/go.mod h1:0Nb8Qy+Sk5eDzHnzlStwW3itdNaWoZA5XeSG+R3JHSo= +google.golang.org/genproto v0.0.0-20220920201722-2b89144ce006/go.mod h1:ht8XFiar2npT/g4vkk7O0WYS1sHOHbdujxbEp7CJWbw= +google.golang.org/genproto v0.0.0-20220926165614-551eb538f295/go.mod h1:woMGP53BroOrRY3xTxlbr8Y3eB/nzAvvFM83q7kG2OI= +google.golang.org/genproto v0.0.0-20220926220553-6981cbe3cfce/go.mod h1:woMGP53BroOrRY3xTxlbr8Y3eB/nzAvvFM83q7kG2OI= +google.golang.org/genproto v0.0.0-20221010155953-15ba04fc1c0e/go.mod h1:3526vdqwhZAwq4wsRUaVG555sVgsNmIjRtO7t/JH29U= +google.golang.org/genproto v0.0.0-20221014173430-6e2ab493f96b/go.mod h1:1vXfmgAz9N9Jx0QA82PqRVauvCz1SGSz739p0f183jM= +google.golang.org/genproto v0.0.0-20221014213838-99cd37c6964a/go.mod h1:1vXfmgAz9N9Jx0QA82PqRVauvCz1SGSz739p0f183jM= +google.golang.org/genproto v0.0.0-20221025140454-527a21cfbd71/go.mod h1:9qHF0xnpdSfF6knlcsnpzUu5y+rpwgbvsyGAZPBMg4s= +google.golang.org/genproto v0.0.0-20230410155749-daa745c078e1 h1:KpwkzHKEF7B9Zxg18WzOa7djJ+Ha5DzthMyZYQfEn2A= +google.golang.org/genproto v0.0.0-20230410155749-daa745c078e1/go.mod h1:nKE/iIaLqn2bQwXBg8f1g2Ylh6r5MN5CmZvuzZCgsCU= +google.golang.org/grpc v1.19.0/go.mod h1:mqu4LbDTu4XGKhr4mRzUsmM4RtVoemTSY81AxZiDr8c= +google.golang.org/grpc v1.20.1/go.mod h1:10oTOabMzJvdu6/UiuZezV6QK5dSlG84ov/aaiqXj38= +google.golang.org/grpc v1.21.1/go.mod h1:oYelfM1adQP15Ek0mdvEgi9Df8B9CZIaU1084ijfRaM= +google.golang.org/grpc v1.23.0/go.mod h1:Y5yQAOtifL1yxbo5wqy6BxZv8vAUGQwXBOALyacEbxg= +google.golang.org/grpc v1.25.1/go.mod h1:c3i+UQWmh7LiEpx4sFZnkU36qjEYZ0imhYfXVyQciAY= +google.golang.org/grpc v1.26.0/go.mod h1:qbnxyOmOxrQa7FizSgH+ReBfzJrCY1pSN7KXBS8abTk= +google.golang.org/grpc v1.27.0/go.mod h1:qbnxyOmOxrQa7FizSgH+ReBfzJrCY1pSN7KXBS8abTk= +google.golang.org/grpc v1.27.1/go.mod h1:qbnxyOmOxrQa7FizSgH+ReBfzJrCY1pSN7KXBS8abTk= +google.golang.org/grpc v1.28.0/go.mod h1:rpkK4SK4GF4Ach/+MFLZUBavHOvF2JJB5uozKKal+60= +google.golang.org/grpc v1.29.1/go.mod 
h1:itym6AZVZYACWQqET3MqgPpjcuV5QH3BxFS3IjizoKk= +google.golang.org/grpc v1.30.0/go.mod h1:N36X2cJ7JwdamYAgDz+s+rVMFjt3numwzf/HckM8pak= +google.golang.org/grpc v1.31.0/go.mod h1:N36X2cJ7JwdamYAgDz+s+rVMFjt3numwzf/HckM8pak= +google.golang.org/grpc v1.31.1/go.mod h1:N36X2cJ7JwdamYAgDz+s+rVMFjt3numwzf/HckM8pak= +google.golang.org/grpc v1.33.1/go.mod h1:fr5YgcSWrqhRRxogOsw7RzIpsmvOZ6IcH4kBYTpR3n0= +google.golang.org/grpc v1.33.2/go.mod h1:JMHMWHQWaTccqQQlmk3MJZS+GWXOdAesneDmEnv2fbc= +google.golang.org/grpc v1.34.0/go.mod h1:WotjhfgOW/POjDeRt8vscBtXq+2VjORFy659qA51WJ8= +google.golang.org/grpc v1.35.0/go.mod h1:qjiiYl8FncCW8feJPdyg3v6XW24KsRHe+dy9BAGRRjU= +google.golang.org/grpc v1.36.0/go.mod h1:qjiiYl8FncCW8feJPdyg3v6XW24KsRHe+dy9BAGRRjU= +google.golang.org/grpc v1.36.1/go.mod h1:qjiiYl8FncCW8feJPdyg3v6XW24KsRHe+dy9BAGRRjU= +google.golang.org/grpc v1.37.0/go.mod h1:NREThFqKR1f3iQ6oBuvc5LadQuXVGo9rkm5ZGrQdJfM= +google.golang.org/grpc v1.37.1/go.mod h1:NREThFqKR1f3iQ6oBuvc5LadQuXVGo9rkm5ZGrQdJfM= +google.golang.org/grpc v1.38.0/go.mod h1:NREThFqKR1f3iQ6oBuvc5LadQuXVGo9rkm5ZGrQdJfM= +google.golang.org/grpc v1.39.0/go.mod h1:PImNr+rS9TWYb2O4/emRugxiyHZ5JyHW5F+RPnDzfrE= +google.golang.org/grpc v1.39.1/go.mod h1:PImNr+rS9TWYb2O4/emRugxiyHZ5JyHW5F+RPnDzfrE= +google.golang.org/grpc v1.40.0/go.mod h1:ogyxbiOoUXAkP+4+xa6PZSE9DZgIHtSpzjDTB9KAK34= +google.golang.org/grpc v1.40.1/go.mod h1:ogyxbiOoUXAkP+4+xa6PZSE9DZgIHtSpzjDTB9KAK34= +google.golang.org/grpc v1.44.0/go.mod h1:k+4IHHFw41K8+bbowsex27ge2rCb65oeWqe4jJ590SU= +google.golang.org/grpc v1.45.0/go.mod h1:lN7owxKUQEqMfSyQikvvk5tf/6zMPsrK+ONuO11+0rQ= +google.golang.org/grpc v1.46.0/go.mod h1:vN9eftEi1UMyUsIF80+uQXhHjbXYbm0uXoFCACuMGWk= +google.golang.org/grpc v1.46.2/go.mod h1:vN9eftEi1UMyUsIF80+uQXhHjbXYbm0uXoFCACuMGWk= +google.golang.org/grpc v1.47.0/go.mod h1:vN9eftEi1UMyUsIF80+uQXhHjbXYbm0uXoFCACuMGWk= +google.golang.org/grpc v1.48.0/go.mod h1:vN9eftEi1UMyUsIF80+uQXhHjbXYbm0uXoFCACuMGWk= +google.golang.org/grpc 
v1.49.0/go.mod h1:ZgQEeidpAuNRZ8iRrlBKXZQP1ghovWIVhdJRyCDK+GI= +google.golang.org/grpc v1.50.0/go.mod h1:ZgQEeidpAuNRZ8iRrlBKXZQP1ghovWIVhdJRyCDK+GI= +google.golang.org/grpc v1.50.1/go.mod h1:ZgQEeidpAuNRZ8iRrlBKXZQP1ghovWIVhdJRyCDK+GI= +google.golang.org/grpc v1.56.3 h1:8I4C0Yq1EjstUzUJzpcRVbuYA2mODtEmpWiQoN/b2nc= +google.golang.org/grpc v1.56.3/go.mod h1:I9bI3vqKfayGqPUAwGdOSu7kt6oIJLixfffKrpXqQ9s= +google.golang.org/grpc/cmd/protoc-gen-go-grpc v1.1.0/go.mod h1:6Kw0yEErY5E/yWrBtf03jp27GLLJujG4z/JK95pnjjw= +google.golang.org/protobuf v0.0.0-20200109180630-ec00e32a8dfd/go.mod h1:DFci5gLYBciE7Vtevhsrf46CRTquxDuWsQurQQe4oz8= +google.golang.org/protobuf v0.0.0-20200221191635-4d8936d0db64/go.mod h1:kwYJMbMJ01Woi6D6+Kah6886xMZcty6N08ah7+eCXa0= +google.golang.org/protobuf v0.0.0-20200228230310-ab0ca4ff8a60/go.mod h1:cfTl7dwQJ+fmap5saPgwCLgHXTUD7jkjRqWcaiX5VyM= +google.golang.org/protobuf v1.20.1-0.20200309200217-e05f789c0967/go.mod h1:A+miEFZTKqfCUM6K7xSMQL9OKL/b6hQv+e19PK+JZNE= +google.golang.org/protobuf v1.21.0/go.mod h1:47Nbq4nVaFHyn7ilMalzfO3qCViNmqZ2kzikPIcrTAo= +google.golang.org/protobuf v1.22.0/go.mod h1:EGpADcykh3NcUnDUJcl1+ZksZNG86OlYog2l/sGQquU= +google.golang.org/protobuf v1.23.0/go.mod h1:EGpADcykh3NcUnDUJcl1+ZksZNG86OlYog2l/sGQquU= +google.golang.org/protobuf v1.23.1-0.20200526195155-81db48ad09cc/go.mod h1:EGpADcykh3NcUnDUJcl1+ZksZNG86OlYog2l/sGQquU= +google.golang.org/protobuf v1.24.0/go.mod h1:r/3tXBNzIEhYS9I1OUVjXDlt8tc493IdKGjtUeSXeh4= +google.golang.org/protobuf v1.25.0/go.mod h1:9JNX74DMeImyA3h4bdi1ymwjUzf21/xIlbajtzgsN7c= +google.golang.org/protobuf v1.26.0-rc.1/go.mod h1:jlhhOSvTdKEhbULTjvd4ARK9grFBp09yW+WbY/TyQbw= +google.golang.org/protobuf v1.26.0/go.mod h1:9q0QmTI4eRPtz6boOQmLYwt+qCgq0jsYwAQnmE0givc= +google.golang.org/protobuf v1.27.1/go.mod h1:9q0QmTI4eRPtz6boOQmLYwt+qCgq0jsYwAQnmE0givc= +google.golang.org/protobuf v1.28.0/go.mod h1:HV8QOd/L58Z+nl8r43ehVNZIU/HEI6OcFqwMG9pJV4I= +google.golang.org/protobuf v1.28.1/go.mod 
h1:HV8QOd/L58Z+nl8r43ehVNZIU/HEI6OcFqwMG9pJV4I= +google.golang.org/protobuf v1.33.0 h1:uNO2rsAINq/JlFpSdYEKIZ0uKD/R9cpdv0T+yoGwGmI= +google.golang.org/protobuf v1.33.0/go.mod h1:c6P6GXX6sHbq/GpV6MGZEdwhWPcYBgnhAHhKbcUYpos= +gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0= +gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127 h1:qIbj1fsPNlZgppZ+VLlY7N33q108Sa+fhmuc+sWQYwY= +gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0= +gopkg.in/cheggaaa/pb.v1 v1.0.27/go.mod h1:V/YB90LKu/1FcN3WVnfiiE5oMCibMjukxqG/qStrOgw= +gopkg.in/errgo.v2 v2.1.0/go.mod h1:hNsd1EY+bozCKY1Ytp96fpM3vjJbqLJn88ws8XvfDNI= +gopkg.in/yaml.v2 v2.2.2/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI= +gopkg.in/yaml.v2 v2.2.3/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI= +gopkg.in/yaml.v2 v2.2.8/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI= +gopkg.in/yaml.v2 v2.4.0 h1:D8xgwECY7CYvx+Y2n4sBz93Jn9JRvxdiyyo8CTfuKaY= +gopkg.in/yaml.v2 v2.4.0/go.mod h1:RDklbk79AGWmwhnvt/jBztapEOGDOx6ZbXqjP6csGnQ= +gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= +gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA= +gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= +honnef.co/go/tools v0.0.0-20190102054323-c2f93a96b099/go.mod h1:rf3lG4BRIbNafJWhAfAdb/ePZxsR/4RtNHQocxwk9r4= +honnef.co/go/tools v0.0.0-20190106161140-3f1c8253044a/go.mod h1:rf3lG4BRIbNafJWhAfAdb/ePZxsR/4RtNHQocxwk9r4= +honnef.co/go/tools v0.0.0-20190418001031-e561f6794a2a/go.mod h1:rf3lG4BRIbNafJWhAfAdb/ePZxsR/4RtNHQocxwk9r4= +honnef.co/go/tools v0.0.0-20190523083050-ea95bdfd59fc/go.mod h1:rf3lG4BRIbNafJWhAfAdb/ePZxsR/4RtNHQocxwk9r4= +honnef.co/go/tools v0.0.1-2019.2.3/go.mod h1:a3bituU0lyd329TUQxRnasdCoJDkEUEAqEt0JzvZhAg= +honnef.co/go/tools v0.0.1-2020.1.3/go.mod h1:X/FiERA/W4tHapMX5mGpAtMSVEeEUOyHaw9vFzvIQ3k= 
+honnef.co/go/tools v0.0.1-2020.1.4/go.mod h1:X/FiERA/W4tHapMX5mGpAtMSVEeEUOyHaw9vFzvIQ3k= +rsc.io/binaryregexp v0.2.0/go.mod h1:qTv7/COck+e2FymRvadv62gMdZztPaShugOCi3I+8D8= +rsc.io/quote/v3 v3.1.0/go.mod h1:yEA65RcK8LyAZtP9Kv3t0HmxON59tX3rD+tICJqUlj0= +rsc.io/sampler v1.3.0/go.mod h1:T1hPZKmBbMNahiBKFy5HrXp6adAjACjK9JXDnKaTXpA= diff --git a/tests/integration/nat_zero_test.go b/tests/integration/nat_zero_test.go new file mode 100644 index 0000000..e08350d --- /dev/null +++ b/tests/integration/nat_zero_test.go @@ -0,0 +1,888 @@ +package test + +import ( + "encoding/base64" + "encoding/json" + "fmt" + "strings" + "testing" + "time" + + "github.com/aws/aws-sdk-go/aws" + "github.com/aws/aws-sdk-go/aws/session" + "github.com/aws/aws-sdk-go/service/cloudwatchevents" + "github.com/aws/aws-sdk-go/service/cloudwatchlogs" + "github.com/aws/aws-sdk-go/service/ec2" + "github.com/aws/aws-sdk-go/service/iam" + "github.com/aws/aws-sdk-go/service/lambda" + "github.com/aws/aws-sdk-go/service/sqs" + "github.com/gruntwork-io/terratest/modules/retry" + "github.com/gruntwork-io/terratest/modules/terraform" + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" +) + +const ( + awsRegion = "us-east-1" + natTagKey = "nat-zero:managed" + natTagValue = "true" + testTagKey = "TerratestRun" +) + +// userDataScript generates a base64-encoded userdata script that curls +// checkip.amazonaws.com and sends the result to the given SQS queue URL. 
+func userDataScript(queueURL string) string { + return base64.StdEncoding.EncodeToString([]byte(fmt.Sprintf(`#!/bin/bash +BOOT_MS=$(($(date +%%s%%N)/1000000)) +TOKEN=$(curl -sf -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 60") +IID=$(curl -sf -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id) +REGION=$(curl -sf -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/placement/region) +for i in $(seq 1 60); do + IP=$(curl -sf --max-time 5 https://checkip.amazonaws.com) && break + sleep 2 +done +CONNECTED_MS=$(($(date +%%s%%N)/1000000)) +if [ -n "$IP" ]; then + MSG=$(printf '{"instance_id":"%%s","egress_ip":"%%s","boot_ms":%%d,"connected_ms":%%d}' "$IID" "$IP" "$BOOT_MS" "$CONNECTED_MS") + aws sqs send-message --queue-url "%s" --message-body "$MSG" --region "$REGION" +fi +`, queueURL))) +} + +// phase records the name and duration of a test phase for the timing summary. +type phase struct { + name string + duration time.Duration +} + +// TestNatZero exercises the full NAT lifecycle: deploy, NAT creation, +// connectivity, scale-down, restart, cleanup action, and terraform destroy. +func TestNatZero(t *testing.T) { + runID := fmt.Sprintf("tt-%d", time.Now().Unix()) + sess := session.Must(session.NewSession(&aws.Config{Region: aws.String(awsRegion)})) + ec2Client := ec2.New(sess) + iamClient := iam.New(sess) + lambdaClient := lambda.New(sess) + sqsClient := sqs.New(sess) + + // Timing infrastructure — records duration of each test phase. 
+ var phases []phase + record := func(name string, d time.Duration) { + phases = append(phases, phase{name, d}) + t.Logf("[TIMER] %-45s %s", name, d.Round(time.Millisecond)) + } + defer func() { + t.Log("") + t.Log("=== TIMING SUMMARY ===") + t.Logf(" %-45s %s", "PHASE", "DURATION") + t.Log(" " + strings.Repeat("-", 60)) + var total time.Duration + for _, p := range phases { + total += p.duration + t.Logf(" %-45s %s", p.name, p.duration.Round(time.Millisecond)) + } + t.Log(" " + strings.Repeat("-", 60)) + t.Logf(" %-45s %s", "TOTAL", total.Round(time.Millisecond)) + t.Log("=== END TIMING SUMMARY ===") + }() + + // Create workload IAM profile first — propagates while Terraform applies. + iamStart := time.Now() + profileName := createWorkloadProfile(t, iamClient, runID) + record("IAM profile creation", time.Since(iamStart)) + defer deleteWorkloadProfile(t, iamClient, runID) + + // Create SQS queue for workload connectivity reporting. + queueName := fmt.Sprintf("nat-test-%s", runID) + createOut, err := sqsClient.CreateQueue(&sqs.CreateQueueInput{ + QueueName: aws.String(queueName), + }) + require.NoError(t, err) + queueURL := aws.StringValue(createOut.QueueUrl) + t.Logf("Created SQS queue: %s", queueURL) + defer func() { + sqsClient.DeleteQueue(&sqs.DeleteQueueInput{QueueUrl: aws.String(queueURL)}) + t.Logf("Deleted SQS queue %s", queueName) + }() + + opts := terraform.WithDefaultRetryableErrors(t, &terraform.Options{ + TerraformDir: "./fixture", + NoColor: true, + }) + defer func() { + destroyStart := time.Now() + terraform.Destroy(t, opts) + record("Terraform destroy", time.Since(destroyStart)) + }() + tfStart := time.Now() + terraform.InitAndApply(t, opts) + record("Terraform init+apply", time.Since(tfStart)) + + vpcID := terraform.Output(t, opts, "vpc_id") + privateSubnet := terraform.Output(t, opts, "private_subnet_id") + lambdaName := terraform.Output(t, opts, "lambda_function_name") + t.Logf("VPC: %s, private subnet: %s, Lambda: %s", vpcID, privateSubnet, 
lambdaName) + + // Terminate test workload instances before terraform destroy. + defer func() { + t.Log("Terminating test workload instances...") + out, err := ec2Client.DescribeInstances(&ec2.DescribeInstancesInput{ + Filters: []*ec2.Filter{ + {Name: aws.String(fmt.Sprintf("tag:%s", testTagKey)), Values: []*string{aws.String(runID)}}, + {Name: aws.String("instance-state-name"), Values: []*string{ + aws.String("pending"), aws.String("running"), + aws.String("stopping"), aws.String("stopped"), + }}, + }, + }) + if err != nil { + t.Logf("Warning: describe instances: %v", err) + return + } + var ids []*string + for _, r := range out.Reservations { + for _, i := range r.Instances { + ids = append(ids, i.InstanceId) + } + } + if len(ids) > 0 { + t.Logf("Terminating %d test workload instances", len(ids)) + ec2Client.TerminateInstances(&ec2.TerminateInstancesInput{InstanceIds: ids}) + ec2Client.WaitUntilInstanceTerminated(&ec2.DescribeInstancesInput{InstanceIds: ids}) + } + }() + + // Dump Lambda CloudWatch logs on failure for diagnostics. + cwClient := cloudwatchlogs.New(sess) + logGroup := fmt.Sprintf("/aws/lambda/%s", lambdaName) + defer func() { + if t.Failed() { + dumpLambdaLogs(t, cwClient, logGroup) + } + }() + + amiID := getLatestAL2023AMI(t, ec2Client) + + // Shared across phases — set by Phase 1, used by Phase 2. + var workloadID string + + // ── Phase 1: NAT creation and connectivity ────────────────────────── + // Launch a workload and let EventBridge trigger the Lambda automatically. + + t.Run("NATCreationAndConnectivity", func(t *testing.T) { + wlStart := time.Now() + workloadID = launchWorkload(t, ec2Client, privateSubnet, amiID, runID, profileName, queueURL) + record("Launch workload instance", time.Since(wlStart)) + t.Logf("Launched workload %s in VPC %s", workloadID, vpcID) + + // EventBridge fires when the workload goes pending/running, + // triggering the Lambda to create a NAT and attach an EIP. 
+ t.Log("Waiting for NAT to be running with EIP (via EventBridge)...") + start := time.Now() + var natInstance *ec2.Instance + retry.DoWithRetry(t, "NAT running with EIP", 100, 2*time.Second, func() (string, error) { + nats := findNATInstances(t, ec2Client, vpcID) + for _, n := range nats { + if aws.StringValue(n.State.Name) == "running" { + for _, eni := range n.NetworkInterfaces { + if aws.Int64Value(eni.Attachment.DeviceIndex) == 0 && + eni.Association != nil && eni.Association.PublicIp != nil { + natInstance = n + return "OK", nil + } + } + return "", fmt.Errorf("NAT running but no EIP yet") + } + } + return "", fmt.Errorf("no running NAT (%d found)", len(nats)) + }) + natUpTime := time.Since(start) + record("Wait for NAT running with EIP", natUpTime) + t.Logf("NAT up with EIP in %s", natUpTime.Round(time.Millisecond)) + + // Get NAT public IP from primary ENI. + var natEIP string + for _, eni := range natInstance.NetworkInterfaces { + if aws.Int64Value(eni.Attachment.DeviceIndex) == 0 && eni.Association != nil { + natEIP = aws.StringValue(eni.Association.PublicIp) + break + } + } + require.NotEmpty(t, natEIP, "NAT should have a public IP") + + // Validate NAT tags. + hasScalingTag := false + for _, tag := range natInstance.Tags { + if aws.StringValue(tag.Key) == natTagKey && aws.StringValue(tag.Value) == natTagValue { + hasScalingTag = true + break + } + } + assert.True(t, hasScalingTag, "NAT missing tag %s=%s", natTagKey, natTagValue) + + // Validate dual ENIs (public + private). + eniIndices := map[int64]bool{} + for _, eni := range natInstance.NetworkInterfaces { + eniIndices[aws.Int64Value(eni.Attachment.DeviceIndex)] = true + } + assert.True(t, eniIndices[0] && eniIndices[1], "NAT should have ENIs at device index 0 and 1") + + assertRouteTableEntry(t, ec2Client, vpcID, natInstance) + + // Wait for workload to report its egress IP via SQS. 
+ t.Log("Waiting for workload connectivity check (SQS)...") + egressStart := time.Now() + msg := waitForEgress(t, sqsClient, queueURL, 4*time.Minute) + record("Wait for workload egress IP", time.Since(egressStart)) + if msg.ConnectedMs > 0 && msg.BootMs > 0 { + t.Logf("Workload-measured connectivity latency: %dms", msg.ConnectedMs-msg.BootMs) + } + assert.Equal(t, natEIP, msg.EgressIP, + "workload egress IP should match NAT EIP") + t.Logf("Confirmed: workload egresses via NAT EIP %s", natEIP) + }) + + // ── Phase 2: NAT scale-down ───────────────────────────────────────── + // Terminate the workload and let EventBridge drive the full + // scale-down flow: stop NAT, then detach/release EIP. + + t.Run("NATScaleDown", func(t *testing.T) { + require.NotEmpty(t, workloadID, "Phase 1 must set workloadID") + + // Terminate the workload instance. EventBridge fires shutting-down + // and terminated events which trigger the Lambda to stop the NAT. + t.Log("Terminating workload to trigger NAT scale-down...") + termStart := time.Now() + _, err := ec2Client.TerminateInstances(&ec2.TerminateInstancesInput{ + InstanceIds: []*string{aws.String(workloadID)}, + }) + require.NoError(t, err) + record("Terminate workload instance", time.Since(termStart)) + + // Wait for NAT to reach stopped state. 
+		t.Log("Waiting for NAT to stop (via EventBridge)...")
+		stopStart := time.Now()
+		retry.DoWithRetry(t, "NAT stopped", 100, 2*time.Second, func() (string, error) {
+			nats := findNATInstancesInState(t, ec2Client, vpcID,
+				[]string{"pending", "running", "stopping", "stopped"})
+			for _, n := range nats {
+				state := aws.StringValue(n.State.Name)
+				if state == "stopped" {
+					return "OK", nil
+				}
+				if state == "stopping" {
+					return "", fmt.Errorf("NAT still stopping")
+				}
+				// pending/running are expected until the EventBridge event is processed.
+				return "", fmt.Errorf("NAT not yet stopped (state: %s)", state)
+			}
+			return "", fmt.Errorf("no NAT instances found")
+		})
+		natStopTime := time.Since(stopStart)
+		record("Wait for NAT stopped", natStopTime)
+		t.Logf("NAT stopped in %s", natStopTime.Round(time.Second))
+
+		// EventBridge fires the NAT's stopping/stopped events which trigger
+		// the Lambda to detach and release the EIP automatically.
+		t.Log("Verifying EIP released (via EventBridge)...")
+		eipStart := time.Now()
+		retry.DoWithRetry(t, "EIP released", 20, 5*time.Second, func() (string, error) {
+			out, err := ec2Client.DescribeAddresses(&ec2.DescribeAddressesInput{
+				Filters: []*ec2.Filter{
+					{Name: aws.String(fmt.Sprintf("tag:%s", natTagKey)),
+						Values: []*string{aws.String(natTagValue)}},
+				},
+			})
+			if err != nil {
+				return "", err
+			}
+			if len(out.Addresses) > 0 {
+				return "", fmt.Errorf("still %d NAT EIPs", len(out.Addresses))
+			}
+			return "OK", nil
+		})
+		record("Wait for EIP released", time.Since(eipStart))
+		t.Log("NAT stopped and EIP released")
+	})
+
+	// ── Phase 3: NAT restart from stopped state ─────────────────────────
+	// Launch a new workload and let EventBridge trigger the restart.
+ + t.Run("NATRestart", func(t *testing.T) { + t.Log("Launching new workload to trigger NAT restart...") + wlStart := time.Now() + newWorkloadID := launchWorkload(t, ec2Client, privateSubnet, amiID, runID, profileName, queueURL) + record("Launch workload instance (restart)", time.Since(wlStart)) + t.Logf("Launched workload %s", newWorkloadID) + + // EventBridge fires when the new workload goes pending/running, + // triggering the Lambda to start the stopped NAT. + t.Log("Waiting for restarted NAT to be running with EIP (via EventBridge)...") + start := time.Now() + var natInstance *ec2.Instance + retry.DoWithRetry(t, "NAT restarted with EIP", 100, 2*time.Second, func() (string, error) { + nats := findNATInstances(t, ec2Client, vpcID) + for _, n := range nats { + if aws.StringValue(n.State.Name) == "running" { + for _, eni := range n.NetworkInterfaces { + if aws.Int64Value(eni.Attachment.DeviceIndex) == 0 && + eni.Association != nil && eni.Association.PublicIp != nil { + natInstance = n + return "OK", nil + } + } + return "", fmt.Errorf("NAT running but no EIP yet") + } + } + return "", fmt.Errorf("no running NAT (%d found)", len(nats)) + }) + natRestartTime := time.Since(start) + record("Wait for NAT restarted with EIP", natRestartTime) + t.Logf("NAT restarted with EIP in %s", natRestartTime.Round(time.Millisecond)) + + require.NotNil(t, natInstance, "NAT should be running") + + // Verify the restarted NAT has an EIP. + var natEIP string + for _, eni := range natInstance.NetworkInterfaces { + if aws.Int64Value(eni.Attachment.DeviceIndex) == 0 && eni.Association != nil { + natEIP = aws.StringValue(eni.Association.PublicIp) + break + } + } + require.NotEmpty(t, natEIP, "Restarted NAT should have a public IP") + t.Logf("Restarted NAT has EIP %s", natEIP) + + // Verify connectivity — wait for new workload to report egress IP via SQS. 
+		t.Log("Waiting for workload connectivity via restarted NAT (SQS)...")
+		egressStart := time.Now()
+		msg := waitForEgress(t, sqsClient, queueURL, 4*time.Minute)
+		record("Wait for workload egress IP (restart)", time.Since(egressStart))
+		if msg.ConnectedMs > 0 && msg.BootMs > 0 {
+			t.Logf("Workload-measured connectivity latency: %dms", msg.ConnectedMs-msg.BootMs)
+		}
+		require.NotEmpty(t, msg.EgressIP, "workload should have internet connectivity via restarted NAT")
+		if msg.EgressIP == natEIP {
+			t.Logf("Workload egresses via NAT EIP %s", natEIP)
+		} else {
+			t.Logf("Workload egressed via the NAT's auto-assigned public IP %s before EIP %s was attached (expected during restart)", msg.EgressIP, natEIP)
+		}
+	})
+
+	// ── Phase 4: Cleanup action ─────────────────────────────────────────
+
+	t.Run("CleanupAction", func(t *testing.T) {
+		// Count EIPs tagged by the Lambda before cleanup.
+		addrOut, err := ec2Client.DescribeAddresses(&ec2.DescribeAddressesInput{
+			Filters: []*ec2.Filter{
+				{Name: aws.String(fmt.Sprintf("tag:%s", natTagKey)),
+					Values: []*string{aws.String(natTagValue)}},
+			},
+		})
+		require.NoError(t, err)
+		require.Greater(t, len(addrOut.Addresses), 0, "should have at least one NAT EIP before cleanup")
+
+		t.Log("Invoking Lambda with cleanup action...")
+		cleanupStart := time.Now()
+		invokeLambda(t, lambdaClient, lambdaName, map[string]string{"action": "cleanup"})
+		record("Lambda invoke (cleanup)", time.Since(cleanupStart))
+
+		// Verify NAT instances are terminated.
+		t.Log("Verifying NAT instances terminated...")
+		natTermStart := time.Now()
+		retry.DoWithRetry(t, "NAT terminated", 20, 5*time.Second, func() (string, error) {
+			nats := findNATInstances(t, ec2Client, vpcID)
+			if len(nats) > 0 {
+				return "", fmt.Errorf("still %d running NAT instances", len(nats))
+			}
+			return "OK", nil
+		})
+		record("Wait for NAT terminated", time.Since(natTermStart))
+
+		// Verify EIPs are released.
+ t.Log("Verifying EIPs released...") + eipStart := time.Now() + retry.DoWithRetry(t, "EIPs released", 10, 5*time.Second, func() (string, error) { + out, err := ec2Client.DescribeAddresses(&ec2.DescribeAddressesInput{ + Filters: []*ec2.Filter{ + {Name: aws.String(fmt.Sprintf("tag:%s", natTagKey)), + Values: []*string{aws.String(natTagValue)}}, + }, + }) + if err != nil { + return "", err + } + if len(out.Addresses) > 0 { + return "", fmt.Errorf("still %d NAT EIPs", len(out.Addresses)) + } + return "OK", nil + }) + record("Wait for EIPs released", time.Since(eipStart)) + t.Log("Cleanup action verified: NAT instances terminated and EIPs released") + }) + + // terraform destroy runs via deferred cleanup — should succeed cleanly + // since the cleanup action already removed Lambda-created resources. +} + +// ── Lambda helpers ──────────────────────────────────────────────────────── + +// invokeLambda calls the nat-zero Lambda with the given payload. Requests log +// tailing to capture and display the Lambda REPORT line. +func invokeLambda(t *testing.T, client *lambda.Lambda, funcName string, payload map[string]string) { + t.Helper() + body, _ := json.Marshal(payload) + out, err := client.Invoke(&lambda.InvokeInput{ + FunctionName: aws.String(funcName), + Payload: body, + LogType: aws.String("Tail"), + }) + require.NoError(t, err, "Lambda invocation failed") + if out.FunctionError != nil { + t.Fatalf("Lambda returned error (%s): %s", + aws.StringValue(out.FunctionError), string(out.Payload)) + } + + if out.LogResult != nil { + logBytes, _ := base64.StdEncoding.DecodeString(aws.StringValue(out.LogResult)) + for _, line := range strings.Split(string(logBytes), "\n") { + trimmed := strings.TrimSpace(line) + if strings.HasPrefix(trimmed, "REPORT") { + t.Logf("[LAMBDA REPORT] %s", trimmed) + } + } + } + + t.Logf("Lambda invoked: %v", payload) +} + +// dumpLambdaLogs prints recent Lambda CloudWatch log events for post-mortem debugging. 
+func dumpLambdaLogs(t *testing.T, client *cloudwatchlogs.CloudWatchLogs, logGroup string) { + t.Helper() + t.Logf("=== Lambda logs from %s ===", logGroup) + streams, err := client.DescribeLogStreams(&cloudwatchlogs.DescribeLogStreamsInput{ + LogGroupName: aws.String(logGroup), + OrderBy: aws.String("LastEventTime"), + Descending: aws.Bool(true), + Limit: aws.Int64(5), + }) + if err != nil || len(streams.LogStreams) == 0 { + t.Log("No log streams found") + return + } + for _, stream := range streams.LogStreams { + t.Logf("--- stream: %s ---", aws.StringValue(stream.LogStreamName)) + events, err := client.GetLogEvents(&cloudwatchlogs.GetLogEventsInput{ + LogGroupName: aws.String(logGroup), + LogStreamName: stream.LogStreamName, + StartFromHead: aws.Bool(false), + Limit: aws.Int64(50), + }) + if err != nil { + t.Logf("Warning: could not read log events: %v", err) + continue + } + for _, e := range events.Events { + t.Logf(" [%s] %s", + time.UnixMilli(aws.Int64Value(e.Timestamp)).UTC().Format("15:04:05"), + strings.TrimSpace(aws.StringValue(e.Message))) + } + } + t.Log("=== End Lambda logs ===") +} + +// ── IAM (workload needs sqs:SendMessage for connectivity reporting) ────── + +func createWorkloadProfile(t *testing.T, client *iam.IAM, runID string) string { + t.Helper() + name := fmt.Sprintf("nat-test-wl-%s", runID) + tags := []*iam.Tag{{Key: aws.String(testTagKey), Value: aws.String(runID)}} + + _, err := client.CreateRole(&iam.CreateRoleInput{ + RoleName: aws.String(name), + Tags: tags, + AssumeRolePolicyDocument: aws.String(`{ + "Version":"2012-10-17", + "Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}] + }`), + }) + require.NoError(t, err) + + _, err = client.PutRolePolicy(&iam.PutRolePolicyInput{ + RoleName: aws.String(name), + PolicyName: aws.String("sqs-send"), + PolicyDocument: aws.String(`{ + "Version":"2012-10-17", + "Statement":[{"Effect":"Allow","Action":"sqs:SendMessage","Resource":"*"}] + }`), + }) + 
require.NoError(t, err) + + _, err = client.CreateInstanceProfile(&iam.CreateInstanceProfileInput{ + InstanceProfileName: aws.String(name), + Tags: tags, + }) + require.NoError(t, err) + + _, err = client.AddRoleToInstanceProfile(&iam.AddRoleToInstanceProfileInput{ + InstanceProfileName: aws.String(name), + RoleName: aws.String(name), + }) + require.NoError(t, err) + return name +} + +func deleteWorkloadProfile(t *testing.T, client *iam.IAM, runID string) { + t.Helper() + name := fmt.Sprintf("nat-test-wl-%s", runID) + client.RemoveRoleFromInstanceProfile(&iam.RemoveRoleFromInstanceProfileInput{ + InstanceProfileName: aws.String(name), RoleName: aws.String(name), + }) + client.DeleteInstanceProfile(&iam.DeleteInstanceProfileInput{ + InstanceProfileName: aws.String(name), + }) + client.DeleteRolePolicy(&iam.DeleteRolePolicyInput{ + RoleName: aws.String(name), PolicyName: aws.String("sqs-send"), + }) + client.DeleteRole(&iam.DeleteRoleInput{RoleName: aws.String(name)}) + t.Logf("Deleted IAM profile %s", name) +} + +// ── EC2 helpers ────────────────────────────────────────────────────────── + +func getLatestAL2023AMI(t *testing.T, c *ec2.EC2) string { + t.Helper() + out, err := c.DescribeImages(&ec2.DescribeImagesInput{ + Owners: []*string{aws.String("amazon")}, + Filters: []*ec2.Filter{ + {Name: aws.String("name"), Values: []*string{aws.String("al2023-ami-2023*-arm64")}}, + {Name: aws.String("state"), Values: []*string{aws.String("available")}}, + }, + }) + require.NoError(t, err) + var latest *ec2.Image + for _, img := range out.Images { + if strings.Contains(aws.StringValue(img.Name), "minimal") { + continue + } + if latest == nil || aws.StringValue(img.CreationDate) > aws.StringValue(latest.CreationDate) { + latest = img + } + } + require.NotNil(t, latest, "no standard AL2023 ARM64 AMI found") + return aws.StringValue(latest.ImageId) +} + +func findNATInstances(t *testing.T, c *ec2.EC2, vpcID string) []*ec2.Instance { + t.Helper() + return 
findNATInstancesInState(t, c, vpcID, []string{"pending", "running"}) +} + +func findNATInstancesInState(t *testing.T, c *ec2.EC2, vpcID string, states []string) []*ec2.Instance { + t.Helper() + stateValues := make([]*string, len(states)) + for i, s := range states { + stateValues[i] = aws.String(s) + } + out, err := c.DescribeInstances(&ec2.DescribeInstancesInput{ + Filters: []*ec2.Filter{ + {Name: aws.String(fmt.Sprintf("tag:%s", natTagKey)), Values: []*string{aws.String(natTagValue)}}, + {Name: aws.String("vpc-id"), Values: []*string{aws.String(vpcID)}}, + {Name: aws.String("instance-state-name"), Values: stateValues}, + }, + }) + require.NoError(t, err) + var res []*ec2.Instance + for _, r := range out.Reservations { + res = append(res, r.Instances...) + } + return res +} + +func launchWorkload(t *testing.T, c *ec2.EC2, subnet, ami, runID, profile, queueURL string) string { + t.Helper() + out, err := c.RunInstances(&ec2.RunInstancesInput{ + ImageId: aws.String(ami), + InstanceType: aws.String("t4g.nano"), + SubnetId: aws.String(subnet), + MinCount: aws.Int64(1), + MaxCount: aws.Int64(1), + UserData: aws.String(userDataScript(queueURL)), + IamInstanceProfile: &ec2.IamInstanceProfileSpecification{ + Name: aws.String(profile), + }, + TagSpecifications: []*ec2.TagSpecification{{ + ResourceType: aws.String("instance"), + Tags: []*ec2.Tag{ + {Key: aws.String("Name"), Value: aws.String("nat-zero-test-workload")}, + {Key: aws.String(testTagKey), Value: aws.String(runID)}, + }, + }}, + }) + require.NoError(t, err) + return aws.StringValue(out.Instances[0].InstanceId) +} + +// egressMessage is the JSON payload the workload sends to SQS on connectivity. +type egressMessage struct { + InstanceID string `json:"instance_id"` + EgressIP string `json:"egress_ip"` + BootMs int64 `json:"boot_ms"` + ConnectedMs int64 `json:"connected_ms"` +} + +// waitForEgress uses SQS long polling to wait for a workload to report its +// egress IP. 
Returns near-instantly when the message arrives instead of +// polling EC2 tags every 5 seconds. +func waitForEgress(t *testing.T, client *sqs.SQS, queueURL string, timeout time.Duration) egressMessage { + t.Helper() + deadline := time.Now().Add(timeout) + for time.Now().Before(deadline) { + out, err := client.ReceiveMessage(&sqs.ReceiveMessageInput{ + QueueUrl: aws.String(queueURL), + MaxNumberOfMessages: aws.Int64(1), + WaitTimeSeconds: aws.Int64(20), + }) + require.NoError(t, err) + if len(out.Messages) > 0 { + // Delete the message so it doesn't interfere with the next phase. + client.DeleteMessage(&sqs.DeleteMessageInput{ + QueueUrl: aws.String(queueURL), + ReceiptHandle: out.Messages[0].ReceiptHandle, + }) + var msg egressMessage + require.NoError(t, json.Unmarshal([]byte(aws.StringValue(out.Messages[0].Body)), &msg)) + return msg + } + } + t.Fatalf("timed out waiting for egress message on SQS queue %s", queueURL) + return egressMessage{} // unreachable +} + +func assertRouteTableEntry(t *testing.T, c *ec2.EC2, vpcID string, nat *ec2.Instance) { + t.Helper() + var privateENI string + for _, eni := range nat.NetworkInterfaces { + if aws.Int64Value(eni.Attachment.DeviceIndex) == 1 { + privateENI = aws.StringValue(eni.NetworkInterfaceId) + } + } + require.NotEmpty(t, privateENI) + + out, err := c.DescribeRouteTables(&ec2.DescribeRouteTablesInput{ + Filters: []*ec2.Filter{ + {Name: aws.String("vpc-id"), Values: []*string{aws.String(vpcID)}}, + {Name: aws.String("route.destination-cidr-block"), Values: []*string{aws.String("0.0.0.0/0")}}, + }, + }) + require.NoError(t, err) + + for _, rt := range out.RouteTables { + for _, r := range rt.Routes { + if aws.StringValue(r.DestinationCidrBlock) == "0.0.0.0/0" && + strings.EqualFold(aws.StringValue(r.NetworkInterfaceId), privateENI) { + return + } + } + } + t.Errorf("no 0.0.0.0/0 route pointing to NAT private ENI %s", privateENI) +} + +// ── Orphan detection ───────────────────────────────────────────────────── + +// 
TestNoOrphanedResources searches for resources left behind by previous test +// runs. It runs last (Go runs tests in source order within a package) and +// reports any orphans so they can be cleaned up. +func TestNoOrphanedResources(t *testing.T) { + sess := session.Must(session.NewSession(&aws.Config{Region: aws.String(awsRegion)})) + ec2Client := ec2.New(sess) + iamClient := iam.New(sess) + lambdaClient := lambda.New(sess) + cwClient := cloudwatchlogs.New(sess) + sqsClient := sqs.New(sess) + + const testPrefix = "nat-test" + checks := []struct { + name string + checkFn func() []string + }{ + {"Subnets", func() []string { + out, err := ec2Client.DescribeSubnets(&ec2.DescribeSubnetsInput{ + Filters: []*ec2.Filter{ + {Name: aws.String("cidr-block"), Values: []*string{aws.String("172.31.128.0/24")}}, + }, + }) + if err != nil { + return nil + } + var found []string + for _, s := range out.Subnets { + found = append(found, fmt.Sprintf("Subnet %s (%s)", + aws.StringValue(s.SubnetId), aws.StringValue(s.CidrBlock))) + } + return found + }}, + {"ENIs", func() []string { + out, err := ec2Client.DescribeNetworkInterfaces(&ec2.DescribeNetworkInterfacesInput{ + Filters: []*ec2.Filter{ + {Name: aws.String("tag:Name"), Values: []*string{aws.String(testPrefix + "-*")}}, + }, + }) + if err != nil { + return nil + } + var found []string + for _, e := range out.NetworkInterfaces { + name := "" + for _, tag := range e.TagSet { + if aws.StringValue(tag.Key) == "Name" { + name = aws.StringValue(tag.Value) + } + } + found = append(found, fmt.Sprintf("ENI %s (%s, %s)", + aws.StringValue(e.NetworkInterfaceId), name, aws.StringValue(e.Status))) + } + return found + }}, + {"SecurityGroups", func() []string { + out, err := ec2Client.DescribeSecurityGroups(&ec2.DescribeSecurityGroupsInput{ + Filters: []*ec2.Filter{ + {Name: aws.String("group-name"), Values: []*string{aws.String(testPrefix + "-*")}}, + }, + }) + if err != nil { + return nil + } + var found []string + for _, sg := range 
out.SecurityGroups { + found = append(found, fmt.Sprintf("SecurityGroup %s (%s)", + aws.StringValue(sg.GroupId), aws.StringValue(sg.GroupName))) + } + return found + }}, + {"LaunchTemplates", func() []string { + out, err := ec2Client.DescribeLaunchTemplates(&ec2.DescribeLaunchTemplatesInput{ + Filters: []*ec2.Filter{ + {Name: aws.String("launch-template-name"), Values: []*string{aws.String(testPrefix + "-*")}}, + }, + }) + if err != nil { + return nil + } + var found []string + for _, lt := range out.LaunchTemplates { + found = append(found, fmt.Sprintf("LaunchTemplate %s (%s)", + aws.StringValue(lt.LaunchTemplateId), aws.StringValue(lt.LaunchTemplateName))) + } + return found + }}, + {"EventBridgeRules", func() []string { + out, err := cloudwatchevents.New(sess).ListRules(&cloudwatchevents.ListRulesInput{ + NamePrefix: aws.String(testPrefix), + }) + if err != nil { + return nil + } + var found []string + for _, r := range out.Rules { + found = append(found, fmt.Sprintf("EventBridgeRule %s", aws.StringValue(r.Name))) + } + return found + }}, + {"Lambda", func() []string { + _, err := lambdaClient.GetFunction(&lambda.GetFunctionInput{ + FunctionName: aws.String(testPrefix + "-nat-zero"), + }) + if err == nil { + return []string{"Lambda nat-test-nat-zero"} + } + return nil + }}, + {"LogGroups", func() []string { + out, err := cwClient.DescribeLogGroups(&cloudwatchlogs.DescribeLogGroupsInput{ + LogGroupNamePrefix: aws.String("/aws/lambda/" + testPrefix), + }) + if err != nil { + return nil + } + var found []string + for _, lg := range out.LogGroups { + found = append(found, fmt.Sprintf("LogGroup %s", aws.StringValue(lg.LogGroupName))) + } + return found + }}, + {"IAMRoles", func() []string { + out, err := iamClient.ListRoles(&iam.ListRolesInput{}) + if err != nil { + return nil + } + var found []string + for _, r := range out.Roles { + if strings.HasPrefix(aws.StringValue(r.RoleName), testPrefix) { + found = append(found, fmt.Sprintf("IAMRole %s", 
aws.StringValue(r.RoleName))) + } + } + return found + }}, + {"IAMProfiles", func() []string { + out, err := iamClient.ListInstanceProfiles(&iam.ListInstanceProfilesInput{}) + if err != nil { + return nil + } + var found []string + for _, p := range out.InstanceProfiles { + if strings.HasPrefix(aws.StringValue(p.InstanceProfileName), testPrefix) { + found = append(found, fmt.Sprintf("IAMInstanceProfile %s", + aws.StringValue(p.InstanceProfileName))) + } + } + return found + }}, + {"EIPs", func() []string { + out, err := ec2Client.DescribeAddresses(&ec2.DescribeAddressesInput{ + Filters: []*ec2.Filter{ + {Name: aws.String(fmt.Sprintf("tag:%s", natTagKey)), + Values: []*string{aws.String(natTagValue)}}, + {Name: aws.String(fmt.Sprintf("tag:%s", testTagKey)), + Values: []*string{aws.String("*")}}, + }, + }) + if err != nil { + return nil + } + var found []string + for _, a := range out.Addresses { + found = append(found, fmt.Sprintf("EIP %s (%s)", + aws.StringValue(a.AllocationId), aws.StringValue(a.PublicIp))) + } + return found + }}, + {"SQSQueues", func() []string { + out, err := sqsClient.ListQueues(&sqs.ListQueuesInput{ + QueueNamePrefix: aws.String(testPrefix), + }) + if err != nil { + return nil + } + var found []string + for _, u := range out.QueueUrls { + found = append(found, fmt.Sprintf("SQSQueue %s", aws.StringValue(u))) + } + return found + }}, + } + + var orphans []string + for _, c := range checks { + orphans = append(orphans, c.checkFn()...) 
+ } + + if len(orphans) > 0 { + t.Log("Orphaned resources detected from previous test runs:") + for _, o := range orphans { + t.Logf(" - %s", o) + } + t.Errorf("found %d orphaned test resources — clean up manually or investigate failed runs", len(orphans)) + } else { + t.Log("No orphaned test resources found") + } +} diff --git a/variables.tf b/variables.tf new file mode 100644 index 0000000..74e9d65 --- /dev/null +++ b/variables.tf @@ -0,0 +1,147 @@ +variable "name" { + type = string + description = "Name prefix for all resources created by this module" +} + +variable "tags" { + type = map(string) + default = {} + description = "Additional tags to apply to all resources" +} + +variable "vpc_id" { + type = string + description = "The VPC ID where NAT instances will be deployed" +} + +variable "availability_zones" { + type = list(string) + description = "List of availability zones to deploy NAT instances in" +} + +variable "public_subnets" { + type = list(string) + description = "Public subnet IDs (one per AZ) for NAT instance public ENIs" +} + +variable "private_subnets" { + type = list(string) + description = "Private subnet IDs (one per AZ) for NAT instance private ENIs" +} + +variable "private_route_table_ids" { + type = list(string) + description = "Route table IDs for the private subnets (one per AZ)" +} + +variable "private_subnets_cidr_blocks" { + type = list(string) + description = "CIDR blocks for the private subnets (one per AZ, used in security group rules)" +} + +variable "instance_type" { + type = string + default = "t4g.nano" + description = "Instance type for the NAT instance" +} + +variable "market_type" { + type = string + default = "on-demand" + description = "Whether to use spot or on-demand instances" + + validation { + condition = contains(["spot", "on-demand"], var.market_type) + error_message = "Must be 'spot' or 'on-demand'." 
+ } +} + +variable "block_device_size" { + type = number + default = 10 + description = "Size in GB of the root EBS volume" +} + +# AMI configuration +variable "use_fck_nat_ami" { + type = bool + default = true + description = "Use the public fck-nat AMI. Set to false to use a custom AMI." +} + +variable "ami_id" { + type = string + default = null + description = "Explicit AMI ID to use (overrides AMI lookup entirely)" +} + +variable "custom_ami_owner" { + type = string + default = null + description = "AMI owner account ID when use_fck_nat_ami is false" +} + +variable "custom_ami_name_pattern" { + type = string + default = null + description = "AMI name pattern when use_fck_nat_ami is false" +} + +variable "nat_tag_key" { + type = string + default = "nat-zero:managed" + description = "Tag key used to identify NAT instances" +} + +variable "nat_tag_value" { + type = string + default = "true" + description = "Tag value used to identify NAT instances" +} + +variable "ignore_tag_key" { + type = string + default = "nat-zero:ignore" + description = "Tag key used to mark instances the Lambda should ignore" +} + +variable "ignore_tag_value" { + type = string + default = "true" + description = "Tag value used to mark instances the Lambda should ignore" +} + +variable "lambda_memory_size" { + type = number + default = 256 + description = "Memory allocated to the Lambda function in MB (also scales CPU proportionally)" + + validation { + condition = var.lambda_memory_size >= 128 && var.lambda_memory_size <= 3008 + error_message = "lambda_memory_size must be between 128 and 3008 MB." 
+ } +} + +variable "enable_logging" { + type = bool + default = true + description = "Create a CloudWatch log group for the Lambda function" +} + +variable "log_retention_days" { + type = number + default = 14 + description = "CloudWatch log retention in days (only used when enable_logging is true)" +} + +variable "build_lambda_locally" { + type = bool + default = false + description = "Build the Lambda binary from Go source instead of downloading a pre-compiled release. Requires Go and zip installed locally." +} + +variable "lambda_binary_url" { + type = string + default = "https://github.com/MachineDotDev/nat-zero/releases/download/nat-zero-lambda-latest/lambda.zip" + description = "URL to the pre-compiled Go Lambda zip. Updated automatically by CI." +} diff --git a/versions.tf b/versions.tf new file mode 100644 index 0000000..dd2367d --- /dev/null +++ b/versions.tf @@ -0,0 +1,18 @@ +terraform { + required_version = ">= 1.3" + + required_providers { + aws = { + source = "hashicorp/aws" + version = ">= 5.0" + } + null = { + source = "hashicorp/null" + version = ">= 3.0" + } + time = { + source = "hashicorp/time" + version = ">= 0.9" + } + } +} From 813f0e92291011eaa1961f37696c5ce9af2b354d Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Tue, 24 Feb 2026 17:40:57 +1000 Subject: [PATCH 02/30] fix: add pull_request trigger to integration tests workflow_dispatch runs don't report as PR checks. Add pull_request with path filters so it triggers on relevant changes and shows on PRs. do_not_enforce_on_create in the ruleset means it won't block PRs that don't touch these paths. 
Co-Authored-By: Claude Opus 4.6 --- .github/workflows/integration-tests.yml | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/.github/workflows/integration-tests.yml b/.github/workflows/integration-tests.yml index 3d7e9a2..5b3a1f9 100644 --- a/.github/workflows/integration-tests.yml +++ b/.github/workflows/integration-tests.yml @@ -1,6 +1,11 @@ name: Integration Tests on: + pull_request: + paths: + - "*.tf" + - "cmd/lambda/**" + - "tests/**" workflow_dispatch: concurrency: From 9bb725562fdde8e314e6359e5cd41cfe077d2c76 Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Tue, 24 Feb 2026 17:42:26 +1000 Subject: [PATCH 03/30] ci: trigger PR checks From 512b311859c0bb2f05daa7bf25fe282a496606d0 Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Tue, 24 Feb 2026 17:44:56 +1000 Subject: [PATCH 04/30] fix: use label trigger for integration tests Integration tests run when the 'integration-test' label is added to a PR, not on every push. Add the label to trigger, remove and re-add to re-trigger. Also available via workflow_dispatch. 
Co-Authored-By: Claude Opus 4.6 --- .github/workflows/integration-tests.yml | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/.github/workflows/integration-tests.yml b/.github/workflows/integration-tests.yml index 5b3a1f9..a8ba9d0 100644 --- a/.github/workflows/integration-tests.yml +++ b/.github/workflows/integration-tests.yml @@ -2,10 +2,7 @@ name: Integration Tests on: pull_request: - paths: - - "*.tf" - - "cmd/lambda/**" - - "tests/**" + types: [labeled] workflow_dispatch: concurrency: @@ -18,6 +15,9 @@ permissions: jobs: integration-test: + if: >- + github.event_name == 'workflow_dispatch' || + github.event.label.name == 'integration-test' runs-on: ubuntu-latest timeout-minutes: 15 environment: integration From 1b50b10f744a99bbefc80a9bd9828c6a271046b8 Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Tue, 24 Feb 2026 17:52:21 +1000 Subject: [PATCH 05/30] ci: re-trigger checks From fc69f8e83221b1a5ff6a5e81c497d3f7dda174b3 Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Wed, 25 Feb 2026 07:52:26 +1000 Subject: [PATCH 06/30] docs: rewrite docs, deduplicate terraform-docs, fix integration test AZ - Rewrite README and docs to be more welcoming and compelling - Highlight Go rewrite (90% faster cold starts), real integration tests - Escape $ signs in markdown to prevent GitHub LaTeX rendering - Deduplicate terraform-docs: single source of truth via pre-commit hooks injecting into README.md and replacing docs/REFERENCE.md - Remove docs/README.md (replaced by docs/REFERENCE.md) - Move SECURITY.md and WORKFLOWS.md into docs/ - Add WORKFLOWS.md documenting CI/CD workflows and repo rulesets - Pin integration test fixture to us-east-1a (t4g.nano unsupported in us-east-1e) Co-Authored-By: Claude Opus 4.6 --- .pre-commit-config.yaml | 7 +- .terraform-docs-reference.yml | 5 + .terraform-docs.yml | 9 +- README.md | 221 +++++++++++++++++------------- docs/ARCHITECTURE.md | 14 +- docs/EXAMPLES.md | 2 +- docs/INDEX.md | 200 
+++++++++++---------------- docs/PERFORMANCE.md | 36 ++--- docs/README.md | 86 ------------ docs/REFERENCE.md | 2 +- SECURITY.md => docs/SECURITY.md | 0 docs/TESTING.md | 4 +- docs/WORKFLOWS.md | 141 +++++++++++++++++++ tests/integration/fixture/main.tf | 4 + 14 files changed, 392 insertions(+), 339 deletions(-) create mode 100644 .terraform-docs-reference.yml delete mode 100644 docs/README.md rename SECURITY.md => docs/SECURITY.md (100%) create mode 100644 docs/WORKFLOWS.md diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index d0392cb..f6ccb9e 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -40,5 +40,8 @@ repos: rev: "v0.16.0" hooks: - id: terraform-docs-go - name: terraform-docs - args: ["-c", ".terraform-docs.yml", "markdown", "table", "--output-file", "docs/README.md", "."] + name: terraform-docs (README.md) + args: ["--output-mode", "inject", "--output-file", "README.md", "."] + - id: terraform-docs-go + name: terraform-docs (docs/REFERENCE.md) + args: ["-c", ".terraform-docs-reference.yml", "--output-mode", "replace", "--output-file", "docs/REFERENCE.md", "."] diff --git a/.terraform-docs-reference.yml b/.terraform-docs-reference.yml new file mode 100644 index 0000000..f1ba32b --- /dev/null +++ b/.terraform-docs-reference.yml @@ -0,0 +1,5 @@ +formatter: "markdown table" + +output: + template: | + {{ .Content }} diff --git a/.terraform-docs.yml b/.terraform-docs.yml index 3c016eb..8e30c37 100644 --- a/.terraform-docs.yml +++ b/.terraform-docs.yml @@ -1,8 +1 @@ -formatter: "markdown" - -output: - file: "docs/README.md" - mode: replace - template: |- - {{ .Content }} - {{/** End of file fixer */}} +formatter: "markdown table" diff --git a/README.md b/README.md index e342ec1..5a62f85 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,10 @@ # nat-zero -Scale-to-zero NAT instances for AWS. Uses [fck-nat](https://fck-nat.dev/) AMIs. Zero cost when idle. 
+**Scale-to-zero NAT instances for AWS.** Stop paying for NAT when nothing is running. + +nat-zero is a Terraform module that brings event-driven, scale-to-zero NAT to your AWS VPCs. When a workload starts in a private subnet, a NAT instance spins up automatically. When the last workload stops, the NAT shuts down and its Elastic IP is released. You pay nothing while idle -- just ~\$0.80/mo for a stopped EBS volume. + +Built on [fck-nat](https://fck-nat.dev/) AMIs. Orchestrated by a Go Lambda with a 55 ms cold start. Proven by real integration tests that deploy infrastructure and verify connectivity end-to-end. ``` CONTROL PLANE @@ -26,35 +30,42 @@ Scale-to-zero NAT instances for AWS. Uses [fck-nat](https://fck-nat.dev/) AMIs. └──────────────────┘ ``` -## How It Works +## Why nat-zero? -An EventBridge rule captures all EC2 instance state changes. A Lambda function evaluates each event and manages NAT instance lifecycle per-AZ: +AWS NAT Gateway costs a minimum of ~\$36/month per AZ -- even if nothing is using it. fck-nat brings that down to ~\$7-8/month, but the instance and its public IP still run 24/7. -- **Workload starts** in a private subnet → Lambda starts (or creates) a NAT instance in the same AZ and attaches an Elastic IP -- **Last workload stops** in an AZ → Lambda stops the NAT instance and releases the Elastic IP -- **NAT instance starts** → Lambda attaches an EIP to the public ENI -- **NAT instance stops** → Lambda detaches and releases the EIP +**nat-zero takes it further.** When your private subnets are idle, there's no NAT instance running and no Elastic IP allocated. Your cost drops to the price of a stopped 2 GB EBS volume: about 80 cents a month. -Each NAT instance uses dual ENIs (public + private) pre-created by Terraform. Traffic from private subnets routes through the private ENI, gets masqueraded via iptables, and exits through the public ENI with an Elastic IP. 
+This matters most for: -See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for detailed diagrams, [docs/PERFORMANCE.md](docs/PERFORMANCE.md) for timing and cost data, and [docs/TEST.md](docs/TEST.md) for integration test documentation. +- **Dev and staging environments** that sit idle nights and weekends +- **CI/CD runners** that spin up for minutes, then disappear for hours +- **Batch and cron workloads** that run periodically +- **Side projects** where every dollar counts -## When To Use This Module +### Cost comparison (per AZ, per month) -| Use Case | This Module | fck-nat | NAT Gateway | -|---|---|---|---| -| Dev/staging with intermittent workloads | **Best fit** | Wasteful | Very wasteful | -| Production 24/7 workloads | Overkill | **Best fit** | Simplest | -| Cost-obsessive environments | **Best fit** | Good | Expensive | -| Simplicity priority | More moving parts | **Simpler** | Simplest | +| State | nat-zero | fck-nat | NAT Gateway | +|-------|----------|---------|-------------| +| **Idle** (no workloads) | **~\$0.80** | ~\$7-8 | ~\$36+ | +| **Active** (workloads running) | ~\$7-8 | ~\$7-8 | ~\$36+ | -**Use this module** when your private subnet workloads run intermittently (CI/CD, dev environments, batch jobs) and you want to pay nothing when idle. +The key: nat-zero **releases the Elastic IP when idle**, avoiding the [\$3.60/month public IPv4 charge](https://aws.amazon.com/blogs/aws/new-aws-public-ipv4-address-charge-public-ip-insights/) that fck-nat and NAT Gateway pay around the clock. -**Use fck-nat** when workloads run 24/7 and you want simplicity with ASG self-healing. +## How it works -**Use NAT Gateway** when you prioritize simplicity and availability over cost. +An EventBridge rule watches for EC2 instance state changes in your VPC. 
A Lambda function reacts to each event: -## Usage +- **Workload starts** in a private subnet -- Lambda creates (or restarts) a NAT instance in that AZ and attaches an Elastic IP +- **Last workload stops** in an AZ -- Lambda stops the NAT instance and releases the Elastic IP +- **NAT instance reaches "running"** -- Lambda attaches an EIP to the public ENI +- **NAT instance reaches "stopped"** -- Lambda detaches and releases the EIP + +Each NAT instance uses two persistent ENIs (public + private) pre-created by Terraform. They survive stop/start cycles, so route tables stay intact and there's no need to reconfigure anything when a NAT comes back. + +See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for detailed event flows and sequence diagrams. + +## Quick start ```hcl module "nat_zero" { @@ -75,115 +86,137 @@ module "nat_zero" { } ``` -See [`examples/basic/`](examples/basic/) for a complete working example. +See [docs/EXAMPLES.md](docs/EXAMPLES.md) for complete working configurations including spot instances, custom AMIs, and building from source. -## Cost Estimate +## Performance -Per AZ, per month. Accounts for the [AWS public IPv4 charge](https://aws.amazon.com/blogs/aws/new-aws-public-ipv4-address-charge-public-ip-insights/) ($0.005/hr per public IP, effective Feb 2024). +The orchestrator Lambda is written in Go and compiled to a native ARM64 binary. It was rewritten from Python to eliminate cold start overhead -- init latency dropped from 667 ms to 55 ms, a **90% improvement**. Peak memory usage went from 98 MB down to 30 MB. 
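The four lifecycle rules above can be sketched as a single pure decision function. This is an illustrative model only, not the actual handler -- the names (`StateChange`, `Decide`, `RunningWorkloadsInAZ`) are hypothetical and not taken from `cmd/lambda`:

```go
package main

import "fmt"

// Action is what the orchestrator would do in response to one event.
type Action string

const (
	StartNAT   Action = "start-nat-and-attach-eip"
	StopNAT    Action = "stop-nat-and-release-eip"
	AttachEIP  Action = "attach-eip"
	ReleaseEIP Action = "release-eip"
	Ignore     Action = "ignore"
)

// StateChange is a simplified view of an EC2 state-change event after the
// Lambda has resolved the instance's VPC, AZ, and tags.
type StateChange struct {
	State         string // "running" or "stopped"
	InTargetVPC   bool   // the EventBridge rule sees the whole account; the Lambda filters by VPC
	IsNATInstance bool   // carries the nat-zero:managed tag
	// RunningWorkloadsInAZ counts non-NAT instances still running in the
	// same AZ after this event is applied.
	RunningWorkloadsInAZ int
}

// Decide mirrors the lifecycle rules from the README, one branch per rule.
func Decide(e StateChange) Action {
	if !e.InTargetVPC {
		return Ignore
	}
	if e.IsNATInstance {
		switch e.State {
		case "running":
			return AttachEIP // NAT reached "running": attach an EIP to the public ENI
		case "stopped":
			return ReleaseEIP // NAT reached "stopped": detach and release the EIP
		}
		return Ignore
	}
	switch e.State {
	case "running":
		return StartNAT // a workload started: ensure a NAT exists in this AZ
	case "stopped":
		if e.RunningWorkloadsInAZ == 0 {
			return StopNAT // the last workload in the AZ stopped
		}
	}
	return Ignore
}

func main() {
	fmt.Println(Decide(StateChange{State: "running", InTargetVPC: true}))
	fmt.Println(Decide(StateChange{State: "stopped", InTargetVPC: true}))
}
```

Keeping the decision logic pure like this makes it unit-testable without AWS mocks; the `cmd/lambda/ec2iface.go` file in this patch suggests the real handler similarly isolates EC2 API calls behind an interface.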
-| State | This Module | fck-nat | NAT Gateway | -|-------|------------|---------|-------------| -| **Idle** (no workloads) | **~$0.80** (EBS only) | ~$7-8 (instance + EIP) | ~$36+ ($32 gw + $3.60 IP) | -| **Active** (workloads running) | ~$7-8 (instance + EBS + EIP) | ~$7-8 (same) | ~$36+ (+ $0.045/GB) | +| Scenario | Time to connectivity | +|----------|---------------------| +| First workload in AZ (cold create) | ~15 seconds | +| NAT already running | Instant | +| Restart from stopped | ~12 seconds | -Key cost difference: this module **releases the EIP when idle**, avoiding the $3.60/mo public IPv4 charge. fck-nat keeps an EIP attached 24/7. +See [docs/PERFORMANCE.md](docs/PERFORMANCE.md) for detailed Lambda execution timings, instance type guidance, and cost breakdowns. -## Startup Latency +## Tested against real infrastructure -| Scenario | Time to Connectivity | -|----------|---------------------| -| First workload in AZ (cold create) | **~15 seconds** | -| NAT already running | **Instant** | -| Restart from stopped (after idle) | **~12 seconds** | +nat-zero isn't just unit-tested -- it's integration-tested against real AWS infrastructure on every PR. The test suite uses [Terratest](https://terratest.gruntwork.io/) to deploy the full module, launch workloads, verify NAT creation and connectivity, exercise scale-down and restart, then tear everything down cleanly. -The first workload instance in an AZ will not have internet access for approximately 15 seconds. Design startup scripts to retry outbound connections. Subsequent instances in the same AZ get connectivity immediately since the route table already points to the running NAT. +See [docs/TESTING.md](docs/TESTING.md) for phase-by-phase documentation. -See [docs/PERFORMANCE.md](docs/PERFORMANCE.md) for detailed timing breakdowns and instance type benchmarks. 
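The idle- and active-cost figures quoted in this README are easy to sanity-check. A minimal sketch, assuming illustrative us-east-1 rates -- the hourly prices and the 10 GB volume size below are assumptions for the arithmetic, not values read from this module:

```go
package main

import "fmt"

// Illustrative us-east-1 rates (assumptions, not quotes from this module).
const (
	hoursPerMonth    = 730.0
	t4gNanoHourly    = 0.0042 // t4g.nano on-demand, $/hr
	publicIPv4Hourly = 0.005  // AWS public IPv4 charge, $/hr (effective Feb 2024)
	gp3PerGBMonth    = 0.08   // gp3 EBS, $/GB-month
	volumeSizeGB     = 10.0
)

func main() {
	ebs := gp3PerGBMonth * volumeSizeGB        // ~ $0.80/mo: all nat-zero costs while idle
	eip := publicIPv4Hourly * hoursPerMonth    // ~ $3.65/mo: what an always-attached EIP adds
	instance := t4gNanoHourly * hoursPerMonth  // ~ $3.07/mo: t4g.nano running 24/7

	fmt.Printf("idle (nat-zero, EBS only):     $%.2f/mo\n", ebs)
	fmt.Printf("active / always-on fck-nat:    $%.2f/mo\n", instance+eip+ebs)
	fmt.Printf("public IPv4 charge alone:      $%.2f/mo\n", eip)
}
```

The instance-plus-EIP-plus-EBS total lands in the ~\$7-8/mo band the comparison tables cite, and the EIP term is the ~\$3.60/mo that nat-zero avoids entirely while idle.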
+## When to use this module -## Important Notes +| Use case | nat-zero | fck-nat | NAT Gateway | +|----------|----------|---------|-------------| +| Dev/staging with intermittent workloads | **Best fit** | Wasteful | Very wasteful | +| Production 24/7 workloads | Overkill | **Best fit** | Simplest | +| Cost-sensitive environments | **Best fit** | Good | Expensive | +| Simplicity priority | More moving parts | **Simpler** | Simplest | + +**Use nat-zero** when your private subnet workloads run intermittently and you want to pay nothing when idle. + +**Use fck-nat** when workloads run 24/7 and you want simplicity with ASG self-healing. + +**Use NAT Gateway** when you prioritize managed simplicity and availability over cost. + +## Important notes -- **EventBridge scope**: The EventBridge rule captures ALL EC2 state changes in the account. The Lambda filters events by VPC ID, so it only acts on instances in the target VPC. -- **EIP behavior**: An Elastic IP is allocated when a NAT instance starts and released when it stops. You are not charged for EIPs while the NAT instance is stopped. -- **fck-nat AMI**: By default, this module uses the public fck-nat AMI (`568608671756`). You can override this with `use_fck_nat_ami = false` and provide `custom_ami_owner` + `custom_ami_name_pattern`, or set `ami_id` directly. -- **Dual ENI**: Each AZ gets a pair of persistent ENIs (public + private). These survive instance stop/start cycles, preserving route table entries. -- **Dead Letter Queue**: Failed Lambda invocations are sent to an SQS DLQ for debugging. +- **EventBridge scope**: The rule captures all EC2 state changes in the account. The Lambda filters by VPC ID, so it only acts on instances in your target VPC. +- **Startup delay**: The first workload in an idle AZ waits ~15 seconds for internet. Design startup scripts to retry outbound connections -- most package managers already do. 
+- **Dual ENI**: Each AZ gets persistent public + private ENIs that survive instance stop/start cycles. +- **Dead letter queue**: Failed Lambda invocations go to an SQS DLQ for debugging. +- **Clean destroy**: A cleanup action terminates Lambda-created NAT instances before Terraform removes ENIs, ensuring clean `terraform destroy`. + ## Requirements | Name | Version | |------|---------| -| terraform | >= 1.3 | -| aws | >= 5.0 | -| archive | >= 2.0 | +| [terraform](#requirement\_terraform) | >= 1.3 | +| [aws](#requirement\_aws) | >= 5.0 | +| [null](#requirement\_null) | >= 3.0 | +| [time](#requirement\_time) | >= 0.9 | ## Providers | Name | Version | |------|---------| -| aws | >= 5.0 | -| archive | >= 2.0 | +| [aws](#provider\_aws) | >= 5.0 | +| [null](#provider\_null) | >= 3.0 | +| [time](#provider\_time) | >= 0.9 | + +## Modules + +No modules. ## Resources | Name | Type | |------|------| -| aws_cloudwatch_event_rule.ec2_state_change | resource | -| aws_cloudwatch_event_target.state_change_lambda_target | resource | -| aws_cloudwatch_log_group.nat_zero_logs | resource | -| aws_iam_instance_profile.nat_instance_profile | resource | -| aws_iam_role.lambda_iam_role | resource | -| aws_iam_role.nat_instance_role | resource | -| aws_iam_role_policy.lambda_iam_policy | resource | -| aws_iam_role_policy_attachment.lambda_basic_policy_attachment | resource | -| aws_iam_role_policy_attachment.ssm_policy_attachment | resource | -| aws_lambda_function.nat_zero | resource | -| aws_lambda_function_event_invoke_config.nat_zero_invoke_config | resource | -| aws_lambda_permission.allow_ec2_state_change_eventbridge | resource | -| aws_launch_template.nat_launch_template | resource | -| aws_network_interface.nat_private_network_interface | resource | -| aws_network_interface.nat_public_network_interface | resource | -| aws_route.nat_route | resource | -| aws_security_group.nat_security_group | resource | -| aws_sqs_queue.lambda_dlq | resource | -| archive_file.nat_zero | data 
source | +| [aws_cloudwatch_event_rule.ec2_state_change](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_event_rule) | resource | +| [aws_cloudwatch_event_target.state_change_lambda_target](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_event_target) | resource | +| [aws_cloudwatch_log_group.nat_zero_logs](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_log_group) | resource | +| [aws_iam_instance_profile.nat_instance_profile](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_instance_profile) | resource | +| [aws_iam_role.lambda_iam_role](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role) | resource | +| [aws_iam_role.nat_instance_role](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role) | resource | +| [aws_iam_role_policy.lambda_iam_policy](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy) | resource | +| [aws_iam_role_policy_attachment.ssm_policy_attachment](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy_attachment) | resource | +| [aws_lambda_function.nat_zero](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_function) | resource | +| [aws_lambda_function_event_invoke_config.nat_zero_invoke_config](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_function_event_invoke_config) | resource | +| [aws_lambda_invocation.cleanup](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_invocation) | resource | +| [aws_lambda_permission.allow_ec2_state_change_eventbridge](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_permission) | resource | +| 
[aws_launch_template.nat_launch_template](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/launch_template) | resource | +| [aws_network_interface.nat_private_network_interface](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/network_interface) | resource | +| [aws_network_interface.nat_public_network_interface](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/network_interface) | resource | +| [aws_route.nat_route](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/route) | resource | +| [aws_security_group.nat_security_group](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/security_group) | resource | +| [null_resource.build_lambda](https://registry.terraform.io/providers/hashicorp/null/latest/docs/resources/resource) | resource | +| [null_resource.download_lambda](https://registry.terraform.io/providers/hashicorp/null/latest/docs/resources/resource) | resource | +| [time_sleep.lambda_ready](https://registry.terraform.io/providers/hashicorp/time/latest/docs/resources/sleep) | resource | ## Inputs | Name | Description | Type | Default | Required | |------|-------------|------|---------|:--------:| -| name | Name prefix for all resources | `string` | n/a | yes | -| vpc_id | VPC ID where NAT instances will be deployed | `string` | n/a | yes | -| availability_zones | List of AZs to deploy NAT instances in | `list(string)` | n/a | yes | -| public_subnets | Public subnet IDs (one per AZ) | `list(string)` | n/a | yes | -| private_subnets | Private subnet IDs (one per AZ) | `list(string)` | n/a | yes | -| private_route_table_ids | Route table IDs for private subnets (one per AZ) | `list(string)` | n/a | yes | -| private_subnets_cidr_blocks | CIDR blocks for private subnets (one per AZ) | `list(string)` | n/a | yes | -| tags | Additional tags for all resources | `map(string)` | `{}` | no | -| instance_type | EC2 instance type for NAT 
instances | `string` | `"t4g.nano"` | no | -| market_type | `"spot"` or `"on-demand"` | `string` | `"on-demand"` | no | -| block_device_size | Root volume size in GB | `number` | `2` | no | -| use_fck_nat_ami | Use the public fck-nat AMI | `bool` | `true` | no | -| ami_id | Explicit AMI ID (overrides lookup) | `string` | `null` | no | -| custom_ami_owner | AMI owner account when not using fck-nat | `string` | `null` | no | -| custom_ami_name_pattern | AMI name pattern when not using fck-nat | `string` | `null` | no | -| nat_tag_key | Tag key to identify NAT instances | `string` | `"nat-zero:managed"` | no | -| nat_tag_value | Tag value to identify NAT instances | `string` | `"true"` | no | -| ignore_tag_key | Tag key to mark instances the Lambda should ignore | `string` | `"nat-zero:ignore"` | no | -| ignore_tag_value | Tag value to mark instances the Lambda should ignore | `string` | `"true"` | no | -| log_retention_days | CloudWatch log retention in days | `number` | `14` | no | +| [ami\_id](#input\_ami\_id) | Explicit AMI ID to use (overrides AMI lookup entirely) | `string` | `null` | no | +| [availability\_zones](#input\_availability\_zones) | List of availability zones to deploy NAT instances in | `list(string)` | n/a | yes | +| [block\_device\_size](#input\_block\_device\_size) | Size in GB of the root EBS volume | `number` | `10` | no | +| [build\_lambda\_locally](#input\_build\_lambda\_locally) | Build the Lambda binary from Go source instead of downloading a pre-compiled release. Requires Go and zip installed locally. 
| `bool` | `false` | no | +| [custom\_ami\_name\_pattern](#input\_custom\_ami\_name\_pattern) | AMI name pattern when use\_fck\_nat\_ami is false | `string` | `null` | no | +| [custom\_ami\_owner](#input\_custom\_ami\_owner) | AMI owner account ID when use\_fck\_nat\_ami is false | `string` | `null` | no | +| [enable\_logging](#input\_enable\_logging) | Create a CloudWatch log group for the Lambda function | `bool` | `true` | no | +| [ignore\_tag\_key](#input\_ignore\_tag\_key) | Tag key used to mark instances the Lambda should ignore | `string` | `"nat-zero:ignore"` | no | +| [ignore\_tag\_value](#input\_ignore\_tag\_value) | Tag value used to mark instances the Lambda should ignore | `string` | `"true"` | no | +| [instance\_type](#input\_instance\_type) | Instance type for the NAT instance | `string` | `"t4g.nano"` | no | +| [lambda\_binary\_url](#input\_lambda\_binary\_url) | URL to the pre-compiled Go Lambda zip. Updated automatically by CI. | `string` | `"https://github.com/MachineDotDev/nat-zero/releases/download/nat-zero-lambda-latest/lambda.zip"` | no | +| [lambda\_memory\_size](#input\_lambda\_memory\_size) | Memory allocated to the Lambda function in MB (also scales CPU proportionally) | `number` | `256` | no | +| [log\_retention\_days](#input\_log\_retention\_days) | CloudWatch log retention in days (only used when enable\_logging is true) | `number` | `14` | no | +| [market\_type](#input\_market\_type) | Whether to use spot or on-demand instances | `string` | `"on-demand"` | no | +| [name](#input\_name) | Name prefix for all resources created by this module | `string` | n/a | yes | +| [nat\_tag\_key](#input\_nat\_tag\_key) | Tag key used to identify NAT instances | `string` | `"nat-zero:managed"` | no | +| [nat\_tag\_value](#input\_nat\_tag\_value) | Tag value used to identify NAT instances | `string` | `"true"` | no | +| [private\_route\_table\_ids](#input\_private\_route\_table\_ids) | Route table IDs for the private subnets (one per AZ) | 
`list(string)` | n/a | yes | +| [private\_subnets](#input\_private\_subnets) | Private subnet IDs (one per AZ) for NAT instance private ENIs | `list(string)` | n/a | yes | +| [private\_subnets\_cidr\_blocks](#input\_private\_subnets\_cidr\_blocks) | CIDR blocks for the private subnets (one per AZ, used in security group rules) | `list(string)` | n/a | yes | +| [public\_subnets](#input\_public\_subnets) | Public subnet IDs (one per AZ) for NAT instance public ENIs | `list(string)` | n/a | yes | +| [tags](#input\_tags) | Additional tags to apply to all resources | `map(string)` | `{}` | no | +| [use\_fck\_nat\_ami](#input\_use\_fck\_nat\_ami) | Use the public fck-nat AMI. Set to false to use a custom AMI. | `bool` | `true` | no | +| [vpc\_id](#input\_vpc\_id) | The VPC ID where NAT instances will be deployed | `string` | n/a | yes | ## Outputs | Name | Description | |------|-------------| -| lambda_function_arn | ARN of the nat-zero Lambda function | -| lambda_function_name | Name of the nat-zero Lambda function | -| nat_security_group_ids | Security group IDs (one per AZ) | -| nat_public_eni_ids | Public ENI IDs (one per AZ) | -| nat_private_eni_ids | Private ENI IDs (one per AZ) | -| launch_template_ids | Launch template IDs (one per AZ) | -| eventbridge_rule_arn | ARN of the EventBridge rule | -| dlq_arn | ARN of the dead letter queue | +| [eventbridge\_rule\_arn](#output\_eventbridge\_rule\_arn) | ARN of the EventBridge rule capturing EC2 state changes | +| [lambda\_function\_arn](#output\_lambda\_function\_arn) | ARN of the nat-zero Lambda function | +| [lambda\_function\_name](#output\_lambda\_function\_name) | Name of the nat-zero Lambda function | +| [launch\_template\_ids](#output\_launch\_template\_ids) | Launch template IDs for NAT instances (one per AZ) | +| [nat\_private\_eni\_ids](#output\_nat\_private\_eni\_ids) | Private ENI IDs for NAT instances (one per AZ) | +| [nat\_public\_eni\_ids](#output\_nat\_public\_eni\_ids) | Public ENI IDs for NAT 
instances (one per AZ) | +| [nat\_security\_group\_ids](#output\_nat\_security\_group\_ids) | Security group IDs for NAT instances (one per AZ) | + ## Contributing diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md index 32ae603..31ab0af 100644 --- a/docs/ARCHITECTURE.md +++ b/docs/ARCHITECTURE.md @@ -2,7 +2,9 @@ ## High-Level Overview -The nat-zero module provides event-driven, scale-to-zero NAT instances for AWS. It uses EventBridge to capture EC2 instance state changes and a Lambda function to orchestrate the NAT instance lifecycle. +nat-zero takes a fundamentally different approach to NAT on AWS. Instead of running infrastructure around the clock, it treats NAT as a **reactive service**: infrastructure that exists only when something needs it. + +The module deploys an EventBridge rule that watches for EC2 state changes, and a Go Lambda that orchestrates NAT instance lifecycles in response. No polling, no cron jobs, no always-on compute -- just event-driven reactions to what's actually happening in your VPC. ``` DATA PLANE @@ -237,7 +239,7 @@ Key design decisions: ## Comparison with fck-nat -This module complements fck-nat by adding scale-to-zero capability. +nat-zero builds on top of fck-nat -- it uses the same AMI and the same iptables-based NAT approach. The difference is the orchestration layer: instead of an always-on ASG, nat-zero uses event-driven Lambda to start and stop NAT instances on demand. ``` fck-nat (Always-On) nat-zero (Scale-to-Zero) @@ -268,14 +270,14 @@ This module complements fck-nat by adding scale-to-zero capability. └────────────────────────────────┘ ``` -Costs per AZ, per month. Includes the [AWS public IPv4 charge](https://aws.amazon.com/blogs/aws/new-aws-public-ipv4-address-charge-public-ip-insights/) ($3.60/mo per public IP, effective Feb 2024). +Costs per AZ, per month. 
Includes the [AWS public IPv4 charge](https://aws.amazon.com/blogs/aws/new-aws-public-ipv4-address-charge-public-ip-insights/) (\$3.60/mo per public IP, effective Feb 2024). | Aspect | fck-nat | nat-zero | |--------|---------|-------------------| | Architecture | ASG with min=1 | Lambda + EventBridge | -| Idle cost | ~$7-8/mo (instance + EIP 24/7) | ~$0.80/mo (EBS only, no EIP) | -| Active cost | ~$7-8/mo | ~$7-8/mo (same) | -| Public IPv4 charge | $3.60/mo always | $0 when idle (EIP released) | +| Idle cost | ~\$7-8/mo (instance + EIP 24/7) | ~\$0.80/mo (EBS only, no EIP) | +| Active cost | ~\$7-8/mo | ~\$7-8/mo (same) | +| Public IPv4 charge | \$3.60/mo always | \$0 when idle (EIP released) | | Scale-to-zero | No | Yes | | Self-healing | ASG replaces unhealthy | Lambda creates new on demand | | AMI | fck-nat AMI | fck-nat AMI (same) | diff --git a/docs/EXAMPLES.md b/docs/EXAMPLES.md index 8d74007..d182de4 100644 --- a/docs/EXAMPLES.md +++ b/docs/EXAMPLES.md @@ -2,7 +2,7 @@ ## Basic Usage -A complete working example that creates a VPC with public and private subnets, then deploys nat-zero to provide scale-to-zero NAT for the private subnets. +The simplest way to get started: create a VPC with public and private subnets, then drop in nat-zero. Your private subnets get internet access when workloads are running, and you pay nothing when they're not. ```hcl terraform { diff --git a/docs/INDEX.md b/docs/INDEX.md index e342ec1..1666240 100644 --- a/docs/INDEX.md +++ b/docs/INDEX.md @@ -1,6 +1,10 @@ # nat-zero -Scale-to-zero NAT instances for AWS. Uses [fck-nat](https://fck-nat.dev/) AMIs. Zero cost when idle. +**Scale-to-zero NAT instances for AWS.** Stop paying for NAT when nothing is running. + +nat-zero is a Terraform module that brings event-driven, scale-to-zero NAT to your AWS VPCs. When a workload starts in a private subnet, a NAT instance spins up automatically. When the last workload stops, the NAT shuts down and its Elastic IP is released. 
You pay nothing while idle -- just ~\$0.80/mo for a stopped EBS volume. + +Built on [fck-nat](https://fck-nat.dev/) AMIs. Orchestrated by a Go Lambda with a 55 ms cold start. Proven by real integration tests that deploy infrastructure and verify connectivity end-to-end. ``` CONTROL PLANE @@ -26,35 +30,42 @@ Scale-to-zero NAT instances for AWS. Uses [fck-nat](https://fck-nat.dev/) AMIs. └──────────────────┘ ``` -## How It Works +## Why nat-zero? -An EventBridge rule captures all EC2 instance state changes. A Lambda function evaluates each event and manages NAT instance lifecycle per-AZ: +AWS NAT Gateway costs a minimum of ~\$36/month per AZ -- even if nothing is using it. fck-nat brings that down to ~\$7-8/month, but the instance and its public IP still run 24/7. -- **Workload starts** in a private subnet → Lambda starts (or creates) a NAT instance in the same AZ and attaches an Elastic IP -- **Last workload stops** in an AZ → Lambda stops the NAT instance and releases the Elastic IP -- **NAT instance starts** → Lambda attaches an EIP to the public ENI -- **NAT instance stops** → Lambda detaches and releases the EIP +**nat-zero takes it further.** When your private subnets are idle, there's no NAT instance running and no Elastic IP allocated. Your cost drops to the price of a stopped 2 GB EBS volume: about 80 cents a month. -Each NAT instance uses dual ENIs (public + private) pre-created by Terraform. Traffic from private subnets routes through the private ENI, gets masqueraded via iptables, and exits through the public ENI with an Elastic IP. +This matters most for: -See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for detailed diagrams, [docs/PERFORMANCE.md](docs/PERFORMANCE.md) for timing and cost data, and [docs/TEST.md](docs/TEST.md) for integration test documentation. 
+- **Dev and staging environments** that sit idle nights and weekends +- **CI/CD runners** that spin up for minutes, then disappear for hours +- **Batch and cron workloads** that run periodically +- **Side projects** where every dollar counts -## When To Use This Module +### Cost comparison (per AZ, per month) -| Use Case | This Module | fck-nat | NAT Gateway | -|---|---|---|---| -| Dev/staging with intermittent workloads | **Best fit** | Wasteful | Very wasteful | -| Production 24/7 workloads | Overkill | **Best fit** | Simplest | -| Cost-obsessive environments | **Best fit** | Good | Expensive | -| Simplicity priority | More moving parts | **Simpler** | Simplest | +| State | nat-zero | fck-nat | NAT Gateway | +|-------|----------|---------|-------------| +| **Idle** (no workloads) | **~\$0.80** | ~\$7-8 | ~\$36+ | +| **Active** (workloads running) | ~\$7-8 | ~\$7-8 | ~\$36+ | -**Use this module** when your private subnet workloads run intermittently (CI/CD, dev environments, batch jobs) and you want to pay nothing when idle. +The key: nat-zero **releases the Elastic IP when idle**, avoiding the [\$3.60/month public IPv4 charge](https://aws.amazon.com/blogs/aws/new-aws-public-ipv4-address-charge-public-ip-insights/) that fck-nat and NAT Gateway pay around the clock. -**Use fck-nat** when workloads run 24/7 and you want simplicity with ASG self-healing. +## How it works + +An EventBridge rule watches for EC2 instance state changes in your VPC. A Lambda function reacts to each event: -**Use NAT Gateway** when you prioritize simplicity and availability over cost. 
+- **Workload starts** in a private subnet -- Lambda creates (or restarts) a NAT instance in that AZ and attaches an Elastic IP +- **Last workload stops** in an AZ -- Lambda stops the NAT instance and releases the Elastic IP +- **NAT instance reaches "running"** -- Lambda attaches an EIP to the public ENI +- **NAT instance reaches "stopped"** -- Lambda detaches and releases the EIP -## Usage +Each NAT instance uses two persistent ENIs (public + private) pre-created by Terraform. They survive stop/start cycles, so route tables stay intact and there's no need to reconfigure anything when a NAT comes back. + +See [Architecture](ARCHITECTURE.md) for detailed event flows and sequence diagrams. + +## Quick start ```hcl module "nat_zero" { @@ -75,115 +86,58 @@ module "nat_zero" { } ``` -See [`examples/basic/`](examples/basic/) for a complete working example. +See [Examples](EXAMPLES.md) for complete working configurations including spot instances, custom AMIs, and building from source. + +## Performance -## Cost Estimate +The orchestrator Lambda is written in Go and compiled to a native ARM64 binary. It was rewritten from Python to eliminate cold start overhead -- init latency dropped from 667 ms to 55 ms, a **90% improvement**. Peak memory usage went from 98 MB down to 30 MB. -Per AZ, per month. Accounts for the [AWS public IPv4 charge](https://aws.amazon.com/blogs/aws/new-aws-public-ipv4-address-charge-public-ip-insights/) ($0.005/hr per public IP, effective Feb 2024). 
+| Scenario | Time to connectivity | +|----------|---------------------| +| First workload in AZ (cold create) | ~15 seconds | +| NAT already running | Instant | +| Restart from stopped | ~12 seconds | -| State | This Module | fck-nat | NAT Gateway | -|-------|------------|---------|-------------| -| **Idle** (no workloads) | **~$0.80** (EBS only) | ~$7-8 (instance + EIP) | ~$36+ ($32 gw + $3.60 IP) | -| **Active** (workloads running) | ~$7-8 (instance + EBS + EIP) | ~$7-8 (same) | ~$36+ (+ $0.045/GB) | +The ~15 second cold-create time is dominated by EC2 instance boot and fck-nat AMI configuration -- not the Lambda. Subsequent workloads in the same AZ get connectivity immediately since the route table already points to the running NAT. -Key cost difference: this module **releases the EIP when idle**, avoiding the $3.60/mo public IPv4 charge. fck-nat keeps an EIP attached 24/7. +See [Performance](PERFORMANCE.md) for detailed Lambda execution timings, instance type guidance, and cost breakdowns. -## Startup Latency +## Tested against real infrastructure -| Scenario | Time to Connectivity | -|----------|---------------------| -| First workload in AZ (cold create) | **~15 seconds** | -| NAT already running | **Instant** | -| Restart from stopped (after idle) | **~12 seconds** | - -The first workload instance in an AZ will not have internet access for approximately 15 seconds. Design startup scripts to retry outbound connections. Subsequent instances in the same AZ get connectivity immediately since the route table already points to the running NAT. - -See [docs/PERFORMANCE.md](docs/PERFORMANCE.md) for detailed timing breakdowns and instance type benchmarks. - -## Important Notes - -- **EventBridge scope**: The EventBridge rule captures ALL EC2 state changes in the account. The Lambda filters events by VPC ID, so it only acts on instances in the target VPC. -- **EIP behavior**: An Elastic IP is allocated when a NAT instance starts and released when it stops. 
You are not charged for EIPs while the NAT instance is stopped. -- **fck-nat AMI**: By default, this module uses the public fck-nat AMI (`568608671756`). You can override this with `use_fck_nat_ami = false` and provide `custom_ami_owner` + `custom_ami_name_pattern`, or set `ami_id` directly. -- **Dual ENI**: Each AZ gets a pair of persistent ENIs (public + private). These survive instance stop/start cycles, preserving route table entries. -- **Dead Letter Queue**: Failed Lambda invocations are sent to an SQS DLQ for debugging. - -## Requirements - -| Name | Version | -|------|---------| -| terraform | >= 1.3 | -| aws | >= 5.0 | -| archive | >= 2.0 | - -## Providers - -| Name | Version | -|------|---------| -| aws | >= 5.0 | -| archive | >= 2.0 | - -## Resources - -| Name | Type | -|------|------| -| aws_cloudwatch_event_rule.ec2_state_change | resource | -| aws_cloudwatch_event_target.state_change_lambda_target | resource | -| aws_cloudwatch_log_group.nat_zero_logs | resource | -| aws_iam_instance_profile.nat_instance_profile | resource | -| aws_iam_role.lambda_iam_role | resource | -| aws_iam_role.nat_instance_role | resource | -| aws_iam_role_policy.lambda_iam_policy | resource | -| aws_iam_role_policy_attachment.lambda_basic_policy_attachment | resource | -| aws_iam_role_policy_attachment.ssm_policy_attachment | resource | -| aws_lambda_function.nat_zero | resource | -| aws_lambda_function_event_invoke_config.nat_zero_invoke_config | resource | -| aws_lambda_permission.allow_ec2_state_change_eventbridge | resource | -| aws_launch_template.nat_launch_template | resource | -| aws_network_interface.nat_private_network_interface | resource | -| aws_network_interface.nat_public_network_interface | resource | -| aws_route.nat_route | resource | -| aws_security_group.nat_security_group | resource | -| aws_sqs_queue.lambda_dlq | resource | -| archive_file.nat_zero | data source | - -## Inputs - -| Name | Description | Type | Default | Required | 
-|------|-------------|------|---------|:--------:| -| name | Name prefix for all resources | `string` | n/a | yes | -| vpc_id | VPC ID where NAT instances will be deployed | `string` | n/a | yes | -| availability_zones | List of AZs to deploy NAT instances in | `list(string)` | n/a | yes | -| public_subnets | Public subnet IDs (one per AZ) | `list(string)` | n/a | yes | -| private_subnets | Private subnet IDs (one per AZ) | `list(string)` | n/a | yes | -| private_route_table_ids | Route table IDs for private subnets (one per AZ) | `list(string)` | n/a | yes | -| private_subnets_cidr_blocks | CIDR blocks for private subnets (one per AZ) | `list(string)` | n/a | yes | -| tags | Additional tags for all resources | `map(string)` | `{}` | no | -| instance_type | EC2 instance type for NAT instances | `string` | `"t4g.nano"` | no | -| market_type | `"spot"` or `"on-demand"` | `string` | `"on-demand"` | no | -| block_device_size | Root volume size in GB | `number` | `2` | no | -| use_fck_nat_ami | Use the public fck-nat AMI | `bool` | `true` | no | -| ami_id | Explicit AMI ID (overrides lookup) | `string` | `null` | no | -| custom_ami_owner | AMI owner account when not using fck-nat | `string` | `null` | no | -| custom_ami_name_pattern | AMI name pattern when not using fck-nat | `string` | `null` | no | -| nat_tag_key | Tag key to identify NAT instances | `string` | `"nat-zero:managed"` | no | -| nat_tag_value | Tag value to identify NAT instances | `string` | `"true"` | no | -| ignore_tag_key | Tag key to mark instances the Lambda should ignore | `string` | `"nat-zero:ignore"` | no | -| ignore_tag_value | Tag value to mark instances the Lambda should ignore | `string` | `"true"` | no | -| log_retention_days | CloudWatch log retention in days | `number` | `14` | no | - -## Outputs - -| Name | Description | -|------|-------------| -| lambda_function_arn | ARN of the nat-zero Lambda function | -| lambda_function_name | Name of the nat-zero Lambda function | -| 
nat_security_group_ids | Security group IDs (one per AZ) | -| nat_public_eni_ids | Public ENI IDs (one per AZ) | -| nat_private_eni_ids | Private ENI IDs (one per AZ) | -| launch_template_ids | Launch template IDs (one per AZ) | -| eventbridge_rule_arn | ARN of the EventBridge rule | -| dlq_arn | ARN of the dead letter queue | +nat-zero isn't just unit-tested -- it's integration-tested against real AWS infrastructure in CI (on PRs labeled `integration-test`). The test suite uses [Terratest](https://terratest.gruntwork.io/) to: + +1. Deploy the full module (Lambda, EventBridge, ENIs, security groups, launch templates) +2. Launch a workload instance and verify NAT creation with EIP +3. Verify the workload's egress IP matches the NAT's Elastic IP +4. Terminate the workload and verify NAT scale-down and EIP release +5. Launch a new workload and verify NAT restart +6. Run the cleanup action and verify all resources are removed +7. Tear down everything with `terraform destroy` + +The full lifecycle takes about 5 minutes in CI. See [Testing](TESTING.md) for phase-by-phase documentation. + +## When to use this module + +| Use case | nat-zero | fck-nat | NAT Gateway | +|----------|----------|---------|-------------| +| Dev/staging with intermittent workloads | **Best fit** | Wasteful | Very wasteful | +| Production 24/7 workloads | Overkill | **Best fit** | Simplest | +| Cost-sensitive environments | **Best fit** | Good | Expensive | +| Simplicity priority | More moving parts | **Simpler** | Simplest | + +**Use nat-zero** when your private subnet workloads run intermittently and you want to pay nothing when idle. + +**Use fck-nat** when workloads run 24/7 and you want simplicity with ASG self-healing. + +**Use NAT Gateway** when you prioritize managed simplicity and availability over cost. + +## Important notes + +- **EventBridge scope**: The rule captures all EC2 state changes in the account. The Lambda filters by VPC ID, so it only acts on instances in your target VPC.
+- **Startup delay**: The first workload in an idle AZ waits ~15 seconds for internet. Design startup scripts to retry outbound connections -- most package managers already do. +- **Dual ENI**: Each AZ gets persistent public + private ENIs that survive instance stop/start cycles. +- **Dead letter queue**: Failed Lambda invocations go to an SQS DLQ for debugging. +- **Clean destroy**: A cleanup action terminates Lambda-created NAT instances before Terraform removes ENIs, ensuring clean `terraform destroy`. ## Contributing diff --git a/docs/PERFORMANCE.md b/docs/PERFORMANCE.md index 2a757e1..9280d50 100644 --- a/docs/PERFORMANCE.md +++ b/docs/PERFORMANCE.md @@ -1,6 +1,6 @@ # Performance and Cost -Startup latency, operational timing, instance type guidance, and cost comparisons for the nat-zero module. All measurements from integration tests running in us-east-1 with `t4g.nano` instances. +nat-zero's orchestrator Lambda was rewritten from Python 3.11 to Go, compiled to a native ARM64 binary running on the `provided.al2023` runtime. The result: **90% faster cold starts**, 69% less memory, and faster end-to-end execution. All measurements below are from real integration tests running in us-east-1 with `t4g.nano` instances. ## Startup Latency @@ -95,9 +95,11 @@ The Lambda is a compiled Go binary on the `provided.al2023` runtime with 256 MB | **attachEIP handler total** | **~0.5 s** | classify + waitForState + attachEIP | | **detachEIP handler total** | **~0.5 s** | classify + waitForState + detachEIP | -### Comparison with Python Lambda +### Why Go? -The previous Python implementation used the `python3.11` runtime with 128 MB memory. +The original Lambda was written in Python 3.11. It worked, but Python's interpreter overhead meant a 667 ms cold start and 98 MB memory footprint -- meaningful for a function that might be invoked dozens of times during a busy scaling period. 
+ +Rewriting in Go and compiling to a native binary eliminated the interpreter entirely: | Metric | Python 3.11 (128 MB) | Go (256 MB) | Improvement | |--------|----------------------|-------------|-------------| @@ -105,6 +107,8 @@ The previous Python implementation used the `python3.11` runtime with 128 MB mem | Handler total (scale-up) | 2,439 ms | ~2,000 ms | **~18% faster** | | Max memory used | 98 MB | 30 MB | **69% less** | +The Go binary is ~4 MB, boots in under 70 ms, and the entire scale-up path completes in about 2 seconds. For a Lambda that runs on every EC2 state change in your account, that matters. + ## What This Means for Your Workloads - **First workload takes ~15 seconds to get internet.** Design startup scripts to retry outbound connections (e.g. `apt update`, `pip install`, `curl`). Most package managers already retry. @@ -115,29 +119,29 @@ The previous Python implementation used the `python3.11` runtime with 128 MB mem ## Cost -Per AZ, per month. All prices are us-east-1 on-demand. Includes the [AWS public IPv4 charge](https://aws.amazon.com/blogs/aws/new-aws-public-ipv4-address-charge-public-ip-insights/) ($0.005/hr per public IP). +Per AZ, per month. All prices are us-east-1 on-demand. Includes the [AWS public IPv4 charge](https://aws.amazon.com/blogs/aws/new-aws-public-ipv4-address-charge-public-ip-insights/) (\$0.005/hr per public IP). ### Idle vs active | State | nat-zero | fck-nat | NAT Gateway | |-------|----------|---------|-------------| -| **Idle** (no workloads) | **~$0.80** | ~$7-8 | ~$36+ | -| **Active** (workloads running) | ~$7-8 | ~$7-8 | ~$36+ | +| **Idle** (no workloads) | **~\$0.80** | ~\$7-8 | ~\$36+ | +| **Active** (workloads running) | ~\$7-8 | ~\$7-8 | ~\$36+ | -**Idle breakdown**: EBS volume only (~$0.80/mo for 2 GB gp3). No instance running, no EIP allocated. +**Idle breakdown**: EBS volume only (~\$0.80/mo for 10 GB gp3). No instance running, no EIP allocated.
-**Active breakdown**: t4g.nano instance ($3.07/mo) + EIP ($3.60/mo) + EBS ($0.80/mo) = ~$7.50/mo. +**Active breakdown**: t4g.nano instance (\$3.07/mo) + EIP (\$3.60/mo) + EBS (\$0.80/mo) = ~\$7.50/mo. -The key difference: nat-zero **releases the EIP when idle**, saving the $3.60/mo public IPv4 charge that fck-nat and NAT Gateway pay 24/7. +The key difference: nat-zero **releases the EIP when idle**, saving the \$3.60/mo public IPv4 charge that fck-nat and NAT Gateway pay 24/7. ### Instance type options -| Instance Type | vCPUs | RAM | Network | $/hour | $/month (24x7) | $/month (12hr/day) | +| Instance Type | vCPUs | RAM | Network | \$/hour | \$/month (24x7) | \$/month (12hr/day) | |---------------|-------|-----|---------|--------|---------------|-------------------| -| **t4g.nano** (default) | 2 | 0.5 GiB | Up to 5 Gbps | $0.0042 | $3.07 | $1.53 | -| t4g.micro | 2 | 1 GiB | Up to 5 Gbps | $0.0084 | $6.13 | $3.07 | -| t4g.small | 2 | 2 GiB | Up to 5 Gbps | $0.0168 | $12.26 | $6.13 | -| c7gn.medium | 1 | 2 GiB | Up to 25 Gbps | $0.0624 | $45.55 | $22.78 | +| **t4g.nano** (default) | 2 | 0.5 GiB | Up to 5 Gbps | \$0.0042 | \$3.07 | \$1.53 | +| t4g.micro | 2 | 1 GiB | Up to 5 Gbps | \$0.0084 | \$6.13 | \$3.07 | +| t4g.small | 2 | 2 GiB | Up to 5 Gbps | \$0.0168 | \$12.26 | \$6.13 | +| c7gn.medium | 1 | 2 GiB | Up to 25 Gbps | \$0.0624 | \$45.55 | \$22.78 | Spot pricing typically offers 60-70% savings on t4g instances. Use `market_type = "spot"` to enable. @@ -146,10 +150,10 @@ Spot pricing typically offers 60-70% savings on t4g instances. Use `market_type **t4g.nano** (default) is right for most workloads: - Handles typical dev/staging NAT traffic - Burstable up to 5 Gbps with CPU credits -- $3/month on-demand, ~$1/month on spot +- \$3/month on-demand, ~\$1/month on spot **t4g.micro / t4g.small** — consider if you need sustained throughput beyond t4g.nano's baseline or workloads transfer large volumes consistently. 
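Moving between these sizes, or onto spot, is a configuration-only change. A hedged sketch -- the module `source` path is an assumption, and the required name/VPC/subnet inputs are elided here:

```hcl
# Sketch only: "source" is an assumed path; the required networking
# arguments are elided -- see the Quick start for the full call.
module "nat_zero" {
  source = "github.com/MachineDotDev/nat-zero" # assumption

  # ... name, vpc_id, subnet and route table inputs as in the Quick start ...

  instance_type = "t4g.small" # sustained throughput beyond t4g.nano's baseline
  market_type   = "spot"      # typically 60-70% cheaper on t4g instances
}
```

Per the inputs table, `market_type` accepts `"spot"` or `"on-demand"`; everything else about the module call stays the same.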
-**c7gn.medium** — consider if you need consistently high network throughput (up to 25 Gbps). At $45/month it's still cheaper than NAT Gateway for most data transfer patterns. +**c7gn.medium** — consider if you need consistently high network throughput (up to 25 Gbps). At \$45/month it's still cheaper than NAT Gateway for most data transfer patterns. Instance type does **not** affect startup time (~12 s regardless), only maximum sustained throughput and monthly cost. diff --git a/docs/README.md b/docs/README.md deleted file mode 100644 index 30791e4..0000000 --- a/docs/README.md +++ /dev/null @@ -1,86 +0,0 @@ -## Requirements - -| Name | Version | -|------|---------| -| [terraform](#requirement\_terraform) | >= 1.3 | -| [aws](#requirement\_aws) | >= 5.0 | -| [null](#requirement\_null) | >= 3.0 | -| [time](#requirement\_time) | >= 0.9 | - -## Providers - -| Name | Version | -|------|---------| -| [aws](#provider\_aws) | >= 5.0 | -| [null](#provider\_null) | >= 3.0 | -| [time](#provider\_time) | >= 0.9 | - -## Modules - -No modules. 
- -## Resources - -| Name | Type | -|------|------| -| [aws_cloudwatch_event_rule.ec2_state_change](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_event_rule) | resource | -| [aws_cloudwatch_event_target.state_change_lambda_target](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_event_target) | resource | -| [aws_cloudwatch_log_group.nat_zero_logs](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_log_group) | resource | -| [aws_iam_instance_profile.nat_instance_profile](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_instance_profile) | resource | -| [aws_iam_role.lambda_iam_role](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role) | resource | -| [aws_iam_role.nat_instance_role](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role) | resource | -| [aws_iam_role_policy.lambda_iam_policy](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy) | resource | -| [aws_iam_role_policy_attachment.ssm_policy_attachment](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy_attachment) | resource | -| [aws_lambda_function.nat_zero](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_function) | resource | -| [aws_lambda_function_event_invoke_config.nat_zero_invoke_config](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_function_event_invoke_config) | resource | -| [aws_lambda_invocation.cleanup](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_invocation) | resource | -| [aws_lambda_permission.allow_ec2_state_change_eventbridge](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_permission) | resource | -| 
[aws_launch_template.nat_launch_template](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/launch_template) | resource | -| [aws_network_interface.nat_private_network_interface](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/network_interface) | resource | -| [aws_network_interface.nat_public_network_interface](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/network_interface) | resource | -| [aws_route.nat_route](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/route) | resource | -| [aws_security_group.nat_security_group](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/security_group) | resource | -| [null_resource.build_lambda](https://registry.terraform.io/providers/hashicorp/null/latest/docs/resources/resource) | resource | -| [null_resource.download_lambda](https://registry.terraform.io/providers/hashicorp/null/latest/docs/resources/resource) | resource | -| [time_sleep.lambda_ready](https://registry.terraform.io/providers/hashicorp/time/latest/docs/resources/sleep) | resource | - -## Inputs - -| Name | Description | Type | Default | Required | -|------|-------------|------|---------|:--------:| -| [ami\_id](#input\_ami\_id) | Explicit AMI ID to use (overrides AMI lookup entirely) | `string` | `null` | no | -| [availability\_zones](#input\_availability\_zones) | List of availability zones to deploy NAT instances in | `list(string)` | n/a | yes | -| [block\_device\_size](#input\_block\_device\_size) | Size in GB of the root EBS volume | `number` | `10` | no | -| [build\_lambda\_locally](#input\_build\_lambda\_locally) | Build the Lambda binary from Go source instead of downloading a pre-compiled release. Requires Go and zip installed locally. 
| `bool` | `false` | no | -| [custom\_ami\_name\_pattern](#input\_custom\_ami\_name\_pattern) | AMI name pattern when use\_fck\_nat\_ami is false | `string` | `null` | no | -| [custom\_ami\_owner](#input\_custom\_ami\_owner) | AMI owner account ID when use\_fck\_nat\_ami is false | `string` | `null` | no | -| [enable\_logging](#input\_enable\_logging) | Create a CloudWatch log group for the Lambda function | `bool` | `true` | no | -| [ignore\_tag\_key](#input\_ignore\_tag\_key) | Tag key used to mark instances the Lambda should ignore | `string` | `"nat-zero:ignore"` | no | -| [ignore\_tag\_value](#input\_ignore\_tag\_value) | Tag value used to mark instances the Lambda should ignore | `string` | `"true"` | no | -| [instance\_type](#input\_instance\_type) | Instance type for the NAT instance | `string` | `"t4g.nano"` | no | -| [lambda\_binary\_url](#input\_lambda\_binary\_url) | URL to the pre-compiled Go Lambda zip. Updated automatically by CI. | `string` | `"https://github.com/MachineDotDev/nat-zero/releases/download/nat-zero-lambda-latest/lambda.zip"` | no | -| [lambda\_memory\_size](#input\_lambda\_memory\_size) | Memory allocated to the Lambda function in MB (also scales CPU proportionally) | `number` | `256` | no | -| [log\_retention\_days](#input\_log\_retention\_days) | CloudWatch log retention in days (only used when enable\_logging is true) | `number` | `14` | no | -| [market\_type](#input\_market\_type) | Whether to use spot or on-demand instances | `string` | `"on-demand"` | no | -| [name](#input\_name) | Name prefix for all resources created by this module | `string` | n/a | yes | -| [nat\_tag\_key](#input\_nat\_tag\_key) | Tag key used to identify NAT instances | `string` | `"nat-zero:managed"` | no | -| [nat\_tag\_value](#input\_nat\_tag\_value) | Tag value used to identify NAT instances | `string` | `"true"` | no | -| [private\_route\_table\_ids](#input\_private\_route\_table\_ids) | Route table IDs for the private subnets (one per AZ) | 
`list(string)` | n/a | yes | -| [private\_subnets](#input\_private\_subnets) | Private subnet IDs (one per AZ) for NAT instance private ENIs | `list(string)` | n/a | yes | -| [private\_subnets\_cidr\_blocks](#input\_private\_subnets\_cidr\_blocks) | CIDR blocks for the private subnets (one per AZ, used in security group rules) | `list(string)` | n/a | yes | -| [public\_subnets](#input\_public\_subnets) | Public subnet IDs (one per AZ) for NAT instance public ENIs | `list(string)` | n/a | yes | -| [tags](#input\_tags) | Additional tags to apply to all resources | `map(string)` | `{}` | no | -| [use\_fck\_nat\_ami](#input\_use\_fck\_nat\_ami) | Use the public fck-nat AMI. Set to false to use a custom AMI. | `bool` | `true` | no | -| [vpc\_id](#input\_vpc\_id) | The VPC ID where NAT instances will be deployed | `string` | n/a | yes | - -## Outputs - -| Name | Description | -|------|-------------| -| [eventbridge\_rule\_arn](#output\_eventbridge\_rule\_arn) | ARN of the EventBridge rule capturing EC2 state changes | -| [lambda\_function\_arn](#output\_lambda\_function\_arn) | ARN of the nat-zero Lambda function | -| [lambda\_function\_name](#output\_lambda\_function\_name) | Name of the nat-zero Lambda function | -| [launch\_template\_ids](#output\_launch\_template\_ids) | Launch template IDs for NAT instances (one per AZ) | -| [nat\_private\_eni\_ids](#output\_nat\_private\_eni\_ids) | Private ENI IDs for NAT instances (one per AZ) | -| [nat\_public\_eni\_ids](#output\_nat\_public\_eni\_ids) | Public ENI IDs for NAT instances (one per AZ) | -| [nat\_security\_group\_ids](#output\_nat\_security\_group\_ids) | Security group IDs for NAT instances (one per AZ) | diff --git a/docs/REFERENCE.md b/docs/REFERENCE.md index 0cd14ab..30791e4 100644 --- a/docs/REFERENCE.md +++ b/docs/REFERENCE.md @@ -58,7 +58,7 @@ No modules. 
| [ignore\_tag\_key](#input\_ignore\_tag\_key) | Tag key used to mark instances the Lambda should ignore | `string` | `"nat-zero:ignore"` | no | | [ignore\_tag\_value](#input\_ignore\_tag\_value) | Tag value used to mark instances the Lambda should ignore | `string` | `"true"` | no | | [instance\_type](#input\_instance\_type) | Instance type for the NAT instance | `string` | `"t4g.nano"` | no | -| [lambda\_binary\_url](#input\_lambda\_binary\_url) | URL to the pre-compiled Go Lambda zip. Updated automatically by CI. | `string` | `"https://github.com/MachineDotDev/terraform-modules/releases/download/nat-zero-lambda-latest/lambda.zip"` | no | +| [lambda\_binary\_url](#input\_lambda\_binary\_url) | URL to the pre-compiled Go Lambda zip. Updated automatically by CI. | `string` | `"https://github.com/MachineDotDev/nat-zero/releases/download/nat-zero-lambda-latest/lambda.zip"` | no | | [lambda\_memory\_size](#input\_lambda\_memory\_size) | Memory allocated to the Lambda function in MB (also scales CPU proportionally) | `number` | `256` | no | | [log\_retention\_days](#input\_log\_retention\_days) | CloudWatch log retention in days (only used when enable\_logging is true) | `number` | `14` | no | | [market\_type](#input\_market\_type) | Whether to use spot or on-demand instances | `string` | `"on-demand"` | no | diff --git a/SECURITY.md b/docs/SECURITY.md similarity index 100% rename from SECURITY.md rename to docs/SECURITY.md diff --git a/docs/TESTING.md b/docs/TESTING.md index 790ad8b..61d1f28 100644 --- a/docs/TESTING.md +++ b/docs/TESTING.md @@ -1,8 +1,8 @@ # Integration Tests -The integration tests live in `tests/integration/` and use [Terratest](https://terratest.gruntwork.io/) (Go) to deploy real AWS infrastructure, exercise the Lambda, and tear it down. +nat-zero is tested against real AWS infrastructure, not mocks. 
The integration test suite deploys the full module into a live AWS account, launches actual EC2 workloads, verifies that NAT instances come up with working internet connectivity, exercises scale-down and restart, then tears everything down cleanly. -They run in CI via the `terratest` GitHub Actions job against `us-east-1`. +These tests run in CI on every PR (triggered by adding the `integration-test` label) and take about 5 minutes end-to-end. They use [Terratest](https://terratest.gruntwork.io/) (Go) and run against `us-east-1`. ## Test Fixture diff --git a/docs/WORKFLOWS.md b/docs/WORKFLOWS.md new file mode 100644 index 0000000..333c289 --- /dev/null +++ b/docs/WORKFLOWS.md @@ -0,0 +1,141 @@ +# CI/CD Workflows + +Internal reference for GitHub Actions workflows, repo rulesets, and the release process. This file is not published to the docs site. + +## Workflows Overview + +| Workflow | File | Triggers | Required Check | +|----------|------|----------|----------------| +| Pre-commit | `precommit.yml` | All PRs; push to `main` (filtered paths) | `precommit` | +| Go Tests | `go-tests.yml` | PRs touching `cmd/lambda/**`; push to `main` | `go-test` | +| Integration Tests | `integration-tests.yml` | PR labeled `integration-test`; manual dispatch | `integration-test` | +| Docs | `docs.yml` | Push to `main` (filtered paths) | No (post-merge deploy) | +| Release | `release-please.yml` | Push to `main`; manual dispatch | No (post-merge) | + +## Pre-commit (`precommit.yml`) + +Runs the repo's `.pre-commit-config.yaml` hooks: terraform fmt, tflint, terraform-docs, Go staticcheck, etc. + +- **PR trigger**: All pull requests, all paths (no path filter). +- **Push trigger**: Only on `main`, only when `*.tf`, `cmd/lambda/**`, `.pre-commit-config.yaml`, or `.terraform-docs.yml` change. +- **Job name**: `precommit` (required status check for merge). + +## Go Tests (`go-tests.yml`) + +Runs `go test -v -race ./...` in `cmd/lambda/` (Lambda unit tests). 
+ +- **PR trigger**: Only when `cmd/lambda/**` changes. +- **Push trigger**: Only on `main`, same path filter. +- **Job name**: `go-test` (required status check for merge). +- **Note**: Path-filtered. If a PR doesn't touch Go code, this check won't run and won't block merge (see ruleset notes below). + +## Integration Tests (`integration-tests.yml`) + +Full end-to-end test: deploys real AWS infrastructure via Terratest, exercises the Lambda lifecycle (create NAT, scale-down, restart, cleanup), then destroys everything. + +- **PR trigger**: `labeled` type only. Runs when the `integration-test` label is added. +- **Manual trigger**: `workflow_dispatch`. +- **Condition**: `github.event.label.name == 'integration-test'` (or manual dispatch). +- **Concurrency**: Group `nat-zero-integration`, `cancel-in-progress: false`. Only one integration test runs at a time; new ones queue. +- **Environment**: `integration` (holds the `INTEGRATION_ROLE_ARN` secret for OIDC). +- **Timeout**: 15 minutes. +- **Job name**: `integration-test` (required status check for merge). + +### Steps + +1. Checkout, setup Go, setup Terraform (wrapper disabled). +2. Assume AWS role via OIDC (`aws-actions/configure-aws-credentials`). +3. Build the Lambda binary from source (`cmd/lambda/` -> `.build/lambda.zip`). +4. Run `go test -v -timeout 10m -count=1` in `tests/integration/`. + +## Docs (`docs.yml`) + +Deploys MkDocs Material to GitHub Pages. + +- **Trigger**: Push to `main` only, when `docs/**`, `mkdocs.yml`, `README.md`, or `*.tf` change. +- **Not a merge gate** -- only runs post-merge. +- Runs `mkdocs gh-deploy --force`. + +## Release Please (`release-please.yml`) + +Two-job workflow that automates versioning, changelogs, and Lambda binary distribution. + +### Job 1: `release-please` + +Runs `googleapis/release-please-action@v4` with: + +- **Config**: `release-please-config.json` -- `terraform-module` release type at repo root. 
+- **Manifest**: `.release-please-manifest.json` -- tracks current version (starts at `0.0.0`). + +#### How release-please works step by step + +1. Every push to `main` triggers this job. +2. Release-please scans commits since the last release for Conventional Commits (`feat:`, `fix:`, etc.). +3. If releasable commits exist (`feat` or `fix`), it **creates or updates a release PR** (e.g., `chore(main): release 0.1.0`) containing: + - Updated `CHANGELOG.md` with grouped entries per the configured sections (Features, Bug Fixes, Performance, Documentation, Miscellaneous). + - Version bump in `.release-please-manifest.json`. + - For `terraform-module` type: version strings in Terraform files if present. +4. The release PR sits open until merged. +5. When the release PR is merged, release-please runs again on that push. It detects its own merged PR and: + - Creates a **GitHub Release** with a version tag (e.g., `v0.1.0`). + - Sets output `release_created=true` and `tag_name=v0.1.0`. + +### Job 2: `build-lambda` + +Only runs when `release_created == 'true'` (i.e., the push that merges a release PR). + +1. Cross-compiles the Go Lambda for `linux/arm64`. +2. Zips as `lambda.zip`. +3. **Uploads to the versioned release** (e.g., `v0.1.0`). +4. **Creates/updates a rolling `nat-zero-lambda-latest` release** with the same zip. This provides a stable URL for the module's default `lambda_binary_url`. + +### Changelog sections + +| Commit prefix | Changelog section | Triggers release? | +|---------------|-------------------|-------------------| +| `feat:` | Features | Yes (minor bump) | +| `fix:` | Bug Fixes | Yes (patch bump) | +| `perf:` | Performance | No | +| `docs:` | Documentation | No | +| `chore:` | Miscellaneous | No | +| `feat!:` / `BREAKING CHANGE:` | Features | Yes (major bump) | + +## Repo Rulesets + +### `main` branch ruleset + +- **No direct push**: creation, update, deletion, and non-fast-forward all blocked. 
+- **PRs required** with: + - 1 approving review + - Stale reviews dismissed on push + - Last push approval required (reviewer cannot be the person who pushed the last commit) + - All review threads must be resolved + - **Squash merge only** +- **Required status checks**: `precommit`, `go-test`, `integration-test` + - `strict_required_status_checks_policy: false` -- checks that don't run (path filtering / label gating) won't block merge. +- **Bypass**: Admin role can bypass always. + +### `tags` ruleset + +- Protects `refs/tags/v*` -- no deletion or update of version tags. +- Ensures release-please's tags are immutable. +- Same admin bypass. + +## PR Lifecycle Summary + +``` +Open PR + -> precommit runs (always) + -> go-test runs (if cmd/lambda/** changed) + -> Add "integration-test" label -> integration tests run against real AWS + -> 1 approval + threads resolved + -> Squash merge to main + +Post-merge to main: + -> release-please creates/updates a release PR (if feat/fix commits exist) + -> docs deploy (if docs changed) + +Merge release PR: + -> release-please creates GitHub Release + tag + -> build-lambda uploads lambda.zip to release + rolling latest +``` diff --git a/tests/integration/fixture/main.tf b/tests/integration/fixture/main.tf index 2bb904e..0e47126 100644 --- a/tests/integration/fixture/main.tf +++ b/tests/integration/fixture/main.tf @@ -27,6 +27,10 @@ data "aws_subnets" "default" { name = "default-for-az" values = ["true"] } + filter { + name = "availability-zone" + values = ["us-east-1a"] + } } data "aws_subnet" "public" { From d3e0aba4bac1b751d2e694c3893304fc9ce12284 Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Wed, 25 Feb 2026 08:30:06 +1000 Subject: [PATCH 07/30] fix: prevent EIP leak from concurrent attachEIP races Add a DescribeNetworkInterfaces re-check in attachEIP after allocation to detect when another invocation already attached an EIP, and add an orphan EIP sweep in detachEIP to clean up any leaked allocations. 
Co-Authored-By: Claude Opus 4.6 --- cmd/lambda/ec2ops.go | 96 ++++++++++++++++++-------- cmd/lambda/ec2ops_test.go | 133 ++++++++++++++++++++++++++++++++++++- cmd/lambda/handler.go | 2 +- cmd/lambda/handler_test.go | 17 +++++ 4 files changed, 219 insertions(+), 29 deletions(-) diff --git a/cmd/lambda/ec2ops.go b/cmd/lambda/ec2ops.go index 2d850f6..7fa15c1 100644 --- a/cmd/lambda/ec2ops.go +++ b/cmd/lambda/ec2ops.go @@ -234,6 +234,23 @@ func (h *Handler) attachEIP(ctx context.Context, instanceID, az string) { } allocID := aws.ToString(alloc.AllocationId) + // Race-detection: re-check ENI before associating. Another invocation may + // have already attached an EIP between our first check and AllocateAddress. + niResp, descErr := h.EC2.DescribeNetworkInterfaces(ctx, &ec2.DescribeNetworkInterfacesInput{ + NetworkInterfaceIds: []string{eniID}, + }) + if descErr == nil && len(niResp.NetworkInterfaces) > 0 { + ni := niResp.NetworkInterfaces[0] + if ni.Association != nil && aws.ToString(ni.Association.PublicIp) != "" { + log.Printf("Race detected: ENI %s already has EIP %s, releasing %s", + eniID, aws.ToString(ni.Association.PublicIp), allocID) + h.EC2.ReleaseAddress(ctx, &ec2.ReleaseAddressInput{AllocationId: aws.String(allocID)}) + return + } + } else if descErr != nil { + log.Printf("Failed to re-check ENI %s (proceeding with associate): %v", eniID, descErr) + } + _, err = h.EC2.AssociateAddress(ctx, &ec2.AssociateAddressInput{ AllocationId: aws.String(allocID), NetworkInterfaceId: aws.String(eniID), @@ -247,8 +264,9 @@ func (h *Handler) attachEIP(ctx context.Context, instanceID, az string) { } // detachEIP waits for the NAT instance to reach "stopped", then disassociates -// and releases the EIP from the public ENI. Idempotent: no-op if no EIP. -func (h *Handler) detachEIP(ctx context.Context, instanceID string) { +// and releases the EIP from the public ENI. Also sweeps for orphaned EIPs +// left by concurrent attachEIP races. 
+func (h *Handler) detachEIP(ctx context.Context, instanceID, az string) { defer timed("detach_eip")() if !h.waitForState(ctx, instanceID, []string{"stopped"}, 120) { @@ -272,37 +290,63 @@ func (h *Handler) detachEIP(ctx context.Context, instanceID string) { log.Printf("Failed to describe ENI %s: %v", eniID, err) return } - if len(niResp.NetworkInterfaces) == 0 { - return - } - ni := niResp.NetworkInterfaces[0] - if ni.Association == nil || aws.ToString(ni.Association.AssociationId) == "" { - return - } - - assocID := aws.ToString(ni.Association.AssociationId) - allocID := aws.ToString(ni.Association.AllocationId) - publicIP := aws.ToString(ni.Association.PublicIp) + if len(niResp.NetworkInterfaces) > 0 { + ni := niResp.NetworkInterfaces[0] + if ni.Association != nil && aws.ToString(ni.Association.AssociationId) != "" { + assocID := aws.ToString(ni.Association.AssociationId) + allocID := aws.ToString(ni.Association.AllocationId) + publicIP := aws.ToString(ni.Association.PublicIp) - _, err = h.EC2.DisassociateAddress(ctx, &ec2.DisassociateAddressInput{ - AssociationId: aws.String(assocID), - }) - if err != nil { - if isErrCode(err, "InvalidAssociationID.NotFound") { - log.Printf("EIP already disassociated from %s", eniID) - } else { - log.Printf("Failed to detach EIP from %s: %v", eniID, err) - return + _, err = h.EC2.DisassociateAddress(ctx, &ec2.DisassociateAddressInput{ + AssociationId: aws.String(assocID), + }) + if err != nil { + if isErrCode(err, "InvalidAssociationID.NotFound") { + log.Printf("EIP already disassociated from %s", eniID) + } else { + log.Printf("Failed to detach EIP from %s: %v", eniID, err) + return + } + } + _, err = h.EC2.ReleaseAddress(ctx, &ec2.ReleaseAddressInput{ + AllocationId: aws.String(allocID), + }) + if err != nil { + log.Printf("Failed to release EIP %s: %v", allocID, err) + } else { + log.Printf("Released EIP %s from %s", publicIP, eniID) + } } } - _, err = h.EC2.ReleaseAddress(ctx, &ec2.ReleaseAddressInput{ - AllocationId: 
aws.String(allocID), + + // Orphan sweep: release any EIPs tagged for this AZ that were left behind + // by a concurrent attachEIP race. + addrResp, err := h.EC2.DescribeAddresses(ctx, &ec2.DescribeAddressesInput{ + Filters: []ec2types.Filter{ + {Name: aws.String("tag:" + h.NATTagKey), Values: []string{h.NATTagValue}}, + {Name: aws.String("tag:AZ"), Values: []string{az}}, + }, }) if err != nil { - log.Printf("Failed to release EIP %s: %v", allocID, err) + log.Printf("Orphan EIP sweep failed for %s: %v", az, err) return } - log.Printf("Released EIP %s from %s", publicIP, eniID) + for _, addr := range addrResp.Addresses { + orphanAllocID := aws.ToString(addr.AllocationId) + if addr.AssociationId != nil { + h.EC2.DisassociateAddress(ctx, &ec2.DisassociateAddressInput{ + AssociationId: addr.AssociationId, + }) + } + _, err := h.EC2.ReleaseAddress(ctx, &ec2.ReleaseAddressInput{ + AllocationId: aws.String(orphanAllocID), + }) + if err != nil { + log.Printf("Failed to release orphan EIP %s: %v", orphanAllocID, err) + } else { + log.Printf("Released orphan EIP %s in %s", orphanAllocID, az) + } + } } // --- Config version --- diff --git a/cmd/lambda/ec2ops_test.go b/cmd/lambda/ec2ops_test.go index ec8787e..4bb2dde 100644 --- a/cmd/lambda/ec2ops_test.go +++ b/cmd/lambda/ec2ops_test.go @@ -276,6 +276,13 @@ func TestAttachEIP(t *testing.T) { mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { return &ec2.AllocateAddressOutput{AllocationId: aws.String("eipalloc-1"), PublicIp: aws.String("1.2.3.4")}, nil } + mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { + return &ec2.DescribeNetworkInterfacesOutput{ + NetworkInterfaces: []ec2types.NetworkInterface{{ + NetworkInterfaceId: aws.String("eni-pub1"), + }}, + }, nil + } mock.AssociateAddressFn = 
func(ctx context.Context, params *ec2.AssociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AssociateAddressOutput, error) { return &ec2.AssociateAddressOutput{}, nil } @@ -328,6 +335,13 @@ func TestAttachEIP(t *testing.T) { mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { return &ec2.AllocateAddressOutput{AllocationId: aws.String("eipalloc-1"), PublicIp: aws.String("1.2.3.4")}, nil } + mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { + return &ec2.DescribeNetworkInterfacesOutput{ + NetworkInterfaces: []ec2types.NetworkInterface{{ + NetworkInterfaceId: aws.String("eni-pub1"), + }}, + }, nil + } mock.AssociateAddressFn = func(ctx context.Context, params *ec2.AssociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AssociateAddressOutput, error) { return nil, fmt.Errorf("InvalidParameterValue: Bad param") } @@ -350,6 +364,58 @@ func TestAttachEIP(t *testing.T) { t.Error("expected no AllocateAddress when no public ENI") } }) + + t.Run("race detected releases allocated EIP", func(t *testing.T) { + mock := &mockEC2{} + eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil + } + mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { + return &ec2.AllocateAddressOutput{AllocationId: aws.String("eipalloc-1"), PublicIp: aws.String("1.2.3.4")}, nil + } + // Re-check shows another invocation already attached an EIP + mock.DescribeNetworkInterfacesFn = 
func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { + return &ec2.DescribeNetworkInterfacesOutput{ + NetworkInterfaces: []ec2types.NetworkInterface{{ + NetworkInterfaceId: aws.String("eni-pub1"), + Association: &ec2types.NetworkInterfaceAssociation{ + PublicIp: aws.String("9.9.9.9"), + }, + }}, + }, nil + } + h := newTestHandler(mock) + h.attachEIP(context.Background(), "i-nat1", testAZ) + if mock.callCount("AssociateAddress") != 0 { + t.Error("expected no AssociateAddress when race detected") + } + if mock.callCount("ReleaseAddress") != 1 { + t.Errorf("expected 1 ReleaseAddress call, got %d", mock.callCount("ReleaseAddress")) + } + }) + + t.Run("describe ENI failure still associates", func(t *testing.T) { + mock := &mockEC2{} + eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil + } + mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { + return &ec2.AllocateAddressOutput{AllocationId: aws.String("eipalloc-1"), PublicIp: aws.String("1.2.3.4")}, nil + } + mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { + return nil, fmt.Errorf("Throttling: Rate exceeded") + } + mock.AssociateAddressFn = func(ctx context.Context, params *ec2.AssociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AssociateAddressOutput, error) { + return &ec2.AssociateAddressOutput{}, nil + } + h := newTestHandler(mock) + h.attachEIP(context.Background(), "i-nat1", testAZ) + if 
mock.callCount("AssociateAddress") != 1 { + t.Error("expected AssociateAddress to proceed despite describe failure") + } + }) } // --- detachEIP() --- @@ -373,8 +439,11 @@ func TestDetachEIP(t *testing.T) { }}, }, nil } + mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + return &ec2.DescribeAddressesOutput{Addresses: []ec2types.Address{}}, nil + } h := newTestHandler(mock) - h.detachEIP(context.Background(), "i-nat1") + h.detachEIP(context.Background(), "i-nat1", testAZ) if mock.callCount("DisassociateAddress") != 1 { t.Error("expected DisassociateAddress") } @@ -396,12 +465,72 @@ func TestDetachEIP(t *testing.T) { }}, }, nil } + mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + return &ec2.DescribeAddressesOutput{Addresses: []ec2types.Address{}}, nil + } h := newTestHandler(mock) - h.detachEIP(context.Background(), "i-nat1") + h.detachEIP(context.Background(), "i-nat1", testAZ) if mock.callCount("DisassociateAddress") != 0 { t.Error("expected DisassociateAddress NOT to be called") } }) + + t.Run("cleans up orphaned EIPs", func(t *testing.T) { + mock := &mockEC2{} + eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(makeTestInstance("i-nat1", "stopped", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil + } + mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { + return &ec2.DescribeNetworkInterfacesOutput{ + NetworkInterfaces: []ec2types.NetworkInterface{{ + NetworkInterfaceId: aws.String("eni-pub1"), + Association: 
&ec2types.NetworkInterfaceAssociation{ + AssociationId: aws.String("eipassoc-1"), + AllocationId: aws.String("eipalloc-1"), + PublicIp: aws.String("1.2.3.4"), + }, + }}, + }, nil + } + mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + return &ec2.DescribeAddressesOutput{ + Addresses: []ec2types.Address{{ + AllocationId: aws.String("eipalloc-orphan"), + }}, + }, nil + } + h := newTestHandler(mock) + h.detachEIP(context.Background(), "i-nat1", testAZ) + // 1 from current association + 1 from orphan sweep + if mock.callCount("ReleaseAddress") != 2 { + t.Errorf("expected 2 ReleaseAddress calls, got %d", mock.callCount("ReleaseAddress")) + } + }) + + t.Run("orphan sweep error is non-fatal", func(t *testing.T) { + mock := &mockEC2{} + eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(makeTestInstance("i-nat1", "stopped", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil + } + mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { + return &ec2.DescribeNetworkInterfacesOutput{ + NetworkInterfaces: []ec2types.NetworkInterface{{ + NetworkInterfaceId: aws.String("eni-pub1"), + }}, + }, nil + } + mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + return nil, fmt.Errorf("Throttling: Rate exceeded") + } + h := newTestHandler(mock) + // Should not panic + h.detachEIP(context.Background(), "i-nat1", testAZ) + if mock.callCount("DisassociateAddress") != 0 { + t.Error("expected no DisassociateAddress when no ENI association") + } + }) } 
// --- createNAT() --- diff --git a/cmd/lambda/handler.go b/cmd/lambda/handler.go index 1312405..c8fae54 100644 --- a/cmd/lambda/handler.go +++ b/cmd/lambda/handler.go @@ -55,7 +55,7 @@ func (h *Handler) handle(ctx context.Context, event Event) error { if isStarting(state) { h.attachEIP(ctx, iid, az) } else if isStopping(state) { - h.detachEIP(ctx, iid) + h.detachEIP(ctx, iid, az) } return nil } diff --git a/cmd/lambda/handler_test.go b/cmd/lambda/handler_test.go index f82ba8a..80871b6 100644 --- a/cmd/lambda/handler_test.go +++ b/cmd/lambda/handler_test.go @@ -83,6 +83,13 @@ func TestHandlerNatEvents(t *testing.T) { mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { return &ec2.AllocateAddressOutput{AllocationId: aws.String("eipalloc-1"), PublicIp: aws.String("1.2.3.4")}, nil } + mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { + return &ec2.DescribeNetworkInterfacesOutput{ + NetworkInterfaces: []ec2types.NetworkInterface{{ + NetworkInterfaceId: aws.String("eni-pub1"), + }}, + }, nil + } mock.AssociateAddressFn = func(ctx context.Context, params *ec2.AssociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AssociateAddressOutput, error) { return &ec2.AssociateAddressOutput{}, nil } @@ -116,6 +123,13 @@ func TestHandlerNatEvents(t *testing.T) { mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { return &ec2.AllocateAddressOutput{AllocationId: aws.String("eipalloc-1"), PublicIp: aws.String("1.2.3.4")}, nil } + mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { + return 
&ec2.DescribeNetworkInterfacesOutput{ + NetworkInterfaces: []ec2types.NetworkInterface{{ + NetworkInterfaceId: aws.String("eni-pub1"), + }}, + }, nil + } mock.AssociateAddressFn = func(ctx context.Context, params *ec2.AssociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AssociateAddressOutput, error) { return &ec2.AssociateAddressOutput{}, nil } @@ -166,6 +180,9 @@ func TestHandlerNatEvents(t *testing.T) { }}, }, nil } + mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + return &ec2.DescribeAddressesOutput{Addresses: []ec2types.Address{}}, nil + } h := newTestHandler(mock) err := h.HandleRequest(context.Background(), Event{InstanceID: "i-nat1", State: "stopped"}) if err != nil { From d49b87da0e4894a2a2a513f90ad2030bfef187e3 Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Wed, 25 Feb 2026 10:33:12 +1000 Subject: [PATCH 08/30] fix: resolve scale-down race from EC2 API eventual consistency MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The integration test NATScaleDown phase was failing because findSiblings counted the dying workload as a "sibling" — EventBridge fires the shutting-down event before DescribeInstances reflects the new state, so the workload still appeared as "running" in the API response. Three fixes: 1. findSiblings now accepts an excludeID parameter so the triggering instance is never counted as a sibling. 2. maybeStopNAT retries when siblings ARE found (letting eventual consistency settle) instead of giving up immediately on the first false positive. 3. When classify can't find a terminated instance (already gone from the API), the handler falls back to sweepIdleNATs which checks all running NATs in the VPC and stops any with no siblings. Also uses StopInstances Force=true since NAT instances are stateless forwarders with no filesystem to flush. 
Co-Authored-By: Claude Opus 4.6 --- cmd/lambda/ec2ops.go | 33 +++++++++- cmd/lambda/ec2ops_test.go | 124 +++++++++++++++++++++++++++++++++++-- cmd/lambda/handler.go | 33 +++++++--- cmd/lambda/handler_test.go | 87 ++++++++++++++++++++++++-- 4 files changed, 258 insertions(+), 19 deletions(-) diff --git a/cmd/lambda/ec2ops.go b/cmd/lambda/ec2ops.go index 7fa15c1..96141e0 100644 --- a/cmd/lambda/ec2ops.go +++ b/cmd/lambda/ec2ops.go @@ -152,7 +152,7 @@ func (h *Handler) findNAT(ctx context.Context, az, vpc string) *Instance { return keep } -func (h *Handler) findSiblings(ctx context.Context, az, vpc string) []*Instance { +func (h *Handler) findSiblings(ctx context.Context, az, vpc, excludeID string) []*Instance { defer timed("find_siblings")() resp, err := h.EC2.DescribeInstances(ctx, &ec2.DescribeInstancesInput{ Filters: []ec2types.Filter{ @@ -170,6 +170,9 @@ func (h *Handler) findSiblings(ctx context.Context, az, vpc string) []*Instance for _, r := range resp.Reservations { for _, i := range r.Instances { inst := instanceFromAPI(i) + if inst.InstanceID == excludeID { + continue + } if !hasTag(inst.Tags, h.NATTagKey, h.NATTagValue) && !hasTag(inst.Tags, h.IgnoreTagKey, h.IgnoreTagValue) { siblings = append(siblings, inst) @@ -535,6 +538,7 @@ func (h *Handler) stopNAT(ctx context.Context, inst *Instance) { iid := inst.InstanceID _, err := h.EC2.StopInstances(ctx, &ec2.StopInstancesInput{ InstanceIds: []string{iid}, + Force: aws.Bool(true), }) if err != nil { log.Printf("Failed to stop NAT %s: %v", iid, err) @@ -543,6 +547,33 @@ func (h *Handler) stopNAT(ctx context.Context, inst *Instance) { log.Printf("Stopped NAT %s", iid) } +// sweepIdleNATs is a fallback for when classify can't find the triggering +// instance (e.g. it's already gone from the EC2 API after termination). +// It checks every running NAT in the VPC and stops any with no siblings. 
+func (h *Handler) sweepIdleNATs(ctx context.Context, triggerID string) { + defer timed("sweep_idle_nats")() + resp, err := h.EC2.DescribeInstances(ctx, &ec2.DescribeInstancesInput{ + Filters: []ec2types.Filter{ + {Name: aws.String("tag:" + h.NATTagKey), Values: []string{h.NATTagValue}}, + {Name: aws.String("vpc-id"), Values: []string{h.TargetVPC}}, + {Name: aws.String("instance-state-name"), Values: []string{"pending", "running"}}, + }, + }) + if err != nil { + log.Printf("Sweep failed: %v", err) + return + } + for _, r := range resp.Reservations { + for _, i := range r.Instances { + nat := instanceFromAPI(i) + if len(h.findSiblings(ctx, nat.AZ, nat.VpcID, triggerID)) == 0 { + log.Printf("Sweep: no siblings for NAT %s in %s, stopping", nat.InstanceID, nat.AZ) + h.stopNAT(ctx, nat) + } + } + } +} + // --- Cleanup (destroy-time) --- func (h *Handler) cleanupAll(ctx context.Context) { diff --git a/cmd/lambda/ec2ops_test.go b/cmd/lambda/ec2ops_test.go index 4bb2dde..281bcb9 100644 --- a/cmd/lambda/ec2ops_test.go +++ b/cmd/lambda/ec2ops_test.go @@ -225,7 +225,7 @@ func TestFindSiblings(t *testing.T) { return describeResponse(), nil } h := newTestHandler(mock) - sibs := h.findSiblings(context.Background(), testAZ, testVPC) + sibs := h.findSiblings(context.Background(), testAZ, testVPC, "") if len(sibs) != 0 { t.Errorf("expected 0 siblings, got %d", len(sibs)) } @@ -239,7 +239,7 @@ func TestFindSiblings(t *testing.T) { return describeResponse(work), nil } h := newTestHandler(mock) - sibs := h.findSiblings(context.Background(), testAZ, testVPC) + sibs := h.findSiblings(context.Background(), testAZ, testVPC, "") if len(sibs) != 1 || sibs[0].InstanceID != "i-work" { t.Errorf("expected [i-work], got %v", sibs) } @@ -257,11 +257,41 @@ func TestFindSiblings(t *testing.T) { return describeResponse(work, nat, ignored), nil } h := newTestHandler(mock) - sibs := h.findSiblings(context.Background(), testAZ, testVPC) + sibs := h.findSiblings(context.Background(), testAZ, testVPC, 
"") if len(sibs) != 1 || sibs[0].InstanceID != "i-work" { t.Errorf("expected [i-work], got %v", sibs) } }) + + t.Run("excludes trigger instance", func(t *testing.T) { + mock := &mockEC2{} + trigger := makeTestInstance("i-dying", "running", testVPC, testAZ, + []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("api")}}, nil) + other := makeTestInstance("i-alive", "running", testVPC, testAZ, + []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}}, nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(trigger, other), nil + } + h := newTestHandler(mock) + sibs := h.findSiblings(context.Background(), testAZ, testVPC, "i-dying") + if len(sibs) != 1 || sibs[0].InstanceID != "i-alive" { + t.Errorf("expected [i-alive], got %v", sibs) + } + }) + + t.Run("excludes trigger when it is only instance", func(t *testing.T) { + mock := &mockEC2{} + trigger := makeTestInstance("i-dying", "running", testVPC, testAZ, + []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("api")}}, nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(trigger), nil + } + h := newTestHandler(mock) + sibs := h.findSiblings(context.Background(), testAZ, testVPC, "i-dying") + if len(sibs) != 0 { + t.Errorf("expected 0 siblings, got %d", len(sibs)) + } + }) } // --- attachEIP() --- @@ -685,8 +715,14 @@ func TestStartNAT(t *testing.T) { // --- stopNAT() --- func TestStopNAT(t *testing.T) { - t.Run("happy path just stops", func(t *testing.T) { + t.Run("happy path uses force stop", func(t *testing.T) { mock := &mockEC2{} + mock.StopInstancesFn = func(ctx context.Context, params *ec2.StopInstancesInput, optFns ...func(*ec2.Options)) (*ec2.StopInstancesOutput, error) { + if params.Force == nil || !*params.Force { + 
t.Error("expected Force=true in StopInstances") + } + return &ec2.StopInstancesOutput{}, nil + } h := newTestHandler(mock) h.stopNAT(context.Background(), &Instance{InstanceID: "i-nat1"}) if mock.callCount("StopInstances") != 1 { @@ -708,6 +744,86 @@ func TestStopNAT(t *testing.T) { }) } +// --- sweepIdleNATs() --- + +func TestSweepIdleNATs(t *testing.T) { + natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} + + t.Run("stops NAT with no siblings", func(t *testing.T) { + mock := &mockEC2{} + natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil) + var callIdx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&callIdx, 1) + if idx == 1 { + // sweep query: find running NATs + return describeResponse(natInst), nil + } + // findSiblings query: no siblings + return describeResponse(), nil + } + h := newTestHandler(mock) + h.sweepIdleNATs(context.Background(), "i-trigger") + if mock.callCount("StopInstances") != 1 { + t.Errorf("expected StopInstances to be called once, got %d", mock.callCount("StopInstances")) + } + }) + + t.Run("keeps NAT with siblings", func(t *testing.T) { + mock := &mockEC2{} + natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil) + sibInst := makeTestInstance("i-sib1", "running", testVPC, testAZ, + []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}}, nil) + var callIdx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&callIdx, 1) + if idx == 1 { + return describeResponse(natInst), nil + } + return describeResponse(sibInst), nil + } + h := newTestHandler(mock) + h.sweepIdleNATs(context.Background(), "i-trigger") + if mock.callCount("StopInstances") != 0 { + 
t.Error("expected StopInstances NOT to be called") + } + }) + + t.Run("excludes trigger from siblings", func(t *testing.T) { + mock := &mockEC2{} + natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil) + // The trigger instance still appears as running (EC2 eventual consistency) + triggerInst := makeTestInstance("i-trigger", "running", testVPC, testAZ, + []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}}, nil) + var callIdx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&callIdx, 1) + if idx == 1 { + return describeResponse(natInst), nil + } + // findSiblings returns the trigger instance (but it should be excluded) + return describeResponse(triggerInst), nil + } + h := newTestHandler(mock) + h.sweepIdleNATs(context.Background(), "i-trigger") + if mock.callCount("StopInstances") != 1 { + t.Errorf("expected StopInstances (trigger should be excluded), got %d", mock.callCount("StopInstances")) + } + }) + + t.Run("no running NATs is noop", func(t *testing.T) { + mock := &mockEC2{} + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(), nil + } + h := newTestHandler(mock) + h.sweepIdleNATs(context.Background(), "i-trigger") + if mock.callCount("StopInstances") != 0 { + t.Error("expected no StopInstances calls") + } + }) +} + // --- cleanupAll() --- func TestCleanupAll(t *testing.T) { diff --git a/cmd/lambda/handler.go b/cmd/lambda/handler.go index c8fae54..73400b9 100644 --- a/cmd/lambda/handler.go +++ b/cmd/lambda/handler.go @@ -47,6 +47,13 @@ func (h *Handler) handle(ctx context.Context, event Event) error { ignore, isNAT, az, vpc := h.classify(ctx, iid) if ignore { + // If the instance can no longer be found (e.g. 
terminated and gone + // from the API), fall back to a VPC-wide sweep so we don't miss the + // scale-down opportunity. + if isTerminating(state) { + log.Printf("Instance %s gone (state=%s), sweeping for idle NATs", iid, state) + h.sweepIdleNATs(ctx, iid) + } return nil } @@ -69,7 +76,7 @@ func (h *Handler) handle(ctx context.Context, event Event) error { } if isStopping(state) || isTerminating(state) { - h.maybeStopNAT(ctx, nat, az, vpc) + h.maybeStopNAT(ctx, nat, az, vpc, iid) } return nil } @@ -97,19 +104,29 @@ func (h *Handler) ensureNAT(ctx context.Context, nat *Instance, az, vpc string) } // maybeStopNAT stops the NAT if no sibling workloads remain. -func (h *Handler) maybeStopNAT(ctx context.Context, nat *Instance, az, vpc string) { +// triggerID is the instance whose state change triggered this check; it is +// excluded from the sibling query so that a dying workload doesn't count +// itself as a reason to keep the NAT alive. +func (h *Handler) maybeStopNAT(ctx context.Context, nat *Instance, az, vpc, triggerID string) { if nat == nil { return } - // Brief retry to let concurrent events settle. + // Retry to let EC2 API eventual consistency settle. + // Sleep before each check so DescribeInstances reflects the latest state. 
+ var siblings []*Instance for attempt := 0; attempt < 3; attempt++ { - if len(h.findSiblings(ctx, az, vpc)) > 0 { - log.Printf("Siblings still running in %s, keeping NAT", az) - return - } - if attempt < 2 { + if attempt > 0 { h.sleep(2 * time.Second) } + siblings = h.findSiblings(ctx, az, vpc, triggerID) + if len(siblings) == 0 { + break + } + log.Printf("Siblings found in %s (attempt %d/3), rechecking", az, attempt+1) + } + if len(siblings) > 0 { + log.Printf("Siblings still running in %s after retries, keeping NAT", az) + return } if isStarting(nat.StateName) { diff --git a/cmd/lambda/handler_test.go b/cmd/lambda/handler_test.go index 80871b6..0297f1c 100644 --- a/cmd/lambda/handler_test.go +++ b/cmd/lambda/handler_test.go @@ -66,6 +66,52 @@ func TestHandlerIgnored(t *testing.T) { t.Errorf("expected 1 DescribeInstances call (classify), got %d", mock.callCount("DescribeInstances")) } }) + + t.Run("terminated event sweeps idle NATs when instance gone", func(t *testing.T) { + mock := &mockEC2{} + natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} + natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil) + var callIdx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&callIdx, 1) + if idx == 1 { + // classify: instance not found (already gone from API) + return describeResponse(), nil + } + if params.Filters != nil { + for _, f := range params.Filters { + if aws.ToString(f.Name) == "tag:nat-zero:managed" { + return describeResponse(natInst), nil + } + } + } + // findSiblings: no siblings + return describeResponse(), nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-gone", State: "terminated"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("StopInstances") != 1 { + 
t.Errorf("expected StopInstances via sweep, got %d", mock.callCount("StopInstances")) + } + }) + + t.Run("non-terminating ignored event does not sweep", func(t *testing.T) { + mock := &mockEC2{} + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(), nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-skip", State: "running"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("StopInstances") != 0 { + t.Error("expected no sweep for non-terminating event") + } + }) } // --- NAT instance events (EventBridge-driven EIP management) --- @@ -525,13 +571,12 @@ func TestHandlerWorkloadScaleDown(t *testing.T) { } }) - t.Run("siblings appear on retry", func(t *testing.T) { + t.Run("persistent siblings keeps NAT after retries", func(t *testing.T) { mock := &mockEC2{} workInst := makeTestInstance("i-work1", "stopping", testVPC, testAZ, workTags, nil) natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil) sibInst := makeTestInstance("i-sib1", "running", testVPC, testAZ, workTags, nil) var callIdx int32 - var sibCallIdx int32 mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { idx := atomic.AddInt32(&callIdx, 1) if idx == 1 { @@ -544,10 +589,7 @@ func TestHandlerWorkloadScaleDown(t *testing.T) { } } } - sibIdx := atomic.AddInt32(&sibCallIdx, 1) - if sibIdx == 1 { - return describeResponse(), nil - } + // All findSiblings calls return a sibling return describeResponse(sibInst), nil } h := newTestHandler(mock) @@ -560,6 +602,39 @@ func TestHandlerWorkloadScaleDown(t *testing.T) { } }) + t.Run("trigger instance excluded from siblings", func(t *testing.T) { + mock := &mockEC2{} + // The trigger workload still shows as "running" due to EC2 
eventual consistency + workInst := makeTestInstance("i-work1", "running", testVPC, testAZ, workTags, nil) + natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil) + var callIdx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&callIdx, 1) + if idx == 1 { + // classify: the trigger instance + return describeResponse(workInst), nil + } + if params.Filters != nil { + for _, f := range params.Filters { + if aws.ToString(f.Name) == "tag:nat-zero:managed" { + return describeResponse(natInst), nil + } + } + } + // findSiblings: the trigger instance shows as running (eventual consistency) + // but should be excluded by its ID + return describeResponse(workInst), nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "shutting-down"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("StopInstances") != 1 { + t.Errorf("expected StopInstances (trigger excluded from siblings), got %d", mock.callCount("StopInstances")) + } + }) + t.Run("pending NAT no siblings stops", func(t *testing.T) { mock := &mockEC2{} workInst := makeTestInstance("i-work1", "stopped", testVPC, testAZ, workTags, nil) From c1628d11d27bb83f7a22d8498651905de6baa761 Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Wed, 25 Feb 2026 13:09:41 +1000 Subject: [PATCH 09/30] test: add race condition catalog with unit tests and docs Map all 10 identified race conditions (R1-R10) across scale-down, scale-up, EIP, and ENI subsystems. Add dedicated TestRace_R* tests in race_test.go and a Race Conditions section in ARCHITECTURE.md with sequence diagrams for the three highest-severity races. 
Co-Authored-By: Claude Opus 4.6 --- cmd/lambda/race_test.go | 685 ++++++++++++++++++++++++++++++++++++++++ docs/ARCHITECTURE.md | 132 ++++++++ 2 files changed, 817 insertions(+) create mode 100644 cmd/lambda/race_test.go diff --git a/cmd/lambda/race_test.go b/cmd/lambda/race_test.go new file mode 100644 index 0000000..21ca3f2 --- /dev/null +++ b/cmd/lambda/race_test.go @@ -0,0 +1,685 @@ +package main + +import ( + "context" + "fmt" + "sync/atomic" + "testing" + + "github.com/aws/aws-sdk-go-v2/aws" + "github.com/aws/aws-sdk-go-v2/service/ec2" + ec2types "github.com/aws/aws-sdk-go-v2/service/ec2/types" + "github.com/aws/smithy-go" +) + +// ============================================================================= +// Race Condition Tests +// +// Each TestRace_R* function documents and verifies the behavior of a specific +// race condition identified in the nat-zero Lambda. Race conditions arise from: +// - Multiple concurrent Lambda invocations from overlapping EventBridge events +// - EC2 API eventual consistency (state changes not immediately visible) +// +// See docs/ARCHITECTURE.md "Race Conditions" section for the full catalog. +// ============================================================================= + +// TestRace_R1_StaleSiblingEventualConsistency verifies the retry logic in +// maybeStopNAT when EC2 eventual consistency causes a dying workload to still +// appear as "running" in DescribeInstances. +// +// Race scenario: +// - Workload i-dying fires shutting-down event +// - Lambda calls findSiblings, but EC2 API still returns i-dying as "running" +// - Without mitigation, the NAT would never be stopped +// +// Mitigation: maybeStopNAT excludes the trigger instance ID from siblings AND +// retries up to 3 times with 2s sleep between attempts. 
+func TestRace_R1_StaleSiblingEventualConsistency(t *testing.T) { + t.Run("trigger excluded from siblings on first attempt", func(t *testing.T) { + mock := &mockEC2{} + workTags := []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}} + natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} + workInst := makeTestInstance("i-dying", "running", testVPC, testAZ, workTags, nil) + natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil) + + var callIdx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&callIdx, 1) + if idx == 1 { + // classify: returns the workload instance + return describeResponse(workInst), nil + } + if params.Filters != nil { + for _, f := range params.Filters { + if aws.ToString(f.Name) == "tag:nat-zero:managed" { + return describeResponse(natInst), nil + } + } + } + // findSiblings: trigger still appears as running (eventual consistency) + // but should be excluded by excludeID + return describeResponse(workInst), nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-dying", State: "shutting-down"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("StopInstances") != 1 { + t.Errorf("expected StopInstances=1 (trigger excluded), got %d", mock.callCount("StopInstances")) + } + }) + + t.Run("other stale sibling clears on retry", func(t *testing.T) { + mock := &mockEC2{} + workTags := []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}} + natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} + workInst := makeTestInstance("i-dying", "stopping", testVPC, testAZ, workTags, nil) + natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil) + staleInst := makeTestInstance("i-stale", "running", 
testVPC, testAZ, workTags, nil) + + var callIdx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&callIdx, 1) + if idx == 1 { + return describeResponse(workInst), nil + } + if params.Filters != nil { + for _, f := range params.Filters { + if aws.ToString(f.Name) == "tag:nat-zero:managed" { + return describeResponse(natInst), nil + } + } + } + // First findSiblings call: stale sibling still running + // Subsequent calls: sibling gone (EC2 caught up) + if idx <= 3 { + return describeResponse(staleInst), nil + } + return describeResponse(), nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-dying", State: "stopping"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("StopInstances") != 1 { + t.Errorf("expected StopInstances=1 after retry succeeds, got %d", mock.callCount("StopInstances")) + } + }) +} + +// TestRace_R2_TerminatedInstanceGoneFromAPI verifies the sweepIdleNATs fallback +// when classify returns ignore=true because the terminated instance has already +// been purged from the EC2 API. +// +// Race scenario: +// - Workload terminates and EventBridge fires "terminated" event +// - By the time Lambda calls DescribeInstances, the instance is gone +// - classify returns ignore=true, normal scale-down path is skipped +// +// Mitigation: handler detects isTerminating(state) + ignore and calls +// sweepIdleNATs to check all NATs in the VPC for idle ones. +func TestRace_R2_TerminatedInstanceGoneFromAPI(t *testing.T) { + t.Run("sweep stops idle NAT when trigger gone", func(t *testing.T) { + // Already covered by handler_test.go "terminated event sweeps idle NATs" + // This variant ensures the sweep mechanism works end-to-end. 
+ mock := &mockEC2{} + natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} + natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil) + var callIdx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&callIdx, 1) + if idx == 1 { + // classify: instance gone + return describeResponse(), nil + } + if params.Filters != nil { + for _, f := range params.Filters { + if aws.ToString(f.Name) == "tag:nat-zero:managed" { + return describeResponse(natInst), nil + } + } + } + // findSiblings: no siblings + return describeResponse(), nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-gone", State: "terminated"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("StopInstances") != 1 { + t.Errorf("expected sweep to stop idle NAT, got StopInstances=%d", mock.callCount("StopInstances")) + } + }) + + t.Run("sweep handles multiple NATs across AZs", func(t *testing.T) { + mock := &mockEC2{} + natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} + workTags := []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}} + natAZ1 := makeTestInstance("i-nat-az1", "running", testVPC, "us-east-1a", natTags, nil) + natAZ2 := makeTestInstance("i-nat-az2", "running", testVPC, "us-east-1b", natTags, nil) + sibAZ2 := makeTestInstance("i-sib-az2", "running", testVPC, "us-east-1b", workTags, nil) + + var callIdx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&callIdx, 1) + if idx == 1 { + // classify: instance gone + return describeResponse(), nil + } + if params.Filters != nil { + for _, f := range params.Filters { + if 
aws.ToString(f.Name) == "tag:nat-zero:managed" { + // sweep: both NATs found + return describeResponse(natAZ1, natAZ2), nil + } + } + // findSiblings: check AZ filter + for _, f := range params.Filters { + if aws.ToString(f.Name) == "availability-zone" { + if f.Values[0] == "us-east-1b" { + return describeResponse(sibAZ2), nil + } + return describeResponse(), nil + } + } + } + return describeResponse(), nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-gone", State: "shutting-down"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + // Only NAT in AZ1 should be stopped (AZ2 has a sibling) + if mock.callCount("StopInstances") != 1 { + t.Errorf("expected 1 StopInstances (only idle NAT), got %d", mock.callCount("StopInstances")) + } + }) +} + +// TestRace_R3_RetryExhaustion verifies the accepted risk when EC2 eventual +// consistency takes longer than the retry budget (3 attempts x 2s = 6s). +// +// Race scenario: +// - A sibling workload is shutting down but EC2 API never reflects the change +// within the retry window +// - findSiblings persistently returns a stale sibling on all 3 attempts +// +// Accepted risk: NAT stays running. The next scale-down event or sweepIdleNATs +// will eventually catch it. 
+func TestRace_R3_RetryExhaustion(t *testing.T) { + mock := &mockEC2{} + workTags := []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}} + natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} + workInst := makeTestInstance("i-dying", "stopping", testVPC, testAZ, workTags, nil) + natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil) + staleInst := makeTestInstance("i-stale", "running", testVPC, testAZ, workTags, nil) + + var callIdx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&callIdx, 1) + if idx == 1 { + return describeResponse(workInst), nil + } + if params.Filters != nil { + for _, f := range params.Filters { + if aws.ToString(f.Name) == "tag:nat-zero:managed" { + return describeResponse(natInst), nil + } + } + } + // All 3 findSiblings attempts: stale sibling persists + return describeResponse(staleInst), nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-dying", State: "stopping"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + // Accepted risk: NAT kept alive because stale sibling never cleared + if mock.callCount("StopInstances") != 0 { + t.Error("expected StopInstances NOT called (retry exhaustion, accepted risk)") + } +} + +// TestRace_R4_DuplicateNATCreation verifies the reactive deduplication in +// findNAT when two concurrent Lambda invocations both create a NAT instance. +// +// Race scenario: +// - Two workload pending events arrive simultaneously +// - Both Lambda invocations call findNAT → nil, both call createNAT +// - Two NAT instances now exist in the same AZ +// +// Mitigation: findNAT detects multiple NATs, keeps the first running one, +// and terminates the extras via TerminateInstances. 
+func TestRace_R4_DuplicateNATCreation(t *testing.T) { + t.Run("two running NATs deduplicates to one", func(t *testing.T) { + mock := &mockEC2{} + nat1 := makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, nil) + nat2 := makeTestInstance("i-nat2", "running", testVPC, testAZ, nil, nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(nat1, nat2), nil + } + h := newTestHandler(mock) + result := h.findNAT(context.Background(), testAZ, testVPC) + if result == nil { + t.Fatal("expected a NAT to be returned") + } + if result.InstanceID != "i-nat1" { + t.Errorf("expected first running NAT i-nat1, got %s", result.InstanceID) + } + if mock.callCount("TerminateInstances") != 1 { + t.Errorf("expected 1 TerminateInstances (extra NAT), got %d", mock.callCount("TerminateInstances")) + } + }) + + t.Run("running NAT preferred over stopped", func(t *testing.T) { + mock := &mockEC2{} + stopped := makeTestInstance("i-stopped", "stopped", testVPC, testAZ, nil, nil) + running := makeTestInstance("i-running", "running", testVPC, testAZ, nil, nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(stopped, running), nil + } + h := newTestHandler(mock) + result := h.findNAT(context.Background(), testAZ, testVPC) + if result == nil || result.InstanceID != "i-running" { + t.Errorf("expected running NAT to be kept, got %v", result) + } + if mock.callCount("TerminateInstances") != 1 { + t.Errorf("expected 1 TerminateInstances, got %d", mock.callCount("TerminateInstances")) + } + }) + + t.Run("three NATs terminates two extras", func(t *testing.T) { + mock := &mockEC2{} + nat1 := makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, nil) + nat2 := makeTestInstance("i-nat2", "running", testVPC, testAZ, nil, nil) 
+ nat3 := makeTestInstance("i-nat3", "pending", testVPC, testAZ, nil, nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(nat1, nat2, nat3), nil + } + h := newTestHandler(mock) + result := h.findNAT(context.Background(), testAZ, testVPC) + if result == nil || result.InstanceID != "i-nat1" { + t.Errorf("expected first running NAT kept, got %v", result) + } + if mock.callCount("TerminateInstances") != 2 { + t.Errorf("expected 2 TerminateInstances (two extras), got %d", mock.callCount("TerminateInstances")) + } + }) +} + +// TestRace_R5_StartStopOverlap verifies behavior when a scale-up event fires +// while a concurrent scale-down is stopping the NAT. +// +// Race scenario: +// - Scale-down Lambda invocation calls StopInstances on the NAT +// - New workload pending event fires, Lambda sees NAT in "stopping" state +// - ensureNAT sees isStopping → calls startNAT +// - startNAT waits for "stopped" then calls StartInstances +// +// Accepted risk: Brief delay while NAT transitions stopping→stopped→starting. 
+func TestRace_R5_StartStopOverlap(t *testing.T) { + t.Run("stopping NAT waits then starts", func(t *testing.T) { + mock := &mockEC2{} + workTags := []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}} + natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} + workInst := makeTestInstance("i-work1", "pending", testVPC, testAZ, workTags, nil) + stoppingNAT := makeTestInstance("i-nat1", "stopping", testVPC, testAZ, natTags, nil) + stoppedNAT := makeTestInstance("i-nat1", "stopped", testVPC, testAZ, natTags, nil) + + var callIdx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&callIdx, 1) + if idx == 1 { + // classify: workload instance + return describeResponse(workInst), nil + } + if params.Filters != nil { + // findNAT: NAT is stopping + return describeResponse(stoppingNAT), nil + } + // waitForState in startNAT: first call returns stopping, second returns stopped + if idx <= 3 { + return describeResponse(stoppingNAT), nil + } + return describeResponse(stoppedNAT), nil + } + mock.StartInstancesFn = func(ctx context.Context, params *ec2.StartInstancesInput, optFns ...func(*ec2.Options)) (*ec2.StartInstancesOutput, error) { + return &ec2.StartInstancesOutput{}, nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "pending"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("StartInstances") != 1 { + t.Errorf("expected StartInstances=1 (NAT restarted after stop), got %d", mock.callCount("StartInstances")) + } + }) +} + +// TestRace_R6_DoubleEIPAllocation verifies the race-detection re-check in +// attachEIP when two concurrent Lambda invocations (from pending + running +// events) both try to allocate an EIP for the same NAT. 
+// +// Race scenario: +// - NAT pending event → Lambda A calls attachEIP, allocates EIP-A +// - NAT running event → Lambda B calls attachEIP, allocates EIP-B +// - Both try to associate to the same ENI +// +// Mitigation: After AllocateAddress, attachEIP re-checks the ENI via +// DescribeNetworkInterfaces. If another EIP is already associated, it releases +// the duplicate allocation. +func TestRace_R6_DoubleEIPAllocation(t *testing.T) { + t.Run("re-check detects existing EIP and releases duplicate", func(t *testing.T) { + mock := &mockEC2{} + eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil + } + mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { + return &ec2.AllocateAddressOutput{AllocationId: aws.String("eipalloc-dup"), PublicIp: aws.String("2.2.2.2")}, nil + } + // Re-check: another invocation already attached an EIP + mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { + return &ec2.DescribeNetworkInterfacesOutput{ + NetworkInterfaces: []ec2types.NetworkInterface{{ + NetworkInterfaceId: aws.String("eni-pub1"), + Association: &ec2types.NetworkInterfaceAssociation{ + PublicIp: aws.String("1.1.1.1"), // already attached by other invocation + }, + }}, + }, nil + } + h := newTestHandler(mock) + h.attachEIP(context.Background(), "i-nat1", testAZ) + + if mock.callCount("AssociateAddress") != 0 { + t.Error("expected no AssociateAddress (race detected)") + } + if mock.callCount("ReleaseAddress") != 1 { + t.Errorf("expected ReleaseAddress=1 (duplicate 
released), got %d", mock.callCount("ReleaseAddress")) + } + }) + + t.Run("associate fails also releases duplicate", func(t *testing.T) { + mock := &mockEC2{} + eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil + } + mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { + return &ec2.AllocateAddressOutput{AllocationId: aws.String("eipalloc-dup"), PublicIp: aws.String("2.2.2.2")}, nil + } + // Re-check: no EIP yet (race window) + mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { + return &ec2.DescribeNetworkInterfacesOutput{ + NetworkInterfaces: []ec2types.NetworkInterface{{ + NetworkInterfaceId: aws.String("eni-pub1"), + }}, + }, nil + } + // Associate fails (e.g. other invocation won the race) + mock.AssociateAddressFn = func(ctx context.Context, params *ec2.AssociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AssociateAddressOutput, error) { + return nil, fmt.Errorf("Resource.AlreadyAssociated: EIP already associated") + } + h := newTestHandler(mock) + h.attachEIP(context.Background(), "i-nat1", testAZ) + + if mock.callCount("ReleaseAddress") != 1 { + t.Errorf("expected ReleaseAddress=1 (orphaned alloc released), got %d", mock.callCount("ReleaseAddress")) + } + }) +} + +// TestRace_R7_AssociateFailsAfterRecheck verifies that when the re-check shows +// no EIP but AssociateAddress still fails (another invocation raced between +// re-check and associate), the allocated EIP is properly released. 
+// +// Race scenario: +// - Lambda A: AllocateAddress → re-check ENI → no EIP → AssociateAddress +// - Lambda B: between A's re-check and associate, B associates its own EIP +// - Lambda A: AssociateAddress fails +// +// Mitigation: attachEIP releases the allocated EIP on AssociateAddress failure. +func TestRace_R7_AssociateFailsAfterRecheck(t *testing.T) { + mock := &mockEC2{} + eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil + } + mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { + return &ec2.AllocateAddressOutput{AllocationId: aws.String("eipalloc-orphan"), PublicIp: aws.String("3.3.3.3")}, nil + } + // Re-check: no EIP (race window still open) + mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { + return &ec2.DescribeNetworkInterfacesOutput{ + NetworkInterfaces: []ec2types.NetworkInterface{{ + NetworkInterfaceId: aws.String("eni-pub1"), + }}, + }, nil + } + // Associate fails: another invocation raced us + mock.AssociateAddressFn = func(ctx context.Context, params *ec2.AssociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AssociateAddressOutput, error) { + return nil, fmt.Errorf("InvalidParameterValue: EIP already in use") + } + h := newTestHandler(mock) + h.attachEIP(context.Background(), "i-nat1", testAZ) + + if mock.callCount("ReleaseAddress") != 1 { + t.Errorf("expected ReleaseAddress=1 (orphaned allocation), got %d", mock.callCount("ReleaseAddress")) + } +} + +// apiError implements smithy.APIError for test use. 
+type apiError struct { + code string + message string +} + +func (e *apiError) Error() string { return e.message } +func (e *apiError) ErrorCode() string { return e.code } +func (e *apiError) ErrorMessage() string { return e.message } +func (e *apiError) ErrorFault() smithy.ErrorFault { return smithy.FaultServer } + +// Ensure apiError satisfies the smithy.APIError interface. +var _ smithy.APIError = (*apiError)(nil) + +// TestRace_R8_DisassociateAlreadyRemoved verifies that detachEIP handles the +// case where EC2 auto-disassociates the EIP when the instance stops, before +// Lambda's detachEIP runs. +// +// Race scenario: +// - NAT instance stops, EC2 auto-disassociates the EIP from the ENI +// - Lambda's detachEIP fires, gets stale association data from DescribeNetworkInterfaces +// - DisassociateAddress returns InvalidAssociationID.NotFound +// +// Mitigation: detachEIP catches InvalidAssociationID.NotFound and still proceeds +// to release the EIP allocation. +func TestRace_R8_DisassociateAlreadyRemoved(t *testing.T) { + mock := &mockEC2{} + eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(makeTestInstance("i-nat1", "stopped", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil + } + // ENI still shows stale association data + mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { + return &ec2.DescribeNetworkInterfacesOutput{ + NetworkInterfaces: []ec2types.NetworkInterface{{ + NetworkInterfaceId: aws.String("eni-pub1"), + Association: &ec2types.NetworkInterfaceAssociation{ + AssociationId: aws.String("eipassoc-stale"), + AllocationId: aws.String("eipalloc-1"), + PublicIp: aws.String("1.2.3.4"), + }, + }}, + }, nil + } + // 
Disassociate fails: EC2 already removed it + mock.DisassociateAddressFn = func(ctx context.Context, params *ec2.DisassociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.DisassociateAddressOutput, error) { + return nil, &apiError{code: "InvalidAssociationID.NotFound", message: "Association not found"} + } + mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + return &ec2.DescribeAddressesOutput{Addresses: []ec2types.Address{}}, nil + } + h := newTestHandler(mock) + h.detachEIP(context.Background(), "i-nat1", testAZ) + + if mock.callCount("DisassociateAddress") != 1 { + t.Errorf("expected DisassociateAddress=1 (attempted), got %d", mock.callCount("DisassociateAddress")) + } + // Critical: ReleaseAddress must still be called despite disassociate "failure" + if mock.callCount("ReleaseAddress") != 1 { + t.Errorf("expected ReleaseAddress=1 (EIP freed despite NotFound), got %d", mock.callCount("ReleaseAddress")) + } +} + +// TestRace_R9_DisassociateNonNotFoundError verifies the current behavior when +// DisassociateAddress fails with a non-NotFound error (e.g. throttling). +// +// Race scenario: +// - Lambda calls DisassociateAddress but gets throttled +// - detachEIP returns early without releasing the EIP allocation +// +// UNMITIGATED: The EIP is leaked. However, the orphan sweep in a subsequent +// detachEIP invocation will clean it up. 
+func TestRace_R9_DisassociateNonNotFoundError(t *testing.T) { + t.Run("throttle error skips release (documents gap)", func(t *testing.T) { + mock := &mockEC2{} + eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(makeTestInstance("i-nat1", "stopped", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil + } + mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { + return &ec2.DescribeNetworkInterfacesOutput{ + NetworkInterfaces: []ec2types.NetworkInterface{{ + NetworkInterfaceId: aws.String("eni-pub1"), + Association: &ec2types.NetworkInterfaceAssociation{ + AssociationId: aws.String("eipassoc-1"), + AllocationId: aws.String("eipalloc-1"), + PublicIp: aws.String("1.2.3.4"), + }, + }}, + }, nil + } + // DisassociateAddress fails with throttle (not NotFound) + mock.DisassociateAddressFn = func(ctx context.Context, params *ec2.DisassociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.DisassociateAddressOutput, error) { + return nil, fmt.Errorf("Throttling: Rate exceeded") + } + h := newTestHandler(mock) + h.detachEIP(context.Background(), "i-nat1", testAZ) + + // Current behavior: returns early, ReleaseAddress NOT called (the gap) + if mock.callCount("ReleaseAddress") != 0 { + t.Error("expected ReleaseAddress=0 (current behavior: early return on non-NotFound error)") + } + // Orphan sweep also skipped because we returned early + if mock.callCount("DescribeAddresses") != 0 { + t.Error("expected DescribeAddresses=0 (orphan sweep skipped due to early return)") + } + }) + + t.Run("orphan sweep cleans up on next successful detach", func(t *testing.T) { + mock := &mockEC2{} + eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) + 
mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(makeTestInstance("i-nat1", "stopped", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil + } + // No current association (already gone) + mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { + return &ec2.DescribeNetworkInterfacesOutput{ + NetworkInterfaces: []ec2types.NetworkInterface{{ + NetworkInterfaceId: aws.String("eni-pub1"), + }}, + }, nil + } + // Orphan sweep finds the leaked EIP from previous failed detach + mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + return &ec2.DescribeAddressesOutput{ + Addresses: []ec2types.Address{{ + AllocationId: aws.String("eipalloc-leaked"), + }}, + }, nil + } + h := newTestHandler(mock) + h.detachEIP(context.Background(), "i-nat1", testAZ) + + // Orphan sweep cleans up the leaked EIP + if mock.callCount("ReleaseAddress") != 1 { + t.Errorf("expected ReleaseAddress=1 (orphan sweep), got %d", mock.callCount("ReleaseAddress")) + } + }) +} + +// TestRace_R10_ENIAvailabilityTimeout verifies the behavior when ENIs never +// reach "available" status during replaceNAT, e.g. due to EC2 API delays. +// +// Race scenario: +// - replaceNAT terminates old NAT and waits for ENIs to become "available" +// - DescribeNetworkInterfaces keeps returning "in-use" (EC2 delay) +// - Wait loop exhausts all 60 iterations +// +// Accepted risk: createNAT proceeds anyway. The launch template may fail to +// attach the ENI, but the next workload event will retry. 
+func TestRace_R10_ENIAvailabilityTimeout(t *testing.T) { + mock := &mockEC2{} + natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} + + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + // All getInstance calls return terminated (waitForTermination succeeds immediately) + return describeResponse(makeTestInstance("i-old", "terminated", testVPC, testAZ, natTags, nil)), nil + } + // ENI never becomes available (always in-use) + mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { + return &ec2.DescribeNetworkInterfacesOutput{ + NetworkInterfaces: []ec2types.NetworkInterface{ + {NetworkInterfaceId: aws.String("eni-1"), Status: ec2types.NetworkInterfaceStatusInUse}, + }, + }, nil + } + mock.DescribeLaunchTemplatesFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplatesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplatesOutput, error) { + return &ec2.DescribeLaunchTemplatesOutput{ + LaunchTemplates: []ec2types.LaunchTemplate{{LaunchTemplateId: aws.String("lt-123")}}, + }, nil + } + mock.DescribeLaunchTemplateVersionsFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplateVersionsInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplateVersionsOutput, error) { + return &ec2.DescribeLaunchTemplateVersionsOutput{ + LaunchTemplateVersions: []ec2types.LaunchTemplateVersion{{ + LaunchTemplateId: aws.String("lt-123"), VersionNumber: aws.Int64(1), + }}, + }, nil + } + mock.DescribeImagesFn = func(ctx context.Context, params *ec2.DescribeImagesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeImagesOutput, error) { + return &ec2.DescribeImagesOutput{Images: []ec2types.Image{}}, nil + } + mock.RunInstancesFn = func(ctx context.Context, params 
*ec2.RunInstancesInput, optFns ...func(*ec2.Options)) (*ec2.RunInstancesOutput, error) { + return &ec2.RunInstancesOutput{ + Instances: []ec2types.Instance{{InstanceId: aws.String("i-new")}}, + }, nil + } + + h := newTestHandler(mock) + eni := makeENI("eni-1", 0, "10.0.1.10", nil) + inst := &Instance{ + InstanceID: "i-old", + StateName: "running", + NetworkInterfaces: []ec2types.InstanceNetworkInterface{eni}, + } + result := h.replaceNAT(context.Background(), inst, testAZ, testVPC) + + // createNAT still called despite ENI timeout (accepted risk) + if result != "i-new" { + t.Errorf("expected createNAT to proceed despite ENI timeout, got %q", result) + } + if mock.callCount("RunInstances") != 1 { + t.Errorf("expected RunInstances=1 (createNAT proceeded), got %d", mock.callCount("RunInstances")) + } + // ENI wait should have polled multiple times before timing out (loop caps at 60 iterations) + if mock.callCount("DescribeNetworkInterfaces") < 2 { + t.Errorf("expected multiple DescribeNetworkInterfaces polls, got %d", mock.callCount("DescribeNetworkInterfaces")) + } +} diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md index 31ab0af..3a52e6b 100644 --- a/docs/ARCHITECTURE.md +++ b/docs/ARCHITECTURE.md @@ -283,3 +283,135 @@ Costs per AZ, per month. Includes the [AWS public IPv4 charge](https://aws.amazo | AMI | fck-nat AMI | fck-nat AMI (same) | | Complexity | Low (ASG only) | Higher (Lambda + EventBridge) | | Best for | Production 24/7 | Dev/staging, intermittent workloads | + +## Race Conditions + +Because multiple Lambda invocations can fire concurrently from overlapping EventBridge events, and because the EC2 API is eventually consistent, the Lambda must handle numerous race conditions. This section catalogs each identified race, its severity, and how (or whether) it is mitigated.
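The races in this catalog all start from the same entry point: an EC2 state-change notification decoded into the `Event{InstanceID, State}` shape the tests use. As a point of reference, here is a minimal sketch of that decoding, assuming the standard EventBridge "EC2 Instance State-change Notification" detail format; the module's actual envelope parsing in `main.go` is not shown in this hunk, so `parseEvent` is a hypothetical name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event mirrors the shape consumed by the handler tests in this patch
// (Event{InstanceID, State}).
type Event struct {
	InstanceID string
	State      string
}

// stateChangeDetail matches the documented detail payload of an
// EventBridge "EC2 Instance State-change Notification" event.
type stateChangeDetail struct {
	InstanceID string `json:"instance-id"`
	State      string `json:"state"`
}

// parseEvent (hypothetical) extracts the fields the handler needs from
// the raw EventBridge envelope.
func parseEvent(raw []byte) (Event, error) {
	var envelope struct {
		Detail stateChangeDetail `json:"detail"`
	}
	if err := json.Unmarshal(raw, &envelope); err != nil {
		return Event{}, err
	}
	return Event{InstanceID: envelope.Detail.InstanceID, State: envelope.Detail.State}, nil
}

func main() {
	raw := []byte(`{"detail-type":"EC2 Instance State-change Notification","detail":{"instance-id":"i-0abc","state":"shutting-down"}}`)
	ev, err := parseEvent(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s -> %s\n", ev.InstanceID, ev.State) // prints "i-0abc -> shutting-down"
}
```

Every instance in the VPC fires one of these events per state transition, which is why a single workload lifecycle can trigger several overlapping invocations.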
+ +### Race Condition Catalog + +| ID | Description | Trigger | Mitigation | Status | Test | +|----|-------------|---------|------------|--------|------| +| R1 | **Stale sibling from EC2 eventual consistency** — dying workload still shows as `running` in DescribeInstances | Scale-down event fires before EC2 API reflects the state change | `findSiblings` excludes trigger instance ID; `maybeStopNAT` retries 3x with 2s delay | MITIGATED | `TestRace_R1` | +| R2 | **Terminated instance gone from API** — `classify` returns `ignore=true`, scale-down event lost | Instance already purged from EC2 API by the time Lambda runs | Handler detects `isTerminating(state)` + `ignore` and calls `sweepIdleNATs` to check all NATs | MITIGATED | `TestRace_R2` | +| R3 | **Retry exhaustion** — EC2 consistency takes >6s (3x2s retries), false siblings persist | Unusually long EC2 API propagation delay | None — NAT stays running until next event or sweep catches it | ACCEPTED | `TestRace_R3` | +| R4 | **Duplicate NAT creation** — two concurrent workload events both see no NAT, both call `createNAT` | Two workloads start simultaneously in the same AZ | `findNAT` detects multiple NATs, keeps the first running one, terminates extras | MITIGATED | `TestRace_R4` | +| R5 | **Start/stop overlap** — scale-up starts NAT while concurrent scale-down stops it | Workload starts while last workload is terminating | `startNAT` waits for `stopped` state then starts; brief delay but correct | ACCEPTED | `TestRace_R5` | +| R6 | **Double EIP allocation** — concurrent pending+running events both allocate EIPs | Two EventBridge events for same NAT instance arrive concurrently | `attachEIP` re-checks ENI after `AllocateAddress`; releases duplicate if EIP already present | MITIGATED | `TestRace_R6` | +| R7 | **Associate fails after re-check** — another invocation associates between re-check and `AssociateAddress` | Very tight race window between DescribeNetworkInterfaces and AssociateAddress | `attachEIP` releases 
allocated EIP on `AssociateAddress` failure | MITIGATED | `TestRace_R7` | +| R8 | **Disassociate on already-removed association** — EC2 auto-disassociates EIP on stop before Lambda runs | EC2 instance stop completes and auto-removes EIP before `detachEIP` | `detachEIP` catches `InvalidAssociationID.NotFound` and still releases the allocation | MITIGATED | `TestRace_R8` | +| R9 | **Orphan EIP from non-NotFound error** — `DisassociateAddress` fails with throttle/other error | API throttling during EIP detach | `detachEIP` returns early without releasing; orphan sweep on next detach cleans up | UNMITIGATED | `TestRace_R9` | +| R10 | **ENI availability timeout** — ENI never reaches `available` after terminate | EC2 delay in releasing ENI from terminated instance | `replaceNAT` proceeds with `createNAT` after timeout; launch template may fail but next event retries | ACCEPTED | `TestRace_R10` | + +### Why Event-Driven NAT Has Races + +Traditional NAT (e.g. fck-nat with ASG) runs a single instance continuously — no concurrency, no races. nat-zero trades that simplicity for cost savings by reacting to events. This means: + +1. **Multiple triggers per lifecycle**: A single workload going from `pending` → `running` fires two EventBridge events, each invoking a separate Lambda. A NAT instance similarly fires `pending` → `running` → `stopping` → `stopped`, each potentially overlapping with workload events. + +2. **EC2 eventual consistency**: When EventBridge fires a `shutting-down` event, `DescribeInstances` may still return the instance as `running` for several seconds. This is the root cause of R1, R2, and R3. + +3. **No distributed lock**: Lambda invocations run independently with no shared state. The EC2 API itself is the only coordination point, and it's eventually consistent. 
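The retry mitigation for R1/R3 ("retries 3x with 2s delay") amounts to polling the sibling query until the EC2 API converges or the retry budget runs out. A self-contained sketch, with the hypothetical name `retrySiblingCheck` standing in for the logic inside `maybeStopNAT` (not shown in this hunk); the sibling query is injected so the sketch has no AWS dependency:

```go
package main

import (
	"fmt"
	"time"
)

// retrySiblingCheck re-queries siblings up to `attempts` times, pausing
// between tries to ride out EC2 eventual consistency. Returns true if
// the AZ is observed empty (safe to stop the NAT) within the budget.
func retrySiblingCheck(findSiblings func() []string, attempts int, delay time.Duration) bool {
	for i := 0; i < attempts; i++ {
		if len(findSiblings()) == 0 {
			return true // no live workloads: safe to stop the NAT
		}
		if i < attempts-1 {
			time.Sleep(delay) // give the EC2 API time to converge
		}
	}
	return false // sibling persisted across all retries (R3: accepted risk)
}

func main() {
	calls := 0
	stale := func() []string {
		calls++
		if calls < 3 {
			return []string{"i-stale"} // stale sibling on the first two reads
		}
		return nil // API converged on the third read
	}
	fmt.Println(retrySiblingCheck(stale, 3, time.Millisecond)) // prints "true"
}
```

The budget is deliberately bounded: an unbounded wait would hold the Lambda open against its timeout, so consistency delays longer than the budget fall through to R3's accepted risk.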
+ +### Sequence Diagrams + +#### R1: Stale Sibling (Scale-Down Race) + +``` + EventBridge Lambda A EC2 API + │ │ │ + │ shutting-down │ │ + │ (i-work1) │ │ + ├───────────────────>│ │ + │ │ DescribeInstances │ + │ │ (findSiblings, │ + │ │ exclude=i-work1) │ + │ ├─────────────────────>│ + │ │ │ i-work1 still shows + │ │<─────────────────────┤ "running" (stale!) + │ │ │ BUT excluded by ID + │ │ │ + │ │ No siblings found │ + │ │ StopInstances(NAT) │ + │ ├─────────────────────>│ + │ │ │ +``` + +Without the `excludeID` parameter, i-work1 would count as a sibling and the NAT would never stop. The retry loop (R3) handles cases where a *different* workload is stale. + +#### R4: Duplicate NAT Creation (Scale-Up Race) + +``` + EventBridge Lambda A Lambda B EC2 API + │ │ │ │ + │ pending │ │ │ + │ (i-work1) │ │ │ + ├───────────────>│ │ │ + │ pending │ │ │ + │ (i-work2) │ │ │ + ├───────────────────────────────────────>│ │ + │ │ │ │ + │ │ findNAT → nil │ │ + │ ├──────────────────────────────────────────>│ + │ │ │ findNAT → nil │ + │ │ ├───────────────────>│ + │ │ │ │ + │ │ RunInstances │ │ + │ │ → i-nat1 │ │ + │ ├──────────────────────────────────────────>│ + │ │ │ RunInstances │ + │ │ │ → i-nat2 │ + │ │ ├───────────────────>│ + │ │ │ │ + │ ┌─────┴──────────────────────┴─────┐ │ + │ │ Later: any findNAT call sees │ │ + │ │ both i-nat1 and i-nat2 │ │ + │ │ → keeps first running NAT │ │ + │ │ → TerminateInstances(extra) │ │ + │ └──────────────────────────────────┘ │ +``` + +#### R6: Double EIP Allocation (Concurrent attachEIP) + +``` + EventBridge Lambda A Lambda B EC2 API + │ │ │ │ + │ pending │ │ │ + │ (NAT) │ │ │ + ├───────────────>│ │ │ + │ running │ │ │ + │ (NAT) │ │ │ + ├───────────────────────────────────────>│ │ + │ │ │ │ + │ │ Check ENI: no EIP │ │ + │ ├──────────────────────────────────────────>│ + │ │ │ Check ENI: no EIP │ + │ │ ├───────────────────>│ + │ │ │ │ + │ │ AllocateAddress │ │ + │ │ → eipalloc-A │ │ + │ ├──────────────────────────────────────────>│ + │ │ │ 
AllocateAddress │ + │ │ │ → eipalloc-B │ + │ │ ├───────────────────>│ + │ │ │ │ + │ │ Re-check ENI: │ │ + │ │ still no EIP │ │ + │ ├──────────────────────────────────────────>│ + │ │ │ │ + │ │ AssociateAddress │ │ + │ │ (eipalloc-A) │ │ + │ ├──────────────────────────────────────────>│ + │ │ │ │ + │ │ │ Re-check ENI: │ + │ │ │ EIP-A present! │ + │ │ ├───────────────────>│ + │ │ │ │ + │ │ │ Race detected! │ + │ │ │ ReleaseAddress │ + │ │ │ (eipalloc-B) │ + │ │ ├───────────────────>│ + │ │ │ │ +``` + +If Lambda B's re-check also misses EIP-A (very tight window), `AssociateAddress` will fail and Lambda B releases eipalloc-B in the error handler. The orphan sweep in `detachEIP` provides a final safety net. From 2e9dab556892f49ceecc8a1e869d14eef7a21680 Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Wed, 25 Feb 2026 13:14:24 +1000 Subject: [PATCH 10/30] style: fix gofmt alignment in race_test.go Co-Authored-By: Claude Opus 4.6 --- cmd/lambda/race_test.go | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/cmd/lambda/race_test.go b/cmd/lambda/race_test.go index 21ca3f2..58a960e 100644 --- a/cmd/lambda/race_test.go +++ b/cmd/lambda/race_test.go @@ -481,9 +481,9 @@ type apiError struct { message string } -func (e *apiError) Error() string { return e.message } -func (e *apiError) ErrorCode() string { return e.code } -func (e *apiError) ErrorMessage() string { return e.message } +func (e *apiError) Error() string { return e.message } +func (e *apiError) ErrorCode() string { return e.code } +func (e *apiError) ErrorMessage() string { return e.message } func (e *apiError) ErrorFault() smithy.ErrorFault { return smithy.FaultServer } // Ensure apiError satisfies the smithy.APIError interface. 
From 4a9c336510693c581ee74e2d878d0c971c51732b Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Wed, 25 Feb 2026 13:42:32 +1000 Subject: [PATCH 11/30] fix: sweep orphan EIPs on NAT termination (R11) Extract sweepOrphanEIPs from detachEIP so it can be called independently. Add isTerminating branch to NAT event handler to sweep orphan EIPs when a NAT is terminated without a stop cycle (e.g. replaceNAT, spot reclaim). Also document R12 (sweepIdleNATs lacks retry) as an accepted risk. Co-Authored-By: Claude Opus 4.6 --- cmd/lambda/ec2ops.go | 9 ++- cmd/lambda/handler.go | 5 ++ cmd/lambda/handler_test.go | 10 ++- cmd/lambda/race_test.go | 136 +++++++++++++++++++++++++++++++++++++ docs/ARCHITECTURE.md | 2 + 5 files changed, 157 insertions(+), 5 deletions(-) diff --git a/cmd/lambda/ec2ops.go b/cmd/lambda/ec2ops.go index 96141e0..8c5e06d 100644 --- a/cmd/lambda/ec2ops.go +++ b/cmd/lambda/ec2ops.go @@ -322,8 +322,13 @@ func (h *Handler) detachEIP(ctx context.Context, instanceID, az string) { } } - // Orphan sweep: release any EIPs tagged for this AZ that were left behind - // by a concurrent attachEIP race. + h.sweepOrphanEIPs(ctx, az) +} + +// sweepOrphanEIPs releases any EIPs tagged for this AZ that were left behind +// by concurrent attachEIP races or NAT termination without a stop cycle. +func (h *Handler) sweepOrphanEIPs(ctx context.Context, az string) { + defer timed("sweep_orphan_eips")() addrResp, err := h.EC2.DescribeAddresses(ctx, &ec2.DescribeAddressesInput{ Filters: []ec2types.Filter{ {Name: aws.String("tag:" + h.NATTagKey), Values: []string{h.NATTagValue}}, diff --git a/cmd/lambda/handler.go b/cmd/lambda/handler.go index 73400b9..fba4eac 100644 --- a/cmd/lambda/handler.go +++ b/cmd/lambda/handler.go @@ -63,6 +63,11 @@ func (h *Handler) handle(ctx context.Context, event Event) error { h.attachEIP(ctx, iid, az) } else if isStopping(state) { h.detachEIP(ctx, iid, az) + } else if isTerminating(state) { + // R11: NAT terminated without a stop cycle (e.g. 
replaceNAT, + // spot reclaim, manual termination). The stopping/stopped events + // that trigger detachEIP will never fire, so sweep orphan EIPs. + h.sweepOrphanEIPs(ctx, az) } return nil } diff --git a/cmd/lambda/handler_test.go b/cmd/lambda/handler_test.go index 0297f1c..931d236 100644 --- a/cmd/lambda/handler_test.go +++ b/cmd/lambda/handler_test.go @@ -242,7 +242,7 @@ func TestHandlerNatEvents(t *testing.T) { } }) - t.Run("terminated NAT is noop", func(t *testing.T) { + t.Run("terminated NAT sweeps orphan EIPs", func(t *testing.T) { mock := &mockEC2{} natInst := makeTestInstance("i-nat1", "terminated", testVPC, testAZ, natTags, nil) mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { @@ -253,8 +253,12 @@ func TestHandlerNatEvents(t *testing.T) { if err != nil { t.Fatalf("unexpected error: %v", err) } - if mock.callCount("AllocateAddress") != 0 || mock.callCount("DisassociateAddress") != 0 { - t.Error("expected no EIP operations for terminated NAT") + if mock.callCount("AllocateAddress") != 0 { + t.Error("expected no AllocateAddress for terminated NAT") + } + // sweepOrphanEIPs runs (DescribeAddresses called) + if mock.callCount("DescribeAddresses") != 1 { + t.Errorf("expected DescribeAddresses=1 (orphan sweep), got %d", mock.callCount("DescribeAddresses")) } }) } diff --git a/cmd/lambda/race_test.go b/cmd/lambda/race_test.go index 58a960e..f81b8e1 100644 --- a/cmd/lambda/race_test.go +++ b/cmd/lambda/race_test.go @@ -683,3 +683,139 @@ func TestRace_R10_ENIAvailabilityTimeout(t *testing.T) { t.Errorf("expected multiple DescribeNetworkInterfaces polls, got %d", mock.callCount("DescribeNetworkInterfaces")) } } + +// TestRace_R11_EIPOrphanOnNATTermination verifies that when a NAT instance is +// terminated (not stopped), orphan EIPs are cleaned up via sweepOrphanEIPs. 
+// +// Race scenario: +// - NAT is terminated by replaceNAT, spot reclaim, or manual action +// - No stopping/stopped EventBridge events fire, so detachEIP never runs +// - The EIP allocation leaks (still allocated, no longer associated) +// +// Mitigation: handler detects isTerminating(state) for NAT events and calls +// sweepOrphanEIPs to release any EIPs tagged for that AZ. +func TestRace_R11_EIPOrphanOnNATTermination(t *testing.T) { + t.Run("shutting-down NAT sweeps orphan EIPs", func(t *testing.T) { + mock := &mockEC2{} + natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} + natInst := makeTestInstance("i-nat1", "shutting-down", testVPC, testAZ, natTags, nil) + + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(natInst), nil + } + // Orphan EIP left behind from the now-terminating NAT + mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + return &ec2.DescribeAddressesOutput{ + Addresses: []ec2types.Address{{ + AllocationId: aws.String("eipalloc-orphan"), + }}, + }, nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-nat1", State: "shutting-down"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("ReleaseAddress") != 1 { + t.Errorf("expected ReleaseAddress=1 (orphan EIP cleaned up), got %d", mock.callCount("ReleaseAddress")) + } + }) + + t.Run("terminated NAT sweeps orphan EIPs", func(t *testing.T) { + mock := &mockEC2{} + natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} + natInst := makeTestInstance("i-nat1", "terminated", testVPC, testAZ, natTags, nil) + + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns 
...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(natInst), nil + } + mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + return &ec2.DescribeAddressesOutput{ + Addresses: []ec2types.Address{{ + AllocationId: aws.String("eipalloc-orphan"), + AssociationId: aws.String("eipassoc-stale"), + }}, + }, nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-nat1", State: "terminated"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + // Should disassociate (stale) then release + if mock.callCount("DisassociateAddress") != 1 { + t.Errorf("expected DisassociateAddress=1, got %d", mock.callCount("DisassociateAddress")) + } + if mock.callCount("ReleaseAddress") != 1 { + t.Errorf("expected ReleaseAddress=1, got %d", mock.callCount("ReleaseAddress")) + } + }) + + t.Run("no orphan EIPs is noop", func(t *testing.T) { + mock := &mockEC2{} + natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} + natInst := makeTestInstance("i-nat1", "shutting-down", testVPC, testAZ, natTags, nil) + + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + return describeResponse(natInst), nil + } + mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + return &ec2.DescribeAddressesOutput{Addresses: []ec2types.Address{}}, nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-nat1", State: "shutting-down"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("ReleaseAddress") != 0 { + t.Error("expected no ReleaseAddress when no orphans") + } + }) +} + +// 
TestRace_R12_SweepIdleNATsLacksRetry documents that sweepIdleNATs calls +// findSiblings once per NAT without the retry loop that maybeStopNAT uses. +// +// Race scenario: +// - sweepIdleNATs fires (R2 fallback: classify can't find trigger instance) +// - findSiblings returns a stale sibling due to EC2 eventual consistency +// - NAT is not stopped because it appears to have active workloads +// +// Accepted risk: sweepIdleNATs is itself a fallback for the rare case where +// both shutting-down and terminated events fail to classify. Adding retry +// here would compound Lambda execution time for a path that rarely fires. +// The next lifecycle event will eventually stop the NAT. +func TestRace_R12_SweepIdleNATsLacksRetry(t *testing.T) { + mock := &mockEC2{} + natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} + workTags := []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}} + natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil) + staleInst := makeTestInstance("i-stale", "running", testVPC, testAZ, workTags, nil) + + var callIdx int32 + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + idx := atomic.AddInt32(&callIdx, 1) + if idx == 1 { + // classify: instance gone + return describeResponse(), nil + } + if params.Filters != nil { + for _, f := range params.Filters { + if aws.ToString(f.Name) == "tag:nat-zero:managed" { + return describeResponse(natInst), nil + } + } + } + // findSiblings: stale sibling (no retry in sweep path) + return describeResponse(staleInst), nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-gone", State: "terminated"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + // Accepted risk: NAT not stopped because stale sibling found (no retry) + if mock.callCount("StopInstances") != 0 { + 
t.Error("expected StopInstances=0 (sweep has no retry, stale sibling blocks stop)") + } +} diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md index 3a52e6b..7fb7182 100644 --- a/docs/ARCHITECTURE.md +++ b/docs/ARCHITECTURE.md @@ -302,6 +302,8 @@ Because multiple Lambda invocations can fire concurrently from overlapping Event | R8 | **Disassociate on already-removed association** — EC2 auto-disassociates EIP on stop before Lambda runs | EC2 instance stop completes and auto-removes EIP before `detachEIP` | `detachEIP` catches `InvalidAssociationID.NotFound` and still releases the allocation | MITIGATED | `TestRace_R8` | | R9 | **Orphan EIP from non-NotFound error** — `DisassociateAddress` fails with throttle/other error | API throttling during EIP detach | `detachEIP` returns early without releasing; orphan sweep on next detach cleans up | UNMITIGATED | `TestRace_R9` | | R10 | **ENI availability timeout** — ENI never reaches `available` after terminate | EC2 delay in releasing ENI from terminated instance | `replaceNAT` proceeds with `createNAT` after timeout; launch template may fail but next event retries | ACCEPTED | `TestRace_R10` | +| R11 | **EIP orphan on NAT termination** — NAT terminated without stop cycle, `detachEIP` never fires | `replaceNAT`, spot reclaim, manual termination | Handler detects `isTerminating(state)` for NAT events and calls `sweepOrphanEIPs` to release tagged EIPs | MITIGATED | `TestRace_R11` | +| R12 | **sweepIdleNATs lacks retry** — stale sibling blocks sweep from stopping idle NAT | EC2 eventual consistency during fallback sweep path | None — sweep is itself a rare fallback (R2); retry budget would compound Lambda execution time | ACCEPTED | `TestRace_R12` | ### Why Event-Driven NAT Has Races From 4d99138567f4469e05c26db9ab5a2afdd187a511 Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Wed, 25 Feb 2026 15:26:29 +1000 Subject: [PATCH 12/30] refactor: replace event-driven logic with reconciliation + reserved concurrency 
MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Eliminate all race conditions by setting reserved_concurrent_executions=1 (single writer) and replacing the reactive classify/ensureNAT/maybeStopNAT logic with a reconciliation loop that observes current state and takes at most one action per invocation. Key changes: - handler.go: resolveAZ + reconcile replace classify + event branching - ec2ops.go: findWorkloads/findNATs/findEIPs replace findSiblings/sweepIdleNATs/sweepOrphanEIPs - Remove waitForState, waitForTermination, replaceNAT, all polling loops - Remove SleepFunc (no sleeping in the reconciler) - lambda.tf: timeout 300→30s, add reserved_concurrent_executions=1 - Delete race_test.go (no race conditions with single writer) - Update ARCHITECTURE.md to document reconciliation model Production code: ~725 lines (down from ~900), race conditions: 0 (down from 12) Co-Authored-By: Claude Opus 4.6 --- cmd/lambda/ec2ops.go | 443 +++++----------- cmd/lambda/ec2ops_test.go | 1014 ++++++++---------------------------- cmd/lambda/handler.go | 190 ++++--- cmd/lambda/handler_test.go | 865 ++++++++++++++---------------- cmd/lambda/mock_test.go | 2 - cmd/lambda/race_test.go | 821 ----------------------------- docs/ARCHITECTURE.md | 360 ++++--------- lambda.tf | 21 +- 8 files changed, 955 insertions(+), 2761 deletions(-) delete mode 100644 cmd/lambda/race_test.go diff --git a/cmd/lambda/ec2ops.go b/cmd/lambda/ec2ops.go index 8c5e06d..c93b1d5 100644 --- a/cmd/lambda/ec2ops.go +++ b/cmd/lambda/ec2ops.go @@ -7,7 +7,6 @@ import ( "log" "sort" "strings" - "time" "github.com/aws/aws-sdk-go-v2/aws" "github.com/aws/aws-sdk-go-v2/service/ec2" @@ -63,42 +62,43 @@ func (h *Handler) getInstance(ctx context.Context, instanceID string) *Instance return instanceFromAPI(resp.Reservations[0].Instances[0]) } -func (h *Handler) classify(ctx context.Context, instanceID string) (ignore, isNAT bool, az, vpc string) { - defer timed("classify")() - inst := 
h.getInstance(ctx, instanceID) - if inst == nil { - return true, false, "", "" - } - if inst.VpcID != h.TargetVPC { - return true, false, "", "" - } - if hasTag(inst.Tags, h.IgnoreTagKey, h.IgnoreTagValue) { - return true, false, inst.AZ, inst.VpcID +// --- Reconciliation queries --- + +// findWorkloads returns all pending/running instances in the AZ that are not +// NAT instances and not ignored. +func (h *Handler) findWorkloads(ctx context.Context, az, vpc string) []*Instance { + defer timed("find_workloads")() + resp, err := h.EC2.DescribeInstances(ctx, &ec2.DescribeInstancesInput{ + Filters: []ec2types.Filter{ + {Name: aws.String("availability-zone"), Values: []string{az}}, + {Name: aws.String("vpc-id"), Values: []string{vpc}}, + {Name: aws.String("instance-state-name"), Values: []string{"pending", "running"}}, + }, + }) + if err != nil { + log.Printf("Error finding workloads: %v", err) + return nil } - return false, hasTag(inst.Tags, h.NATTagKey, h.NATTagValue), inst.AZ, inst.VpcID -} -func (h *Handler) waitForState(ctx context.Context, instanceID string, states []string, timeout int) bool { - iterations := timeout / 2 - for i := 0; i < iterations; i++ { - inst := h.getInstance(ctx, instanceID) - if inst == nil { - return false - } - for _, s := range states { - if inst.StateName == s { - return true + var workloads []*Instance + for _, r := range resp.Reservations { + for _, i := range r.Instances { + inst := instanceFromAPI(i) + if hasTag(inst.Tags, h.NATTagKey, h.NATTagValue) { + continue + } + if hasTag(inst.Tags, h.IgnoreTagKey, h.IgnoreTagValue) { + continue } + workloads = append(workloads, inst) } - h.sleep(2 * time.Second) } - log.Printf("Timeout: %s never reached %v", instanceID, states) - return false + return workloads } -// findNAT finds the NAT instance in an AZ. Deduplicates if multiple exist. 
-func (h *Handler) findNAT(ctx context.Context, az, vpc string) *Instance { - defer timed("find_nat")() +// findNATs returns all NAT instances in an AZ (any non-terminated state). +func (h *Handler) findNATs(ctx context.Context, az, vpc string) []*Instance { + defer timed("find_nats")() resp, err := h.EC2.DescribeInstances(ctx, &ec2.DescribeInstancesInput{ Filters: []ec2types.Filter{ {Name: aws.String("tag:" + h.NATTagKey), Values: []string{h.NATTagValue}}, @@ -108,7 +108,7 @@ func (h *Handler) findNAT(ctx context.Context, az, vpc string) *Instance { }, }) if err != nil { - log.Printf("Error finding NAT: %v", err) + log.Printf("Error finding NATs: %v", err) return nil } @@ -118,19 +118,59 @@ func (h *Handler) findNAT(ctx context.Context, az, vpc string) *Instance { nats = append(nats, instanceFromAPI(i)) } } + return nats +} - if len(nats) == 0 { +// findEIPs returns all EIPs tagged for this AZ. +func (h *Handler) findEIPs(ctx context.Context, az string) []ec2types.Address { + defer timed("find_eips")() + resp, err := h.EC2.DescribeAddresses(ctx, &ec2.DescribeAddressesInput{ + Filters: []ec2types.Filter{ + {Name: aws.String("tag:" + h.NATTagKey), Values: []string{h.NATTagValue}}, + {Name: aws.String("tag:AZ"), Values: []string{az}}, + }, + }) + if err != nil { + log.Printf("Error finding EIPs: %v", err) + return nil + } + return resp.Addresses +} + +// findConfiguredAZs returns the AZs that have a launch template configured. 
+func (h *Handler) findConfiguredAZs(ctx context.Context) []string { + defer timed("find_configured_azs")() + resp, err := h.EC2.DescribeLaunchTemplates(ctx, &ec2.DescribeLaunchTemplatesInput{ + Filters: []ec2types.Filter{ + {Name: aws.String("tag:VpcId"), Values: []string{h.TargetVPC}}, + }, + }) + if err != nil || len(resp.LaunchTemplates) == 0 { return nil } - if len(nats) == 1 { - return nats[0] + + var azs []string + for _, lt := range resp.LaunchTemplates { + for _, tag := range lt.Tags { + if aws.ToString(tag.Key) == "AvailabilityZone" { + azs = append(azs, aws.ToString(tag.Value)) + } + } } + return azs +} + +// --- Reconciliation actions --- + +// terminateDuplicateNATs keeps the best NAT (prefer running) and terminates the rest. +// Returns the kept NAT as a single-element slice. +func (h *Handler) terminateDuplicateNATs(ctx context.Context, nats []*Instance) []*Instance { + log.Printf("%d NAT instances found, deduplicating", len(nats)) - // Race condition: multiple NATs. Keep the running one, terminate extras. - log.Printf("%d NAT instances in %s, deduplicating", len(nats), az) + // Prefer running instances. 
var running []*Instance for _, n := range nats { - if isStarting(n.StateName) { + if n.StateName == "pending" || n.StateName == "running" { running = append(running, n) } } @@ -138,88 +178,66 @@ func (h *Handler) findNAT(ctx context.Context, az, vpc string) *Instance { if len(running) > 0 { keep = running[0] } + for _, n := range nats { if n.InstanceID != keep.InstanceID { log.Printf("Terminating duplicate NAT %s", n.InstanceID) - _, err := h.EC2.TerminateInstances(ctx, &ec2.TerminateInstancesInput{ - InstanceIds: []string{n.InstanceID}, - }) - if err != nil { - log.Printf("Failed to terminate %s: %v", n.InstanceID, err) - } + h.terminateInstance(ctx, n.InstanceID) } } - return keep + return []*Instance{keep} } -func (h *Handler) findSiblings(ctx context.Context, az, vpc, excludeID string) []*Instance { - defer timed("find_siblings")() - resp, err := h.EC2.DescribeInstances(ctx, &ec2.DescribeInstancesInput{ - Filters: []ec2types.Filter{ - {Name: aws.String("availability-zone"), Values: []string{az}}, - {Name: aws.String("vpc-id"), Values: []string{vpc}}, - {Name: aws.String("instance-state-name"), Values: []string{"pending", "running"}}, - }, +func (h *Handler) terminateInstance(ctx context.Context, instanceID string) { + _, err := h.EC2.TerminateInstances(ctx, &ec2.TerminateInstancesInput{ + InstanceIds: []string{instanceID}, }) if err != nil { - log.Printf("Error finding siblings: %v", err) - return nil + log.Printf("Failed to terminate %s: %v", instanceID, err) } +} - var siblings []*Instance - for _, r := range resp.Reservations { - for _, i := range r.Instances { - inst := instanceFromAPI(i) - if inst.InstanceID == excludeID { - continue - } - if !hasTag(inst.Tags, h.NATTagKey, h.NATTagValue) && - !hasTag(inst.Tags, h.IgnoreTagKey, h.IgnoreTagValue) { - siblings = append(siblings, inst) - } - } +func (h *Handler) startInstance(ctx context.Context, instanceID string) { + _, err := h.EC2.StartInstances(ctx, &ec2.StartInstancesInput{ + InstanceIds: 
[]string{instanceID}, + }) + if err != nil { + log.Printf("Failed to start %s: %v", instanceID, err) + } else { + log.Printf("Started %s", instanceID) } - return siblings } -// --- EIP management (EventBridge-driven) --- - -func getPublicENI(inst *Instance) *ec2types.InstanceNetworkInterface { - for i := range inst.NetworkInterfaces { - if aws.ToInt32(inst.NetworkInterfaces[i].Attachment.DeviceIndex) == 0 { - return &inst.NetworkInterfaces[i] - } +func (h *Handler) stopInstance(ctx context.Context, instanceID string) { + _, err := h.EC2.StopInstances(ctx, &ec2.StopInstancesInput{ + InstanceIds: []string{instanceID}, + Force: aws.Bool(true), + }) + if err != nil { + log.Printf("Failed to stop %s: %v", instanceID, err) + } else { + log.Printf("Stopped %s", instanceID) } - return nil } -// attachEIP waits for the NAT instance to reach "running", then allocates and -// associates an EIP to the public ENI. Idempotent: no-op if ENI already has an EIP. -func (h *Handler) attachEIP(ctx context.Context, instanceID, az string) { - defer timed("attach_eip")() - - if !h.waitForState(ctx, instanceID, []string{"running"}, 120) { - return - } +// allocateAndAttachEIP allocates an EIP and associates it to the NAT's public ENI. +func (h *Handler) allocateAndAttachEIP(ctx context.Context, nat *Instance, az string) { + defer timed("allocate_and_attach_eip")() - inst := h.getInstance(ctx, instanceID) - if inst == nil { - return - } - eni := getPublicENI(inst) + eni := getPublicENI(nat) if eni == nil { - log.Printf("No public ENI on %s", instanceID) + log.Printf("No public ENI on %s", nat.InstanceID) return } - // Idempotent: if ENI already has an EIP, nothing to do. + eniID := aws.ToString(eni.NetworkInterfaceId) + + // If ENI already has an EIP (e.g. EIP tag query lagged), skip. 
if eni.Association != nil && aws.ToString(eni.Association.PublicIp) != "" { - log.Printf("ENI %s already has EIP %s", aws.ToString(eni.NetworkInterfaceId), aws.ToString(eni.Association.PublicIp)) + log.Printf("ENI %s already has EIP %s", eniID, aws.ToString(eni.Association.PublicIp)) return } - eniID := aws.ToString(eni.NetworkInterfaceId) - alloc, err := h.EC2.AllocateAddress(ctx, &ec2.AllocateAddressInput{ Domain: ec2types.DomainTypeVpc, TagSpecifications: []ec2types.TagSpecification{{ @@ -237,23 +255,6 @@ func (h *Handler) attachEIP(ctx context.Context, instanceID, az string) { } allocID := aws.ToString(alloc.AllocationId) - // Race-detection: re-check ENI before associating. Another invocation may - // have already attached an EIP between our first check and AllocateAddress. - niResp, descErr := h.EC2.DescribeNetworkInterfaces(ctx, &ec2.DescribeNetworkInterfacesInput{ - NetworkInterfaceIds: []string{eniID}, - }) - if descErr == nil && len(niResp.NetworkInterfaces) > 0 { - ni := niResp.NetworkInterfaces[0] - if ni.Association != nil && aws.ToString(ni.Association.PublicIp) != "" { - log.Printf("Race detected: ENI %s already has EIP %s, releasing %s", - eniID, aws.ToString(ni.Association.PublicIp), allocID) - h.EC2.ReleaseAddress(ctx, &ec2.ReleaseAddressInput{AllocationId: aws.String(allocID)}) - return - } - } else if descErr != nil { - log.Printf("Failed to re-check ENI %s (proceeding with associate): %v", eniID, descErr) - } - _, err = h.EC2.AssociateAddress(ctx, &ec2.AssociateAddressInput{ AllocationId: aws.String(allocID), NetworkInterfaceId: aws.String(eniID), @@ -266,95 +267,38 @@ func (h *Handler) attachEIP(ctx context.Context, instanceID, az string) { log.Printf("Attached EIP %s to %s", aws.ToString(alloc.PublicIp), eniID) } -// detachEIP waits for the NAT instance to reach "stopped", then disassociates -// and releases the EIP from the public ENI. Also sweeps for orphaned EIPs -// left by concurrent attachEIP races. 
-func (h *Handler) detachEIP(ctx context.Context, instanceID, az string) { - defer timed("detach_eip")() - - if !h.waitForState(ctx, instanceID, []string{"stopped"}, 120) { - return - } - - inst := h.getInstance(ctx, instanceID) - if inst == nil { - return - } - eni := getPublicENI(inst) - if eni == nil { - return - } - eniID := aws.ToString(eni.NetworkInterfaceId) - - niResp, err := h.EC2.DescribeNetworkInterfaces(ctx, &ec2.DescribeNetworkInterfacesInput{ - NetworkInterfaceIds: []string{eniID}, - }) - if err != nil { - log.Printf("Failed to describe ENI %s: %v", eniID, err) - return - } - if len(niResp.NetworkInterfaces) > 0 { - ni := niResp.NetworkInterfaces[0] - if ni.Association != nil && aws.ToString(ni.Association.AssociationId) != "" { - assocID := aws.ToString(ni.Association.AssociationId) - allocID := aws.ToString(ni.Association.AllocationId) - publicIP := aws.ToString(ni.Association.PublicIp) - - _, err = h.EC2.DisassociateAddress(ctx, &ec2.DisassociateAddressInput{ - AssociationId: aws.String(assocID), - }) - if err != nil { - if isErrCode(err, "InvalidAssociationID.NotFound") { - log.Printf("EIP already disassociated from %s", eniID) - } else { - log.Printf("Failed to detach EIP from %s: %v", eniID, err) - return - } - } - _, err = h.EC2.ReleaseAddress(ctx, &ec2.ReleaseAddressInput{ - AllocationId: aws.String(allocID), - }) - if err != nil { - log.Printf("Failed to release EIP %s: %v", allocID, err) - } else { - log.Printf("Released EIP %s from %s", publicIP, eniID) - } - } - } - - h.sweepOrphanEIPs(ctx, az) -} - -// sweepOrphanEIPs releases any EIPs tagged for this AZ that were left behind -// by concurrent attachEIP races or NAT termination without a stop cycle. 
-func (h *Handler) sweepOrphanEIPs(ctx context.Context, az string) { - defer timed("sweep_orphan_eips")() - addrResp, err := h.EC2.DescribeAddresses(ctx, &ec2.DescribeAddressesInput{ - Filters: []ec2types.Filter{ - {Name: aws.String("tag:" + h.NATTagKey), Values: []string{h.NATTagValue}}, - {Name: aws.String("tag:AZ"), Values: []string{az}}, - }, - }) - if err != nil { - log.Printf("Orphan EIP sweep failed for %s: %v", az, err) - return - } - for _, addr := range addrResp.Addresses { - orphanAllocID := aws.ToString(addr.AllocationId) +// releaseEIPs disassociates and releases a list of EIPs. +func (h *Handler) releaseEIPs(ctx context.Context, eips []ec2types.Address) { + for _, addr := range eips { + allocID := aws.ToString(addr.AllocationId) if addr.AssociationId != nil { - h.EC2.DisassociateAddress(ctx, &ec2.DisassociateAddressInput{ + _, err := h.EC2.DisassociateAddress(ctx, &ec2.DisassociateAddressInput{ AssociationId: addr.AssociationId, }) + if err != nil && !isErrCode(err, "InvalidAssociationID.NotFound") { + log.Printf("Failed to disassociate EIP %s: %v", allocID, err) + } } _, err := h.EC2.ReleaseAddress(ctx, &ec2.ReleaseAddressInput{ - AllocationId: aws.String(orphanAllocID), + AllocationId: aws.String(allocID), }) if err != nil { - log.Printf("Failed to release orphan EIP %s: %v", orphanAllocID, err) + log.Printf("Failed to release EIP %s: %v", allocID, err) } else { - log.Printf("Released orphan EIP %s in %s", orphanAllocID, az) + log.Printf("Released EIP %s", allocID) + } + } +} + +// --- ENI helper --- + +func getPublicENI(inst *Instance) *ec2types.InstanceNetworkInterface { + for i := range inst.NetworkInterfaces { + if aws.ToInt32(inst.NetworkInterfaces[i].Attachment.DeviceIndex) == 0 { + return &inst.NetworkInterfaces[i] } } + return nil } // --- Config version --- @@ -371,58 +315,6 @@ func (h *Handler) isCurrentConfig(inst *Instance) bool { return true // no tag to compare — assume current } -func (h *Handler) replaceNAT(ctx context.Context, inst 
*Instance, az, vpc string) string { - defer timed("replace_nat")() - iid := inst.InstanceID - var eniIDs []string - for _, eni := range inst.NetworkInterfaces { - eniIDs = append(eniIDs, aws.ToString(eni.NetworkInterfaceId)) - } - - log.Printf("Replacing outdated NAT %s in %s", iid, az) - h.EC2.TerminateInstances(ctx, &ec2.TerminateInstancesInput{ - InstanceIds: []string{iid}, - }) - - // Wait for termination using polling. - h.waitForTermination(ctx, iid) - - // Wait for ENIs to become available. - if len(eniIDs) > 0 { - for i := 0; i < 60; i++ { - niResp, err := h.EC2.DescribeNetworkInterfaces(ctx, &ec2.DescribeNetworkInterfacesInput{ - NetworkInterfaceIds: eniIDs, - }) - if err == nil { - allAvailable := true - for _, ni := range niResp.NetworkInterfaces { - if ni.Status != ec2types.NetworkInterfaceStatusAvailable { - allAvailable = false - break - } - } - if allAvailable { - break - } - } - h.sleep(2 * time.Second) - } - } - - return h.createNAT(ctx, az, vpc) -} - -func (h *Handler) waitForTermination(ctx context.Context, instanceID string) { - for i := 0; i < 100; i++ { - inst := h.getInstance(ctx, instanceID) - if inst == nil || inst.StateName == "terminated" { - return - } - h.sleep(2 * time.Second) - } - log.Printf("Timeout waiting for %s to terminate", instanceID) -} - // --- NAT lifecycle helpers --- func (h *Handler) resolveAMI(ctx context.Context) string { @@ -442,7 +334,6 @@ func (h *Handler) resolveAMI(ctx context.Context) string { return "" } - // Pick the latest by CreationDate. 
images := resp.Images sort.Slice(images, func(i, j int) bool { return aws.ToString(images[i].CreationDate) > aws.ToString(images[j].CreationDate) @@ -522,63 +413,6 @@ func (h *Handler) createNAT(ctx context.Context, az, vpc string) string { return iid } -func (h *Handler) startNAT(ctx context.Context, inst *Instance, az string) { - defer timed("start_nat")() - iid := inst.InstanceID - if !h.waitForState(ctx, iid, []string{"stopped"}, 90) { - return - } - _, err := h.EC2.StartInstances(ctx, &ec2.StartInstancesInput{ - InstanceIds: []string{iid}, - }) - if err != nil { - log.Printf("Failed to start NAT %s: %v", iid, err) - return - } - log.Printf("Started NAT %s", iid) -} - -func (h *Handler) stopNAT(ctx context.Context, inst *Instance) { - defer timed("stop_nat")() - iid := inst.InstanceID - _, err := h.EC2.StopInstances(ctx, &ec2.StopInstancesInput{ - InstanceIds: []string{iid}, - Force: aws.Bool(true), - }) - if err != nil { - log.Printf("Failed to stop NAT %s: %v", iid, err) - return - } - log.Printf("Stopped NAT %s", iid) -} - -// sweepIdleNATs is a fallback for when classify can't find the triggering -// instance (e.g. it's already gone from the EC2 API after termination). -// It checks every running NAT in the VPC and stops any with no siblings. 
-func (h *Handler) sweepIdleNATs(ctx context.Context, triggerID string) { - defer timed("sweep_idle_nats")() - resp, err := h.EC2.DescribeInstances(ctx, &ec2.DescribeInstancesInput{ - Filters: []ec2types.Filter{ - {Name: aws.String("tag:" + h.NATTagKey), Values: []string{h.NATTagValue}}, - {Name: aws.String("vpc-id"), Values: []string{h.TargetVPC}}, - {Name: aws.String("instance-state-name"), Values: []string{"pending", "running"}}, - }, - }) - if err != nil { - log.Printf("Sweep failed: %v", err) - return - } - for _, r := range resp.Reservations { - for _, i := range r.Instances { - nat := instanceFromAPI(i) - if len(h.findSiblings(ctx, nat.AZ, nat.VpcID, triggerID)) == 0 { - log.Printf("Sweep: no siblings for NAT %s in %s, stopping", nat.InstanceID, nat.AZ) - h.stopNAT(ctx, nat) - } - } - } -} - // --- Cleanup (destroy-time) --- func (h *Handler) cleanupAll(ctx context.Context) { @@ -610,7 +444,7 @@ func (h *Handler) cleanupAll(ctx context.Context) { }) } - // Release EIPs while instances are terminating (overlap the wait). + // Release EIPs. addrResp, err := h.EC2.DescribeAddresses(ctx, &ec2.DescribeAddressesInput{ Filters: []ec2types.Filter{ {Name: aws.String("tag:" + h.NATTagKey), Values: []string{h.NATTagValue}}, @@ -638,22 +472,17 @@ func (h *Handler) cleanupAll(ctx context.Context) { } } - // Wait for instance termination. if len(instanceIDs) > 0 { - for _, iid := range instanceIDs { - h.waitForTermination(ctx, iid) - } - log.Println("All NAT instances terminated") + log.Println("NAT instance termination initiated") } } // isErrCode returns true if the error (or any wrapped error) has the given -// AWS API error code. Works with both smithy APIError and legacy awserr. +// AWS API error code. func isErrCode(err error, code string) bool { var ae smithy.APIError if ok := errors.As(err, &ae); ok { return ae.ErrorCode() == code } - // Fallback: check the error string for SDKs that don't implement APIError. 
return strings.Contains(err.Error(), code) } diff --git a/cmd/lambda/ec2ops_test.go b/cmd/lambda/ec2ops_test.go index 281bcb9..f69bb0f 100644 --- a/cmd/lambda/ec2ops_test.go +++ b/cmd/lambda/ec2ops_test.go @@ -3,26 +3,26 @@ package main import ( "context" "fmt" - "sync/atomic" "testing" "github.com/aws/aws-sdk-go-v2/aws" "github.com/aws/aws-sdk-go-v2/service/ec2" ec2types "github.com/aws/aws-sdk-go-v2/service/ec2/types" + "github.com/aws/smithy-go" ) -// --- classify() --- +// --- resolveAZ() --- -func TestClassify(t *testing.T) { +func TestResolveAZUnit(t *testing.T) { t.Run("instance not found", func(t *testing.T) { mock := &mockEC2{} mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { return describeResponse(), nil } h := newTestHandler(mock) - ignore, isNAT, az, vpc := h.classify(context.Background(), "i-gone") - if !ignore || isNAT || az != "" || vpc != "" { - t.Errorf("expected (true, false, '', ''), got (%v, %v, %q, %q)", ignore, isNAT, az, vpc) + az, vpc := h.resolveAZ(context.Background(), "i-gone") + if az != "" || vpc != "" { + t.Errorf("expected ('', ''), got (%q, %q)", az, vpc) } }) @@ -33,9 +33,9 @@ func TestClassify(t *testing.T) { return describeResponse(inst), nil } h := newTestHandler(mock) - ignore, isNAT, _, _ := h.classify(context.Background(), "i-other") - if !ignore || isNAT { - t.Errorf("expected (true, false), got (%v, %v)", ignore, isNAT) + az, vpc := h.resolveAZ(context.Background(), "i-other") + if az != "" || vpc != "" { + t.Errorf("expected ('', ''), got (%q, %q)", az, vpc) } }) @@ -47,13 +47,13 @@ func TestClassify(t *testing.T) { return describeResponse(inst), nil } h := newTestHandler(mock) - ignore, isNAT, az, vpc := h.classify(context.Background(), "i-ign") - if !ignore || isNAT || az != testAZ || vpc != testVPC { - t.Errorf("expected (true, false, %q, %q), got (%v, %v, %q, %q)", testAZ, testVPC, ignore, isNAT, az, vpc) + az, 
vpc := h.resolveAZ(context.Background(), "i-ign") + if az != "" || vpc != "" { + t.Errorf("expected ('', ''), got (%q, %q)", az, vpc) } }) - t.Run("NAT instance", func(t *testing.T) { + t.Run("NAT instance resolves normally", func(t *testing.T) { mock := &mockEC2{} inst := makeTestInstance("i-nat", "running", testVPC, testAZ, []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}}, nil) @@ -61,9 +61,9 @@ func TestClassify(t *testing.T) { return describeResponse(inst), nil } h := newTestHandler(mock) - ignore, isNAT, az, vpc := h.classify(context.Background(), "i-nat") - if ignore || !isNAT || az != testAZ || vpc != testVPC { - t.Errorf("expected (false, true, %q, %q), got (%v, %v, %q, %q)", testAZ, testVPC, ignore, isNAT, az, vpc) + az, vpc := h.resolveAZ(context.Background(), "i-nat") + if az != testAZ || vpc != testVPC { + t.Errorf("expected (%q, %q), got (%q, %q)", testAZ, testVPC, az, vpc) } }) @@ -75,78 +75,73 @@ func TestClassify(t *testing.T) { return describeResponse(inst), nil } h := newTestHandler(mock) - ignore, isNAT, az, vpc := h.classify(context.Background(), "i-work") - if ignore || isNAT || az != testAZ || vpc != testVPC { - t.Errorf("expected (false, false, %q, %q), got (%v, %v, %q, %q)", testAZ, testVPC, ignore, isNAT, az, vpc) + az, vpc := h.resolveAZ(context.Background(), "i-work") + if az != testAZ || vpc != testVPC { + t.Errorf("expected (%q, %q), got (%q, %q)", testAZ, testVPC, az, vpc) } }) } -// --- waitForState() --- +// --- findWorkloads() --- -func TestWaitForState(t *testing.T) { - t.Run("already in desired state", func(t *testing.T) { - mock := &mockEC2{} - inst := makeTestInstance("i-1", "running", testVPC, testAZ, nil, nil) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(inst), nil - } - h := newTestHandler(mock) - if !h.waitForState(context.Background(), "i-1", 
[]string{"running"}, 10) { - t.Error("expected true") - } - }) - - t.Run("transitions to desired state", func(t *testing.T) { +func TestFindWorkloads(t *testing.T) { + t.Run("returns workload instances", func(t *testing.T) { mock := &mockEC2{} - var idx int32 + work := makeTestInstance("i-work", "running", testVPC, testAZ, + []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("api")}}, nil) mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - i := atomic.AddInt32(&idx, 1) - if i == 1 { - return describeResponse(makeTestInstance("i-1", "pending", testVPC, testAZ, nil, nil)), nil - } - return describeResponse(makeTestInstance("i-1", "running", testVPC, testAZ, nil, nil)), nil + return describeResponse(work), nil } h := newTestHandler(mock) - if !h.waitForState(context.Background(), "i-1", []string{"running"}, 10) { - t.Error("expected true") + wl := h.findWorkloads(context.Background(), testAZ, testVPC) + if len(wl) != 1 || wl[0].InstanceID != "i-work" { + t.Errorf("expected [i-work], got %v", wl) } }) - t.Run("timeout", func(t *testing.T) { + t.Run("excludes NAT and ignored", func(t *testing.T) { mock := &mockEC2{} + work := makeTestInstance("i-work", "running", testVPC, testAZ, + []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("api")}}, nil) + nat := makeTestInstance("i-nat", "running", testVPC, testAZ, + []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}}, nil) + ignored := makeTestInstance("i-ign", "running", testVPC, testAZ, + []ec2types.Tag{{Key: aws.String("nat-zero:ignore"), Value: aws.String("true")}}, nil) mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(makeTestInstance("i-1", "pending", testVPC, testAZ, nil, nil)), nil + return describeResponse(work, nat, ignored), nil } 
h := newTestHandler(mock) - if h.waitForState(context.Background(), "i-1", []string{"running"}, 10) { - t.Error("expected false (timeout)") + wl := h.findWorkloads(context.Background(), testAZ, testVPC) + if len(wl) != 1 || wl[0].InstanceID != "i-work" { + t.Errorf("expected [i-work], got %v", wl) } }) - t.Run("instance disappears", func(t *testing.T) { + t.Run("no workloads", func(t *testing.T) { mock := &mockEC2{} mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { return describeResponse(), nil } h := newTestHandler(mock) - if h.waitForState(context.Background(), "i-gone", []string{"running"}, 10) { - t.Error("expected false") + wl := h.findWorkloads(context.Background(), testAZ, testVPC) + if len(wl) != 0 { + t.Errorf("expected 0 workloads, got %d", len(wl)) } }) } -// --- findNAT() --- +// --- findNATs() --- -func TestFindNAT(t *testing.T) { +func TestFindNATs(t *testing.T) { t.Run("no NATs", func(t *testing.T) { mock := &mockEC2{} mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { return describeResponse(), nil } h := newTestHandler(mock) - if h.findNAT(context.Background(), testAZ, testVPC) != nil { - t.Error("expected nil") + nats := h.findNATs(context.Background(), testAZ, testVPC) + if len(nats) != 0 { + t.Errorf("expected 0, got %d", len(nats)) } }) @@ -157,167 +152,108 @@ func TestFindNAT(t *testing.T) { return describeResponse(nat), nil } h := newTestHandler(mock) - result := h.findNAT(context.Background(), testAZ, testVPC) - if result == nil || result.InstanceID != "i-nat1" { - t.Errorf("expected i-nat1, got %v", result) + nats := h.findNATs(context.Background(), testAZ, testVPC) + if len(nats) != 1 || nats[0].InstanceID != "i-nat1" { + t.Errorf("expected [i-nat1], got %v", nats) } }) - t.Run("deduplicates keeps running", func(t *testing.T) 
{ + t.Run("multiple NATs", func(t *testing.T) { mock := &mockEC2{} - running := makeTestInstance("i-run", "running", testVPC, testAZ, nil, nil) - stopped1 := makeTestInstance("i-stop1", "stopped", testVPC, testAZ, nil, nil) - stopped2 := makeTestInstance("i-stop2", "stopped", testVPC, testAZ, nil, nil) + nat1 := makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, nil) + nat2 := makeTestInstance("i-nat2", "stopped", testVPC, testAZ, nil, nil) mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(running, stopped1, stopped2), nil + return describeResponse(nat1, nat2), nil } h := newTestHandler(mock) - result := h.findNAT(context.Background(), testAZ, testVPC) - if result == nil || result.InstanceID != "i-run" { - t.Errorf("expected i-run, got %v", result) - } - if mock.callCount("TerminateInstances") != 2 { - t.Errorf("expected 2 TerminateInstances calls, got %d", mock.callCount("TerminateInstances")) + nats := h.findNATs(context.Background(), testAZ, testVPC) + if len(nats) != 2 { + t.Errorf("expected 2 NATs, got %d", len(nats)) } }) +} + +// --- findEIPs() --- - t.Run("deduplicates no running keeps first", func(t *testing.T) { +func TestFindEIPs(t *testing.T) { + t.Run("no EIPs", func(t *testing.T) { mock := &mockEC2{} - s1 := makeTestInstance("i-s1", "stopped", testVPC, testAZ, nil, nil) - s2 := makeTestInstance("i-s2", "stopped", testVPC, testAZ, nil, nil) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(s1, s2), nil + mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + return &ec2.DescribeAddressesOutput{}, nil } h := newTestHandler(mock) - result := h.findNAT(context.Background(), 
testAZ, testVPC) - if result == nil || result.InstanceID != "i-s1" { - t.Errorf("expected i-s1, got %v", result) - } - if mock.callCount("TerminateInstances") != 1 { - t.Errorf("expected 1 TerminateInstances call, got %d", mock.callCount("TerminateInstances")) + eips := h.findEIPs(context.Background(), testAZ) + if len(eips) != 0 { + t.Errorf("expected 0, got %d", len(eips)) } }) - t.Run("deduplication handles terminate failure", func(t *testing.T) { + t.Run("returns tagged EIPs", func(t *testing.T) { mock := &mockEC2{} - running := makeTestInstance("i-run", "running", testVPC, testAZ, nil, nil) - extra := makeTestInstance("i-extra", "stopped", testVPC, testAZ, nil, nil) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(running, extra), nil - } - mock.TerminateInstancesFn = func(ctx context.Context, params *ec2.TerminateInstancesInput, optFns ...func(*ec2.Options)) (*ec2.TerminateInstancesOutput, error) { - return nil, fmt.Errorf("UnauthorizedOperation: Not allowed") + mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + return &ec2.DescribeAddressesOutput{ + Addresses: []ec2types.Address{ + {AllocationId: aws.String("eipalloc-1")}, + {AllocationId: aws.String("eipalloc-2")}, + }, + }, nil } h := newTestHandler(mock) - result := h.findNAT(context.Background(), testAZ, testVPC) - if result == nil || result.InstanceID != "i-run" { - t.Errorf("expected i-run despite terminate failure, got %v", result) + eips := h.findEIPs(context.Background(), testAZ) + if len(eips) != 2 { + t.Errorf("expected 2, got %d", len(eips)) } }) } -// --- findSiblings() --- - -func TestFindSiblings(t *testing.T) { - t.Run("no siblings", func(t *testing.T) { - mock := &mockEC2{} - mock.DescribeInstancesFn = func(ctx context.Context, params 
*ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(), nil - } - h := newTestHandler(mock) - sibs := h.findSiblings(context.Background(), testAZ, testVPC, "") - if len(sibs) != 0 { - t.Errorf("expected 0 siblings, got %d", len(sibs)) - } - }) +// --- terminateDuplicateNATs() --- - t.Run("returns workload instances", func(t *testing.T) { +func TestTerminateDuplicateNATs(t *testing.T) { + t.Run("keeps running terminates others", func(t *testing.T) { mock := &mockEC2{} - work := makeTestInstance("i-work", "running", testVPC, testAZ, - []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("api")}}, nil) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(work), nil - } h := newTestHandler(mock) - sibs := h.findSiblings(context.Background(), testAZ, testVPC, "") - if len(sibs) != 1 || sibs[0].InstanceID != "i-work" { - t.Errorf("expected [i-work], got %v", sibs) - } - }) - - t.Run("excludes NAT and ignored", func(t *testing.T) { - mock := &mockEC2{} - work := makeTestInstance("i-work", "running", testVPC, testAZ, - []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("api")}}, nil) - nat := makeTestInstance("i-nat", "running", testVPC, testAZ, - []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}}, nil) - ignored := makeTestInstance("i-ign", "running", testVPC, testAZ, - []ec2types.Tag{{Key: aws.String("nat-zero:ignore"), Value: aws.String("true")}}, nil) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(work, nat, ignored), nil + nat1 := &Instance{InstanceID: "i-nat1", StateName: "running"} + nat2 := &Instance{InstanceID: "i-nat2", StateName: "stopped"} + nat3 := &Instance{InstanceID: "i-nat3", 
StateName: "pending"} + result := h.terminateDuplicateNATs(context.Background(), []*Instance{nat1, nat2, nat3}) + if len(result) != 1 || result[0].InstanceID != "i-nat1" { + t.Errorf("expected [i-nat1], got %v", result) } - h := newTestHandler(mock) - sibs := h.findSiblings(context.Background(), testAZ, testVPC, "") - if len(sibs) != 1 || sibs[0].InstanceID != "i-work" { - t.Errorf("expected [i-work], got %v", sibs) + if mock.callCount("TerminateInstances") != 2 { + t.Errorf("expected 2 TerminateInstances, got %d", mock.callCount("TerminateInstances")) } }) - t.Run("excludes trigger instance", func(t *testing.T) { + t.Run("no running keeps first", func(t *testing.T) { mock := &mockEC2{} - trigger := makeTestInstance("i-dying", "running", testVPC, testAZ, - []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("api")}}, nil) - other := makeTestInstance("i-alive", "running", testVPC, testAZ, - []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}}, nil) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(trigger, other), nil - } h := newTestHandler(mock) - sibs := h.findSiblings(context.Background(), testAZ, testVPC, "i-dying") - if len(sibs) != 1 || sibs[0].InstanceID != "i-alive" { - t.Errorf("expected [i-alive], got %v", sibs) - } - }) - - t.Run("excludes trigger when it is only instance", func(t *testing.T) { - mock := &mockEC2{} - trigger := makeTestInstance("i-dying", "running", testVPC, testAZ, - []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("api")}}, nil) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(trigger), nil + nat1 := &Instance{InstanceID: "i-s1", StateName: "stopped"} + nat2 := &Instance{InstanceID: "i-s2", StateName: "stopped"} + result := 
h.terminateDuplicateNATs(context.Background(), []*Instance{nat1, nat2}) + if len(result) != 1 || result[0].InstanceID != "i-s1" { + t.Errorf("expected [i-s1], got %v", result) } - h := newTestHandler(mock) - sibs := h.findSiblings(context.Background(), testAZ, testVPC, "i-dying") - if len(sibs) != 0 { - t.Errorf("expected 0 siblings, got %d", len(sibs)) + if mock.callCount("TerminateInstances") != 1 { + t.Errorf("expected 1 TerminateInstances, got %d", mock.callCount("TerminateInstances")) } }) } -// --- attachEIP() --- +// --- allocateAndAttachEIP() --- -func TestAttachEIP(t *testing.T) { +func TestAllocateAndAttachEIP(t *testing.T) { t.Run("happy path", func(t *testing.T) { mock := &mockEC2{} - eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil - } mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { return &ec2.AllocateAddressOutput{AllocationId: aws.String("eipalloc-1"), PublicIp: aws.String("1.2.3.4")}, nil } - mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { - return &ec2.DescribeNetworkInterfacesOutput{ - NetworkInterfaces: []ec2types.NetworkInterface{{ - NetworkInterfaceId: aws.String("eni-pub1"), - }}, - }, nil - } mock.AssociateAddressFn = func(ctx context.Context, params *ec2.AssociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AssociateAddressOutput, error) { return &ec2.AssociateAddressOutput{}, nil } h := newTestHandler(mock) - h.attachEIP(context.Background(), "i-nat1", testAZ) + eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) + 
nat := &Instance{InstanceID: "i-nat1", NetworkInterfaces: []ec2types.InstanceNetworkInterface{eni}} + h.allocateAndAttachEIP(context.Background(), nat, testAZ) if mock.callCount("AllocateAddress") != 1 { t.Error("expected AllocateAddress") } @@ -326,15 +262,23 @@ func TestAttachEIP(t *testing.T) { } }) - t.Run("already has EIP is noop", func(t *testing.T) { + t.Run("no public ENI", func(t *testing.T) { mock := &mockEC2{} - assoc := &ec2types.InstanceNetworkInterfaceAssociation{PublicIp: aws.String("5.6.7.8")} - eni := makeENI("eni-pub1", 0, "10.0.1.10", assoc) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil + h := newTestHandler(mock) + nat := &Instance{InstanceID: "i-nat1", NetworkInterfaces: nil} + h.allocateAndAttachEIP(context.Background(), nat, testAZ) + if mock.callCount("AllocateAddress") != 0 { + t.Error("expected no AllocateAddress when no ENI") } + }) + + t.Run("ENI already has EIP", func(t *testing.T) { + mock := &mockEC2{} h := newTestHandler(mock) - h.attachEIP(context.Background(), "i-nat1", testAZ) + assoc := &ec2types.InstanceNetworkInterfaceAssociation{PublicIp: aws.String("5.6.7.8")} + eni := makeENI("eni-pub1", 0, "10.0.1.10", assoc) + nat := &Instance{InstanceID: "i-nat1", NetworkInterfaces: []ec2types.InstanceNetworkInterface{eni}} + h.allocateAndAttachEIP(context.Background(), nat, testAZ) if mock.callCount("AllocateAddress") != 0 { t.Error("expected no AllocateAddress when ENI already has EIP") } @@ -342,138 +286,47 @@ func TestAttachEIP(t *testing.T) { t.Run("allocation fails", func(t *testing.T) { mock := &mockEC2{} - eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) 
(*ec2.DescribeInstancesOutput, error) { - return describeResponse(makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil - } mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { return nil, fmt.Errorf("AddressLimitExceeded: Too many EIPs") } h := newTestHandler(mock) - h.attachEIP(context.Background(), "i-nat1", testAZ) + eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) + nat := &Instance{InstanceID: "i-nat1", NetworkInterfaces: []ec2types.InstanceNetworkInterface{eni}} + h.allocateAndAttachEIP(context.Background(), nat, testAZ) if mock.callCount("AssociateAddress") != 0 { - t.Error("expected AssociateAddress NOT to be called") + t.Error("expected no AssociateAddress after allocation failure") } }) t.Run("association fails releases EIP", func(t *testing.T) { mock := &mockEC2{} - eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil - } mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { return &ec2.AllocateAddressOutput{AllocationId: aws.String("eipalloc-1"), PublicIp: aws.String("1.2.3.4")}, nil } - mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { - return &ec2.DescribeNetworkInterfacesOutput{ - NetworkInterfaces: []ec2types.NetworkInterface{{ - NetworkInterfaceId: aws.String("eni-pub1"), - }}, - }, nil - } mock.AssociateAddressFn = func(ctx context.Context, params *ec2.AssociateAddressInput, optFns 
...func(*ec2.Options)) (*ec2.AssociateAddressOutput, error) { return nil, fmt.Errorf("InvalidParameterValue: Bad param") } h := newTestHandler(mock) - h.attachEIP(context.Background(), "i-nat1", testAZ) - if mock.callCount("ReleaseAddress") != 1 { - t.Errorf("expected 1 ReleaseAddress call, got %d", mock.callCount("ReleaseAddress")) - } - }) - - t.Run("no public ENI", func(t *testing.T) { - mock := &mockEC2{} - // Instance with no ENIs - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, nil)), nil - } - h := newTestHandler(mock) - h.attachEIP(context.Background(), "i-nat1", testAZ) - if mock.callCount("AllocateAddress") != 0 { - t.Error("expected no AllocateAddress when no public ENI") - } - }) - - t.Run("race detected releases allocated EIP", func(t *testing.T) { - mock := &mockEC2{} eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil - } - mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { - return &ec2.AllocateAddressOutput{AllocationId: aws.String("eipalloc-1"), PublicIp: aws.String("1.2.3.4")}, nil - } - // Re-check shows another invocation already attached an EIP - mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { - return &ec2.DescribeNetworkInterfacesOutput{ - NetworkInterfaces: []ec2types.NetworkInterface{{ - NetworkInterfaceId: 
aws.String("eni-pub1"), - Association: &ec2types.NetworkInterfaceAssociation{ - PublicIp: aws.String("9.9.9.9"), - }, - }}, - }, nil - } - h := newTestHandler(mock) - h.attachEIP(context.Background(), "i-nat1", testAZ) - if mock.callCount("AssociateAddress") != 0 { - t.Error("expected no AssociateAddress when race detected") - } + nat := &Instance{InstanceID: "i-nat1", NetworkInterfaces: []ec2types.InstanceNetworkInterface{eni}} + h.allocateAndAttachEIP(context.Background(), nat, testAZ) if mock.callCount("ReleaseAddress") != 1 { - t.Errorf("expected 1 ReleaseAddress call, got %d", mock.callCount("ReleaseAddress")) - } - }) - - t.Run("describe ENI failure still associates", func(t *testing.T) { - mock := &mockEC2{} - eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil - } - mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { - return &ec2.AllocateAddressOutput{AllocationId: aws.String("eipalloc-1"), PublicIp: aws.String("1.2.3.4")}, nil - } - mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { - return nil, fmt.Errorf("Throttling: Rate exceeded") - } - mock.AssociateAddressFn = func(ctx context.Context, params *ec2.AssociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AssociateAddressOutput, error) { - return &ec2.AssociateAddressOutput{}, nil - } - h := newTestHandler(mock) - h.attachEIP(context.Background(), "i-nat1", testAZ) - if mock.callCount("AssociateAddress") != 1 { - t.Error("expected AssociateAddress to proceed despite describe failure") 
+ t.Errorf("expected ReleaseAddress=1, got %d", mock.callCount("ReleaseAddress")) } }) } -// --- detachEIP() --- +// --- releaseEIPs() --- -func TestDetachEIP(t *testing.T) { - t.Run("happy path", func(t *testing.T) { +func TestReleaseEIPs(t *testing.T) { + t.Run("releases with disassociate", func(t *testing.T) { mock := &mockEC2{} - eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(makeTestInstance("i-nat1", "stopped", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil - } - mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { - return &ec2.DescribeNetworkInterfacesOutput{ - NetworkInterfaces: []ec2types.NetworkInterface{{ - NetworkInterfaceId: aws.String("eni-pub1"), - Association: &ec2types.NetworkInterfaceAssociation{ - AssociationId: aws.String("eipassoc-1"), - AllocationId: aws.String("eipalloc-1"), - PublicIp: aws.String("1.2.3.4"), - }, - }}, - }, nil - } - mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { - return &ec2.DescribeAddressesOutput{Addresses: []ec2types.Address{}}, nil - } h := newTestHandler(mock) - h.detachEIP(context.Background(), "i-nat1", testAZ) + eips := []ec2types.Address{{ + AllocationId: aws.String("eipalloc-1"), + AssociationId: aws.String("eipassoc-1"), + }} + h.releaseEIPs(context.Background(), eips) if mock.callCount("DisassociateAddress") != 1 { t.Error("expected DisassociateAddress") } @@ -482,83 +335,33 @@ func TestDetachEIP(t *testing.T) { } }) - t.Run("no association is noop", func(t *testing.T) { + t.Run("releases without association", func(t *testing.T) { mock := &mockEC2{} - eni := 
makeENI("eni-pub1", 0, "10.0.1.10", nil) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(makeTestInstance("i-nat1", "stopped", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil - } - mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { - return &ec2.DescribeNetworkInterfacesOutput{ - NetworkInterfaces: []ec2types.NetworkInterface{{ - NetworkInterfaceId: aws.String("eni-pub1"), - }}, - }, nil - } - mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { - return &ec2.DescribeAddressesOutput{Addresses: []ec2types.Address{}}, nil - } h := newTestHandler(mock) - h.detachEIP(context.Background(), "i-nat1", testAZ) + eips := []ec2types.Address{{AllocationId: aws.String("eipalloc-1")}} + h.releaseEIPs(context.Background(), eips) if mock.callCount("DisassociateAddress") != 0 { - t.Error("expected DisassociateAddress NOT to be called") - } - }) - - t.Run("cleans up orphaned EIPs", func(t *testing.T) { - mock := &mockEC2{} - eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(makeTestInstance("i-nat1", "stopped", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil - } - mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { - return &ec2.DescribeNetworkInterfacesOutput{ - NetworkInterfaces: []ec2types.NetworkInterface{{ - NetworkInterfaceId: aws.String("eni-pub1"), - 
Association: &ec2types.NetworkInterfaceAssociation{ - AssociationId: aws.String("eipassoc-1"), - AllocationId: aws.String("eipalloc-1"), - PublicIp: aws.String("1.2.3.4"), - }, - }}, - }, nil - } - mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { - return &ec2.DescribeAddressesOutput{ - Addresses: []ec2types.Address{{ - AllocationId: aws.String("eipalloc-orphan"), - }}, - }, nil + t.Error("expected no DisassociateAddress") } - h := newTestHandler(mock) - h.detachEIP(context.Background(), "i-nat1", testAZ) - // 1 from current association + 1 from orphan sweep - if mock.callCount("ReleaseAddress") != 2 { - t.Errorf("expected 2 ReleaseAddress calls, got %d", mock.callCount("ReleaseAddress")) + if mock.callCount("ReleaseAddress") != 1 { + t.Error("expected ReleaseAddress") } }) - t.Run("orphan sweep error is non-fatal", func(t *testing.T) { + t.Run("handles InvalidAssociationID.NotFound", func(t *testing.T) { mock := &mockEC2{} - eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(makeTestInstance("i-nat1", "stopped", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil - } - mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { - return &ec2.DescribeNetworkInterfacesOutput{ - NetworkInterfaces: []ec2types.NetworkInterface{{ - NetworkInterfaceId: aws.String("eni-pub1"), - }}, - }, nil - } - mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { - return nil, fmt.Errorf("Throttling: Rate exceeded") + mock.DisassociateAddressFn = func(ctx 
context.Context, params *ec2.DisassociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.DisassociateAddressOutput, error) { + return nil, &apiError{code: "InvalidAssociationID.NotFound", message: "Not found"} } h := newTestHandler(mock) - // Should not panic - h.detachEIP(context.Background(), "i-nat1", testAZ) - if mock.callCount("DisassociateAddress") != 0 { - t.Error("expected no DisassociateAddress when no ENI association") + eips := []ec2types.Address{{ + AllocationId: aws.String("eipalloc-1"), + AssociationId: aws.String("eipassoc-stale"), + }} + h.releaseEIPs(context.Background(), eips) + // Should still release despite disassociate NotFound + if mock.callCount("ReleaseAddress") != 1 { + t.Errorf("expected ReleaseAddress=1, got %d", mock.callCount("ReleaseAddress")) } }) } @@ -581,7 +384,7 @@ func TestCreateNAT(t *testing.T) { } } - t.Run("happy path without inline EIP", func(t *testing.T) { + t.Run("happy path", func(t *testing.T) { mock := &mockEC2{} setupLTAndAMI(mock) mock.DescribeImagesFn = func(ctx context.Context, params *ec2.DescribeImagesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeImagesOutput, error) { @@ -603,10 +406,6 @@ func TestCreateNAT(t *testing.T) { if result != "i-new1" { t.Errorf("expected i-new1, got %s", result) } - // No inline EIP — that's handled by EventBridge now - if mock.callCount("AllocateAddress") != 0 { - t.Error("expected no AllocateAddress (EIP managed via EventBridge)") - } }) t.Run("no launch template", func(t *testing.T) { @@ -621,205 +420,88 @@ func TestCreateNAT(t *testing.T) { } }) - t.Run("AMI lookup fails uses template default", func(t *testing.T) { + t.Run("run instances fails", func(t *testing.T) { mock := &mockEC2{} setupLTAndAMI(mock) mock.DescribeImagesFn = func(ctx context.Context, params *ec2.DescribeImagesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeImagesOutput, error) { - return nil, fmt.Errorf("InvalidParameterValue: Bad filter") + return &ec2.DescribeImagesOutput{Images: 
[]ec2types.Image{}}, nil } mock.RunInstancesFn = func(ctx context.Context, params *ec2.RunInstancesInput, optFns ...func(*ec2.Options)) (*ec2.RunInstancesOutput, error) { - if params.ImageId != nil { - t.Error("expected no ImageId when AMI lookup fails") - } - return &ec2.RunInstancesOutput{ - Instances: []ec2types.Instance{{InstanceId: aws.String("i-new2")}}, - }, nil + return nil, fmt.Errorf("InsufficientInstanceCapacity: No capacity") } h := newTestHandler(mock) result := h.createNAT(context.Background(), testAZ, testVPC) - if result != "i-new2" { - t.Errorf("expected i-new2, got %s", result) + if result != "" { + t.Errorf("expected empty, got %s", result) } }) - t.Run("no images found uses template default", func(t *testing.T) { + t.Run("config version tag included", func(t *testing.T) { mock := &mockEC2{} setupLTAndAMI(mock) mock.DescribeImagesFn = func(ctx context.Context, params *ec2.DescribeImagesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeImagesOutput, error) { return &ec2.DescribeImagesOutput{Images: []ec2types.Image{}}, nil } mock.RunInstancesFn = func(ctx context.Context, params *ec2.RunInstancesInput, optFns ...func(*ec2.Options)) (*ec2.RunInstancesOutput, error) { + if len(params.TagSpecifications) == 0 { + t.Error("expected TagSpecifications") + } else { + found := false + for _, tag := range params.TagSpecifications[0].Tags { + if aws.ToString(tag.Key) == "ConfigVersion" && aws.ToString(tag.Value) == "abc123" { + found = true + } + } + if !found { + t.Error("expected ConfigVersion tag") + } + } return &ec2.RunInstancesOutput{ - Instances: []ec2types.Instance{{InstanceId: aws.String("i-new3")}}, + Instances: []ec2types.Instance{{InstanceId: aws.String("i-tagged")}}, }, nil } h := newTestHandler(mock) - result := h.createNAT(context.Background(), testAZ, testVPC) - if result != "i-new3" { - t.Errorf("expected i-new3, got %s", result) - } - }) - - t.Run("run instances fails", func(t *testing.T) { - mock := &mockEC2{} - setupLTAndAMI(mock) - 
mock.DescribeImagesFn = func(ctx context.Context, params *ec2.DescribeImagesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeImagesOutput, error) { - return &ec2.DescribeImagesOutput{Images: []ec2types.Image{}}, nil - } - mock.RunInstancesFn = func(ctx context.Context, params *ec2.RunInstancesInput, optFns ...func(*ec2.Options)) (*ec2.RunInstancesOutput, error) { - return nil, fmt.Errorf("InsufficientInstanceCapacity: No capacity") - } - h := newTestHandler(mock) - result := h.createNAT(context.Background(), testAZ, testVPC) - if result != "" { - t.Errorf("expected empty, got %s", result) - } + h.ConfigVersion = "abc123" + h.createNAT(context.Background(), testAZ, testVPC) }) } -// --- startNAT() --- +// --- isCurrentConfig() --- -func TestStartNAT(t *testing.T) { - t.Run("happy path", func(t *testing.T) { - mock := &mockEC2{} - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(makeTestInstance("i-nat1", "stopped", testVPC, testAZ, nil, nil)), nil - } - mock.StartInstancesFn = func(ctx context.Context, params *ec2.StartInstancesInput, optFns ...func(*ec2.Options)) (*ec2.StartInstancesOutput, error) { - return &ec2.StartInstancesOutput{}, nil - } - h := newTestHandler(mock) - h.startNAT(context.Background(), &Instance{InstanceID: "i-nat1"}, testAZ) - if mock.callCount("StartInstances") != 1 { - t.Error("expected StartInstances to be called") - } - // No inline EIP — that's handled by EventBridge now - if mock.callCount("AllocateAddress") != 0 { - t.Error("expected no AllocateAddress (EIP managed via EventBridge)") - } - }) - - t.Run("wait timeout", func(t *testing.T) { - mock := &mockEC2{} - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(makeTestInstance("i-nat1", "stopping", testVPC, testAZ, 
nil, nil)), nil - } - h := newTestHandler(mock) - h.startNAT(context.Background(), &Instance{InstanceID: "i-nat1"}, testAZ) - if mock.callCount("StartInstances") != 0 { - t.Error("expected StartInstances NOT to be called after timeout") - } - }) -} - -// --- stopNAT() --- - -func TestStopNAT(t *testing.T) { - t.Run("happy path uses force stop", func(t *testing.T) { - mock := &mockEC2{} - mock.StopInstancesFn = func(ctx context.Context, params *ec2.StopInstancesInput, optFns ...func(*ec2.Options)) (*ec2.StopInstancesOutput, error) { - if params.Force == nil || !*params.Force { - t.Error("expected Force=true in StopInstances") - } - return &ec2.StopInstancesOutput{}, nil - } - h := newTestHandler(mock) - h.stopNAT(context.Background(), &Instance{InstanceID: "i-nat1"}) - if mock.callCount("StopInstances") != 1 { - t.Error("expected StopInstances to be called") - } - // No inline EIP release — that's handled by EventBridge now - if mock.callCount("DisassociateAddress") != 0 { - t.Error("expected no DisassociateAddress (EIP managed via EventBridge)") +func TestIsCurrentConfig(t *testing.T) { + t.Run("matching config", func(t *testing.T) { + h := newTestHandler(nil) + h.ConfigVersion = "abc123" + inst := &Instance{Tags: []ec2types.Tag{{Key: aws.String("ConfigVersion"), Value: aws.String("abc123")}}} + if !h.isCurrentConfig(inst) { + t.Error("expected true") } }) - t.Run("stop fails", func(t *testing.T) { - mock := &mockEC2{} - mock.StopInstancesFn = func(ctx context.Context, params *ec2.StopInstancesInput, optFns ...func(*ec2.Options)) (*ec2.StopInstancesOutput, error) { - return nil, fmt.Errorf("IncorrectInstanceState: Already stopping") - } - h := newTestHandler(mock) - h.stopNAT(context.Background(), &Instance{InstanceID: "i-nat1"}) - }) -} - -// --- sweepIdleNATs() --- - -func TestSweepIdleNATs(t *testing.T) { - natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} - - t.Run("stops NAT with no siblings", func(t *testing.T) { - 
mock := &mockEC2{} - natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil) - var callIdx int32 - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - idx := atomic.AddInt32(&callIdx, 1) - if idx == 1 { - // sweep query: find running NATs - return describeResponse(natInst), nil - } - // findSiblings query: no siblings - return describeResponse(), nil - } - h := newTestHandler(mock) - h.sweepIdleNATs(context.Background(), "i-trigger") - if mock.callCount("StopInstances") != 1 { - t.Errorf("expected StopInstances to be called once, got %d", mock.callCount("StopInstances")) - } - }) - - t.Run("keeps NAT with siblings", func(t *testing.T) { - mock := &mockEC2{} - natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil) - sibInst := makeTestInstance("i-sib1", "running", testVPC, testAZ, - []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}}, nil) - var callIdx int32 - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - idx := atomic.AddInt32(&callIdx, 1) - if idx == 1 { - return describeResponse(natInst), nil - } - return describeResponse(sibInst), nil - } - h := newTestHandler(mock) - h.sweepIdleNATs(context.Background(), "i-trigger") - if mock.callCount("StopInstances") != 0 { - t.Error("expected StopInstances NOT to be called") + t.Run("mismatched config", func(t *testing.T) { + h := newTestHandler(nil) + h.ConfigVersion = "abc123" + inst := &Instance{Tags: []ec2types.Tag{{Key: aws.String("ConfigVersion"), Value: aws.String("old456")}}} + if h.isCurrentConfig(inst) { + t.Error("expected false") } }) - t.Run("excludes trigger from siblings", func(t *testing.T) { - mock := &mockEC2{} - natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil) - // The trigger instance still 
appears as running (EC2 eventual consistency) - triggerInst := makeTestInstance("i-trigger", "running", testVPC, testAZ, - []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}}, nil) - var callIdx int32 - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - idx := atomic.AddInt32(&callIdx, 1) - if idx == 1 { - return describeResponse(natInst), nil - } - // findSiblings returns the trigger instance (but it should be excluded) - return describeResponse(triggerInst), nil - } - h := newTestHandler(mock) - h.sweepIdleNATs(context.Background(), "i-trigger") - if mock.callCount("StopInstances") != 1 { - t.Errorf("expected StopInstances (trigger should be excluded), got %d", mock.callCount("StopInstances")) + t.Run("no tag assumes current", func(t *testing.T) { + h := newTestHandler(nil) + h.ConfigVersion = "abc123" + inst := &Instance{Tags: []ec2types.Tag{{Key: aws.String("Name"), Value: aws.String("nat")}}} + if !h.isCurrentConfig(inst) { + t.Error("expected true — missing tag means nothing to compare") } }) - t.Run("no running NATs is noop", func(t *testing.T) { - mock := &mockEC2{} - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(), nil - } - h := newTestHandler(mock) - h.sweepIdleNATs(context.Background(), "i-trigger") - if mock.callCount("StopInstances") != 0 { - t.Error("expected no StopInstances calls") + t.Run("empty config version skips check", func(t *testing.T) { + h := newTestHandler(nil) + h.ConfigVersion = "" + inst := &Instance{Tags: []ec2types.Tag{}} + if !h.isCurrentConfig(inst) { + t.Error("expected true") } }) } @@ -831,13 +513,8 @@ func TestCleanupAll(t *testing.T) { mock := &mockEC2{} nat1 := makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, nil) nat2 := makeTestInstance("i-nat2", 
"stopped", testVPC, testAZ, nil, nil) - var describeIdx int32 mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - idx := atomic.AddInt32(&describeIdx, 1) - if idx == 1 { - return describeResponse(nat1, nat2), nil - } - return describeResponse(makeTestInstance("i-nat1", "terminated", testVPC, testAZ, nil, nil)), nil + return describeResponse(nat1, nat2), nil } mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { return &ec2.DescribeAddressesOutput{ @@ -850,30 +527,13 @@ func TestCleanupAll(t *testing.T) { h := newTestHandler(mock) h.cleanupAll(context.Background()) if mock.callCount("TerminateInstances") != 1 { - t.Errorf("expected 1 TerminateInstances call, got %d", mock.callCount("TerminateInstances")) + t.Errorf("expected 1 TerminateInstances, got %d", mock.callCount("TerminateInstances")) } if mock.callCount("DisassociateAddress") != 1 { - t.Error("expected DisassociateAddress to be called") - } - if mock.callCount("ReleaseAddress") != 1 { - t.Error("expected ReleaseAddress to be called") - } - }) - - t.Run("no instances still cleans EIPs", func(t *testing.T) { - mock := &mockEC2{} - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(), nil - } - mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { - return &ec2.DescribeAddressesOutput{ - Addresses: []ec2types.Address{{AllocationId: aws.String("eipalloc-1")}}, - }, nil + t.Error("expected DisassociateAddress") } - h := newTestHandler(mock) - h.cleanupAll(context.Background()) if mock.callCount("ReleaseAddress") != 1 { - t.Error("expected ReleaseAddress to be 
called") + t.Error("expected ReleaseAddress") } }) @@ -888,267 +548,41 @@ func TestCleanupAll(t *testing.T) { h := newTestHandler(mock) h.cleanupAll(context.Background()) if mock.callCount("TerminateInstances") != 0 { - t.Error("expected no TerminateInstances calls") - } - }) - - t.Run("EIP release failure continues", func(t *testing.T) { - mock := &mockEC2{} - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(), nil - } - mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { - return &ec2.DescribeAddressesOutput{ - Addresses: []ec2types.Address{ - {AllocationId: aws.String("eipalloc-1")}, - {AllocationId: aws.String("eipalloc-2")}, - }, - }, nil - } - var releaseIdx int32 - mock.ReleaseAddressFn = func(ctx context.Context, params *ec2.ReleaseAddressInput, optFns ...func(*ec2.Options)) (*ec2.ReleaseAddressOutput, error) { - idx := atomic.AddInt32(&releaseIdx, 1) - if idx == 1 { - return nil, fmt.Errorf("InvalidAddress.NotFound: Not found") - } - return &ec2.ReleaseAddressOutput{}, nil - } - h := newTestHandler(mock) - h.cleanupAll(context.Background()) - if mock.callCount("ReleaseAddress") != 2 { - t.Errorf("expected 2 ReleaseAddress calls, got %d", mock.callCount("ReleaseAddress")) + t.Error("expected no TerminateInstances") } }) } -// --- isCurrentConfig() --- +// --- isErrCode() --- -func TestIsCurrentConfig(t *testing.T) { - t.Run("matching config", func(t *testing.T) { - h := newTestHandler(nil) - h.ConfigVersion = "abc123" - inst := &Instance{Tags: []ec2types.Tag{{Key: aws.String("ConfigVersion"), Value: aws.String("abc123")}}} - if !h.isCurrentConfig(inst) { - t.Error("expected true") - } - }) - - t.Run("mismatched config", func(t *testing.T) { - h := newTestHandler(nil) - h.ConfigVersion = "abc123" - inst := 
&Instance{Tags: []ec2types.Tag{{Key: aws.String("ConfigVersion"), Value: aws.String("old456")}}} - if h.isCurrentConfig(inst) { - t.Error("expected false") - } - }) - - t.Run("no tag assumes current", func(t *testing.T) { - h := newTestHandler(nil) - h.ConfigVersion = "abc123" - inst := &Instance{Tags: []ec2types.Tag{{Key: aws.String("Name"), Value: aws.String("nat")}}} - if !h.isCurrentConfig(inst) { - t.Error("expected true — missing tag means nothing to compare") - } - }) - - t.Run("no tags at all assumes current", func(t *testing.T) { - h := newTestHandler(nil) - h.ConfigVersion = "abc123" - inst := &Instance{Tags: []ec2types.Tag{}} - if !h.isCurrentConfig(inst) { - t.Error("expected true — missing tag means nothing to compare") - } - }) - - t.Run("empty config version skips check", func(t *testing.T) { - h := newTestHandler(nil) - h.ConfigVersion = "" - inst := &Instance{Tags: []ec2types.Tag{}} - if !h.isCurrentConfig(inst) { - t.Error("expected true") - } - }) +// apiError implements smithy.APIError for test use. 
+type apiError struct { + code string + message string } -// --- replaceNAT() --- - -func TestReplaceNAT(t *testing.T) { - setupLTAndAMI := func(mock *mockEC2) { - mock.DescribeLaunchTemplatesFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplatesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplatesOutput, error) { - return &ec2.DescribeLaunchTemplatesOutput{ - LaunchTemplates: []ec2types.LaunchTemplate{{LaunchTemplateId: aws.String("lt-123")}}, - }, nil - } - mock.DescribeLaunchTemplateVersionsFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplateVersionsInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplateVersionsOutput, error) { - return &ec2.DescribeLaunchTemplateVersionsOutput{ - LaunchTemplateVersions: []ec2types.LaunchTemplateVersion{{ - LaunchTemplateId: aws.String("lt-123"), VersionNumber: aws.Int64(1), - }}, - }, nil - } - mock.DescribeImagesFn = func(ctx context.Context, params *ec2.DescribeImagesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeImagesOutput, error) { - return &ec2.DescribeImagesOutput{Images: []ec2types.Image{}}, nil - } - mock.RunInstancesFn = func(ctx context.Context, params *ec2.RunInstancesInput, optFns ...func(*ec2.Options)) (*ec2.RunInstancesOutput, error) { - return &ec2.RunInstancesOutput{ - Instances: []ec2types.Instance{{InstanceId: aws.String("i-new")}}, - }, nil - } - } - - t.Run("happy path", func(t *testing.T) { - mock := &mockEC2{} - setupLTAndAMI(mock) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(makeTestInstance("i-old", "terminated", testVPC, testAZ, nil, nil)), nil - } - mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { - return &ec2.DescribeNetworkInterfacesOutput{ - NetworkInterfaces: 
[]ec2types.NetworkInterface{{ - NetworkInterfaceId: aws.String("eni-1"), - Status: ec2types.NetworkInterfaceStatusAvailable, - }}, - }, nil - } - h := newTestHandler(mock) - eni := makeENI("eni-1", 0, "10.0.1.10", nil) - inst := &Instance{ - InstanceID: "i-old", - StateName: "running", - NetworkInterfaces: []ec2types.InstanceNetworkInterface{eni}, - } - result := h.replaceNAT(context.Background(), inst, testAZ, testVPC) - if result != "i-new" { - t.Errorf("expected i-new, got %s", result) - } - if mock.callCount("TerminateInstances") != 1 { - t.Error("expected TerminateInstances to be called") - } - }) - - t.Run("ENI wait polls until available", func(t *testing.T) { - mock := &mockEC2{} - setupLTAndAMI(mock) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(makeTestInstance("i-old", "terminated", testVPC, testAZ, nil, nil)), nil - } - var niIdx int32 - mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { - idx := atomic.AddInt32(&niIdx, 1) - if idx == 1 { - return &ec2.DescribeNetworkInterfacesOutput{ - NetworkInterfaces: []ec2types.NetworkInterface{ - {NetworkInterfaceId: aws.String("eni-1"), Status: ec2types.NetworkInterfaceStatusInUse}, - {NetworkInterfaceId: aws.String("eni-2"), Status: ec2types.NetworkInterfaceStatusInUse}, - }, - }, nil - } - return &ec2.DescribeNetworkInterfacesOutput{ - NetworkInterfaces: []ec2types.NetworkInterface{ - {NetworkInterfaceId: aws.String("eni-1"), Status: ec2types.NetworkInterfaceStatusAvailable}, - {NetworkInterfaceId: aws.String("eni-2"), Status: ec2types.NetworkInterfaceStatusAvailable}, - }, - }, nil - } - h := newTestHandler(mock) - eni1 := makeENI("eni-1", 0, "10.0.1.10", nil) - eni2 := makeENI("eni-2", 1, "10.0.1.11", nil) - inst := &Instance{ - InstanceID: 
"i-old", - StateName: "running", - NetworkInterfaces: []ec2types.InstanceNetworkInterface{eni1, eni2}, - } - result := h.replaceNAT(context.Background(), inst, testAZ, testVPC) - if result != "i-new" { - t.Errorf("expected i-new, got %s", result) - } - if mock.callCount("DescribeNetworkInterfaces") != 2 { - t.Errorf("expected 2 DescribeNetworkInterfaces calls, got %d", mock.callCount("DescribeNetworkInterfaces")) - } - }) - - t.Run("no ENIs skips wait", func(t *testing.T) { - mock := &mockEC2{} - setupLTAndAMI(mock) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(makeTestInstance("i-old", "terminated", testVPC, testAZ, nil, nil)), nil - } - h := newTestHandler(mock) - inst := &Instance{ - InstanceID: "i-old", - StateName: "running", - NetworkInterfaces: nil, - } - result := h.replaceNAT(context.Background(), inst, testAZ, testVPC) - if result != "i-new" { - t.Errorf("expected i-new, got %s", result) - } - if mock.callCount("DescribeNetworkInterfaces") != 0 { - t.Error("expected no DescribeNetworkInterfaces calls") - } - }) -} +func (e *apiError) Error() string { return e.message } +func (e *apiError) ErrorCode() string { return e.code } +func (e *apiError) ErrorMessage() string { return e.message } +func (e *apiError) ErrorFault() smithy.ErrorFault { return smithy.FaultServer } -// --- createNAT() config tag --- +var _ smithy.APIError = (*apiError)(nil) -func TestCreateNATConfigTag(t *testing.T) { - setupLTAndAMI := func(mock *mockEC2) { - mock.DescribeLaunchTemplatesFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplatesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplatesOutput, error) { - return &ec2.DescribeLaunchTemplatesOutput{ - LaunchTemplates: []ec2types.LaunchTemplate{{LaunchTemplateId: aws.String("lt-123")}}, - }, nil - } - mock.DescribeLaunchTemplateVersionsFn = func(ctx context.Context, 
params *ec2.DescribeLaunchTemplateVersionsInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplateVersionsOutput, error) { - return &ec2.DescribeLaunchTemplateVersionsOutput{ - LaunchTemplateVersions: []ec2types.LaunchTemplateVersion{{ - LaunchTemplateId: aws.String("lt-123"), VersionNumber: aws.Int64(1), - }}, - }, nil - } - mock.DescribeImagesFn = func(ctx context.Context, params *ec2.DescribeImagesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeImagesOutput, error) { - return &ec2.DescribeImagesOutput{Images: []ec2types.Image{}}, nil +func TestIsErrCode(t *testing.T) { + t.Run("smithy API error", func(t *testing.T) { + err := &apiError{code: "InvalidAssociationID.NotFound", message: "not found"} + if !isErrCode(err, "InvalidAssociationID.NotFound") { + t.Error("expected true") } - } - - t.Run("includes config version tag", func(t *testing.T) { - mock := &mockEC2{} - setupLTAndAMI(mock) - mock.RunInstancesFn = func(ctx context.Context, params *ec2.RunInstancesInput, optFns ...func(*ec2.Options)) (*ec2.RunInstancesOutput, error) { - if len(params.TagSpecifications) == 0 { - t.Error("expected TagSpecifications") - } else { - found := false - for _, tag := range params.TagSpecifications[0].Tags { - if aws.ToString(tag.Key) == "ConfigVersion" && aws.ToString(tag.Value) == "abc123" { - found = true - } - } - if !found { - t.Error("expected ConfigVersion tag") - } - } - return &ec2.RunInstancesOutput{ - Instances: []ec2types.Instance{{InstanceId: aws.String("i-tagged")}}, - }, nil + if isErrCode(err, "SomeOtherCode") { + t.Error("expected false") } - h := newTestHandler(mock) - h.ConfigVersion = "abc123" - h.createNAT(context.Background(), testAZ, testVPC) }) - t.Run("no tag when config version empty", func(t *testing.T) { - mock := &mockEC2{} - setupLTAndAMI(mock) - mock.RunInstancesFn = func(ctx context.Context, params *ec2.RunInstancesInput, optFns ...func(*ec2.Options)) (*ec2.RunInstancesOutput, error) { - if len(params.TagSpecifications) != 0 { - 
t.Error("expected no TagSpecifications") - } - return &ec2.RunInstancesOutput{ - Instances: []ec2types.Instance{{InstanceId: aws.String("i-notag")}}, - }, nil + t.Run("string fallback", func(t *testing.T) { + err := fmt.Errorf("InvalidAssociationID.NotFound: blah") + if !isErrCode(err, "InvalidAssociationID.NotFound") { + t.Error("expected true") } - h := newTestHandler(mock) - h.ConfigVersion = "" - h.createNAT(context.Background(), testAZ, testVPC) }) } diff --git a/cmd/lambda/handler.go b/cmd/lambda/handler.go index fba4eac..a100b4a 100644 --- a/cmd/lambda/handler.go +++ b/cmd/lambda/handler.go @@ -3,7 +3,6 @@ package main import ( "context" "log" - "time" ) // Event is the Lambda input payload. @@ -24,9 +23,6 @@ type Handler struct { AMIOwner string AMIPattern string ConfigVersion string - - // SleepFunc can be replaced in tests to eliminate real waits. - SleepFunc func(time.Duration) } // HandleRequest is the Lambda entry point. @@ -42,120 +38,122 @@ func (h *Handler) handle(ctx context.Context, event Event) error { return nil } - iid, state := event.InstanceID, event.State - log.Printf("instance=%s state=%s", iid, state) - - ignore, isNAT, az, vpc := h.classify(ctx, iid) - if ignore { - // If the instance can no longer be found (e.g. terminated and gone - // from the API), fall back to a VPC-wide sweep so we don't miss the - // scale-down opportunity. - if isTerminating(state) { - log.Printf("Instance %s gone (state=%s), sweeping for idle NATs", iid, state) - h.sweepIdleNATs(ctx, iid) - } - return nil - } - - // NAT events → manage EIP via EventBridge - if isNAT { - if isStarting(state) { - h.attachEIP(ctx, iid, az) - } else if isStopping(state) { - h.detachEIP(ctx, iid, az) - } else if isTerminating(state) { - // R11: NAT terminated without a stop cycle (e.g. replaceNAT, - // spot reclaim, manual termination). The stopping/stopped events - // that trigger detachEIP will never fire, so sweep orphan EIPs. 
-			h.sweepOrphanEIPs(ctx, az)
-		}
-		return nil
-	}
+	log.Printf("instance=%s state=%s", event.InstanceID, event.State)
 
-	// Workload events → manage NAT lifecycle
-	nat := h.findNAT(ctx, az, vpc)
-
-	if isStarting(state) {
-		h.ensureNAT(ctx, nat, az, vpc)
+	az, vpc := h.resolveAZ(ctx, event.InstanceID)
+	if az == "" {
+		// Instance gone from API or wrong VPC/ignored — sweep all AZs.
+		h.sweepAllAZs(ctx)
 		return nil
 	}
 
-	if isStopping(state) || isTerminating(state) {
-		h.maybeStopNAT(ctx, nat, az, vpc, iid)
-	}
+	h.reconcile(ctx, az, vpc)
 	return nil
 }
 
-// ensureNAT ensures a NAT instance is running in the given AZ.
-func (h *Handler) ensureNAT(ctx context.Context, nat *Instance, az, vpc string) {
-	if nat == nil || isTerminating(nat.StateName) {
-		if nat != nil {
-			log.Printf("NAT %s terminated, creating new", nat.InstanceID)
-		} else {
-			log.Printf("Creating NAT in %s", az)
-		}
-		h.createNAT(ctx, az, vpc)
-		return
+// resolveAZ looks up the trigger instance to determine which AZ to reconcile.
+// Returns ("", "") if the instance is gone, wrong VPC, or has the ignore tag.
+func (h *Handler) resolveAZ(ctx context.Context, instanceID string) (az, vpc string) {
+	defer timed("resolve_az")()
+	inst := h.getInstance(ctx, instanceID)
+	if inst == nil {
+		return "", ""
 	}
-	if !h.isCurrentConfig(nat) {
-		log.Printf("NAT %s has outdated config, replacing", nat.InstanceID)
-		h.replaceNAT(ctx, nat, az, vpc)
-		return
+	if inst.VpcID != h.TargetVPC {
+		return "", ""
 	}
-	if isStopping(nat.StateName) {
-		log.Printf("Starting NAT %s", nat.InstanceID)
-		h.startNAT(ctx, nat, az)
+	if hasTag(inst.Tags, h.IgnoreTagKey, h.IgnoreTagValue) {
+		return "", ""
 	}
+	return inst.AZ, inst.VpcID
 }
 
-// maybeStopNAT stops the NAT if no sibling workloads remain.
-// triggerID is the instance whose state change triggered this check; it is
-// excluded from the sibling query so that a dying workload doesn't count
-// itself as a reason to keep the NAT alive.
-func (h *Handler) maybeStopNAT(ctx context.Context, nat *Instance, az, vpc, triggerID string) {
-	if nat == nil {
-		return
+// sweepAllAZs reconciles every AZ that has a launch template configured.
+func (h *Handler) sweepAllAZs(ctx context.Context) {
+	defer timed("sweep_all_azs")()
+	azs := h.findConfiguredAZs(ctx)
+	for _, az := range azs {
+		h.reconcile(ctx, az, h.TargetVPC)
 	}
-	// Retry to let EC2 API eventual consistency settle.
-	// Sleep before each check so DescribeInstances reflects the latest state.
-	var siblings []*Instance
-	for attempt := 0; attempt < 3; attempt++ {
-		if attempt > 0 {
-			h.sleep(2 * time.Second)
+}
+
+// reconcile observes the current state of workloads, NAT, and EIPs in an AZ,
+// then takes at most one mutating action to converge toward the desired state.
+func (h *Handler) reconcile(ctx context.Context, az, vpc string) {
+	defer timed("reconcile")()
+
+	workloads := h.findWorkloads(ctx, az, vpc)
+	nats := h.findNATs(ctx, az, vpc)
+	eips := h.findEIPs(ctx, az)
+
+	needNAT := len(workloads) > 0
+
+	// --- Duplicate NAT cleanup (before anything else) ---
+	if len(nats) > 1 {
+		nats = h.terminateDuplicateNATs(ctx, nats)
+	}
+
+	var nat *Instance
+	if len(nats) > 0 {
+		nat = nats[0]
+	}
+
+	// --- NAT convergence (one action per invocation) ---
+	if needNAT {
+		if nat == nil || nat.StateName == "shutting-down" || nat.StateName == "terminated" {
+			log.Printf("Creating NAT in %s (workloads=%d)", az, len(workloads))
+			h.createNAT(ctx, az, vpc)
+			return
 		}
-		siblings = h.findSiblings(ctx, az, vpc, triggerID)
-		if len(siblings) == 0 {
-			break
+		if !h.isCurrentConfig(nat) {
+			log.Printf("NAT %s has outdated config, terminating for replacement", nat.InstanceID)
+			h.terminateInstance(ctx, nat.InstanceID)
+			return
 		}
-		log.Printf("Siblings found in %s (attempt %d/3), rechecking", az, attempt+1)
+		if nat.StateName == "stopped" {
+			log.Printf("Starting NAT %s", nat.InstanceID)
+			h.startInstance(ctx, nat.InstanceID)
+			return
+		}
+		if nat.StateName == "stopping" {
+			log.Printf("NAT %s is stopping, waiting for next event", nat.InstanceID)
+			return
+		}
+		// nat is pending or running — good
+	} else {
+		if nat != nil && (nat.StateName == "running" || nat.StateName == "pending") {
+			log.Printf("No workloads in %s, stopping NAT %s", az, nat.InstanceID)
+			h.stopInstance(ctx, nat.InstanceID)
+			return
+		}
+		// nat is stopping/stopped/nil — good
 	}
-	if len(siblings) > 0 {
-		log.Printf("Siblings still running in %s after retries, keeping NAT", az)
+
+	// --- EIP convergence ---
+	natRunning := nat != nil && nat.StateName == "running"
+	if natRunning && len(eips) == 0 {
+		log.Printf("NAT %s running with no EIP, allocating", nat.InstanceID)
+		h.allocateAndAttachEIP(ctx, nat, az)
 		return
 	}
-
-	if isStarting(nat.StateName) {
-		log.Printf("No siblings, stopping NAT %s", nat.InstanceID)
-		h.stopNAT(ctx, nat)
+	if !natRunning && len(eips) > 0 {
+		log.Printf("NAT not running, releasing %d EIP(s) in %s", len(eips), az)
+		h.releaseEIPs(ctx, eips)
+		return
 	}
-}
-
-func (h *Handler) sleep(d time.Duration) {
-	if h.SleepFunc != nil {
-		h.SleepFunc(d)
+	if len(eips) > 1 {
+		log.Printf("Multiple EIPs (%d) in %s, releasing extras", len(eips), az)
+		h.releaseEIPs(ctx, eips[1:])
 		return
 	}
-	time.Sleep(d)
-}
-
-func isStarting(state string) bool {
-	return state == "pending" || state == "running"
+
+	log.Printf("Reconcile %s: converged (workloads=%d, nat=%s, eips=%d)",
+		az, len(workloads), natState(nat), len(eips))
 }
 
-func isStopping(state string) bool {
-	return state == "stopping" || state == "stopped"
-}
-
-func isTerminating(state string) bool {
-	return state == "shutting-down" || state == "terminated"
+func natState(nat *Instance) string {
+	if nat == nil {
+		return "none"
+	}
+	return nat.StateName
 }
diff --git a/cmd/lambda/handler_test.go b/cmd/lambda/handler_test.go
index 931d236..7dfce1b 100644
--- a/cmd/lambda/handler_test.go
+++ b/cmd/lambda/handler_test.go
@@ -2,7 +2,6 @@ package main
 
 import (
 	"context"
-	"sync/atomic"
 	"testing"
"github.com/aws/aws-sdk-go-v2/aws" @@ -30,252 +29,93 @@ func TestHandlerCleanup(t *testing.T) { t.Error("expected DescribeInstances to be called during cleanup") } }) - - t.Run("cleanup action ignores other fields", func(t *testing.T) { - mock := &mockEC2{} - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(), nil - } - mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { - return &ec2.DescribeAddressesOutput{Addresses: []ec2types.Address{}}, nil - } - h := newTestHandler(mock) - err := h.HandleRequest(context.Background(), Event{ - Action: "cleanup", InstanceID: "i-1", State: "running", - }) - if err != nil { - t.Fatalf("unexpected error: %v", err) - } - }) } -// --- Ignored instances --- +// --- resolveAZ --- -func TestHandlerIgnored(t *testing.T) { - t.Run("ignored instance returns early", func(t *testing.T) { +func TestResolveAZ(t *testing.T) { + t.Run("instance not found triggers sweep", func(t *testing.T) { mock := &mockEC2{} mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { return describeResponse(), nil } - h := newTestHandler(mock) - err := h.HandleRequest(context.Background(), Event{InstanceID: "i-skip", State: "running"}) - if err != nil { - t.Fatalf("unexpected error: %v", err) - } - if mock.callCount("DescribeInstances") != 1 { - t.Errorf("expected 1 DescribeInstances call (classify), got %d", mock.callCount("DescribeInstances")) - } - }) - - t.Run("terminated event sweeps idle NATs when instance gone", func(t *testing.T) { - mock := &mockEC2{} - natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} - natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, 
natTags, nil) - var callIdx int32 - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - idx := atomic.AddInt32(&callIdx, 1) - if idx == 1 { - // classify: instance not found (already gone from API) - return describeResponse(), nil - } - if params.Filters != nil { - for _, f := range params.Filters { - if aws.ToString(f.Name) == "tag:nat-zero:managed" { - return describeResponse(natInst), nil - } - } - } - // findSiblings: no siblings - return describeResponse(), nil + // Sweep will call DescribeLaunchTemplates but find nothing + mock.DescribeLaunchTemplatesFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplatesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplatesOutput, error) { + return &ec2.DescribeLaunchTemplatesOutput{}, nil } h := newTestHandler(mock) err := h.HandleRequest(context.Background(), Event{InstanceID: "i-gone", State: "terminated"}) if err != nil { t.Fatalf("unexpected error: %v", err) } - if mock.callCount("StopInstances") != 1 { - t.Errorf("expected StopInstances via sweep, got %d", mock.callCount("StopInstances")) + // DescribeLaunchTemplates called for sweep + if mock.callCount("DescribeLaunchTemplates") != 1 { + t.Errorf("expected sweep via DescribeLaunchTemplates, got %d", mock.callCount("DescribeLaunchTemplates")) } }) - t.Run("non-terminating ignored event does not sweep", func(t *testing.T) { + t.Run("wrong VPC triggers sweep", func(t *testing.T) { mock := &mockEC2{} + inst := makeTestInstance("i-other", "running", "vpc-other", testAZ, nil, nil) mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + if len(params.InstanceIds) > 0 { + return describeResponse(inst), nil + } return describeResponse(), nil } - h := newTestHandler(mock) - err := h.HandleRequest(context.Background(), Event{InstanceID: "i-skip", 
State: "running"}) - if err != nil { - t.Fatalf("unexpected error: %v", err) - } - if mock.callCount("StopInstances") != 0 { - t.Error("expected no sweep for non-terminating event") - } - }) -} - -// --- NAT instance events (EventBridge-driven EIP management) --- - -func TestHandlerNatEvents(t *testing.T) { - natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} - - t.Run("running NAT triggers attachEIP", func(t *testing.T) { - mock := &mockEC2{} - eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) - natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, []ec2types.InstanceNetworkInterface{eni}) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(natInst), nil - } - mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { - return &ec2.AllocateAddressOutput{AllocationId: aws.String("eipalloc-1"), PublicIp: aws.String("1.2.3.4")}, nil - } - mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { - return &ec2.DescribeNetworkInterfacesOutput{ - NetworkInterfaces: []ec2types.NetworkInterface{{ - NetworkInterfaceId: aws.String("eni-pub1"), - }}, - }, nil - } - mock.AssociateAddressFn = func(ctx context.Context, params *ec2.AssociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AssociateAddressOutput, error) { - return &ec2.AssociateAddressOutput{}, nil + mock.DescribeLaunchTemplatesFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplatesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplatesOutput, error) { + return &ec2.DescribeLaunchTemplatesOutput{}, nil } h := newTestHandler(mock) - err := h.HandleRequest(context.Background(), 
Event{InstanceID: "i-nat1", State: "running"}) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-other", State: "running"}) if err != nil { t.Fatalf("unexpected error: %v", err) } - if mock.callCount("AllocateAddress") != 1 { - t.Error("expected AllocateAddress for NAT running event") - } - if mock.callCount("AssociateAddress") != 1 { - t.Error("expected AssociateAddress for NAT running event") - } }) - t.Run("pending NAT triggers attachEIP", func(t *testing.T) { + t.Run("ignored instance triggers sweep", func(t *testing.T) { mock := &mockEC2{} - eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) - // classify returns pending NAT, then waitForState polls until running - var describeCount int32 + inst := makeTestInstance("i-ign", "running", testVPC, testAZ, + []ec2types.Tag{{Key: aws.String("nat-zero:ignore"), Value: aws.String("true")}}, nil) mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - idx := atomic.AddInt32(&describeCount, 1) - if idx == 1 { - // classify - return describeResponse(makeTestInstance("i-nat1", "pending", testVPC, testAZ, natTags, nil)), nil + if len(params.InstanceIds) > 0 { + return describeResponse(inst), nil } - // waitForState + getInstance — return running - return describeResponse(makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, []ec2types.InstanceNetworkInterface{eni})), nil - } - mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { - return &ec2.AllocateAddressOutput{AllocationId: aws.String("eipalloc-1"), PublicIp: aws.String("1.2.3.4")}, nil - } - mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { - return &ec2.DescribeNetworkInterfacesOutput{ - NetworkInterfaces: 
[]ec2types.NetworkInterface{{ - NetworkInterfaceId: aws.String("eni-pub1"), - }}, - }, nil - } - mock.AssociateAddressFn = func(ctx context.Context, params *ec2.AssociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AssociateAddressOutput, error) { - return &ec2.AssociateAddressOutput{}, nil - } - h := newTestHandler(mock) - err := h.HandleRequest(context.Background(), Event{InstanceID: "i-nat1", State: "pending"}) - if err != nil { - t.Fatalf("unexpected error: %v", err) - } - if mock.callCount("AllocateAddress") != 1 { - t.Error("expected AllocateAddress for NAT pending event") - } - }) - - t.Run("running NAT with existing EIP is noop", func(t *testing.T) { - mock := &mockEC2{} - assoc := &ec2types.InstanceNetworkInterfaceAssociation{PublicIp: aws.String("5.6.7.8")} - eni := makeENI("eni-pub1", 0, "10.0.1.10", assoc) - natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, []ec2types.InstanceNetworkInterface{eni}) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(natInst), nil - } - h := newTestHandler(mock) - err := h.HandleRequest(context.Background(), Event{InstanceID: "i-nat1", State: "running"}) - if err != nil { - t.Fatalf("unexpected error: %v", err) - } - if mock.callCount("AllocateAddress") != 0 { - t.Error("expected no AllocateAddress when ENI already has EIP") - } - }) - - t.Run("stopped NAT triggers detachEIP", func(t *testing.T) { - mock := &mockEC2{} - eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) - natInst := makeTestInstance("i-nat1", "stopped", testVPC, testAZ, natTags, []ec2types.InstanceNetworkInterface{eni}) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(natInst), nil - } - mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params 
*ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { - return &ec2.DescribeNetworkInterfacesOutput{ - NetworkInterfaces: []ec2types.NetworkInterface{{ - NetworkInterfaceId: aws.String("eni-pub1"), - Association: &ec2types.NetworkInterfaceAssociation{ - AssociationId: aws.String("eipassoc-1"), - AllocationId: aws.String("eipalloc-1"), - PublicIp: aws.String("1.2.3.4"), - }, - }}, - }, nil - } - mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { - return &ec2.DescribeAddressesOutput{Addresses: []ec2types.Address{}}, nil - } - h := newTestHandler(mock) - err := h.HandleRequest(context.Background(), Event{InstanceID: "i-nat1", State: "stopped"}) - if err != nil { - t.Fatalf("unexpected error: %v", err) - } - if mock.callCount("DisassociateAddress") != 1 { - t.Error("expected DisassociateAddress for NAT stopped event") - } - if mock.callCount("ReleaseAddress") != 1 { - t.Error("expected ReleaseAddress for NAT stopped event") + return describeResponse(), nil } - }) - - t.Run("terminated NAT sweeps orphan EIPs", func(t *testing.T) { - mock := &mockEC2{} - natInst := makeTestInstance("i-nat1", "terminated", testVPC, testAZ, natTags, nil) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(natInst), nil + mock.DescribeLaunchTemplatesFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplatesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplatesOutput, error) { + return &ec2.DescribeLaunchTemplatesOutput{}, nil } h := newTestHandler(mock) - err := h.HandleRequest(context.Background(), Event{InstanceID: "i-nat1", State: "terminated"}) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-ign", State: "running"}) if err != nil { 
t.Fatalf("unexpected error: %v", err) } - if mock.callCount("AllocateAddress") != 0 { - t.Error("expected no AllocateAddress for terminated NAT") - } - // sweepOrphanEIPs runs (DescribeAddresses called) - if mock.callCount("DescribeAddresses") != 1 { - t.Errorf("expected DescribeAddresses=1 (orphan sweep), got %d", mock.callCount("DescribeAddresses")) - } }) } -// --- Workload scale-up --- +// --- Reconcile: scale-up --- -func TestHandlerWorkloadScaleUp(t *testing.T) { +func TestReconcileScaleUp(t *testing.T) { workTags := []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}} - t.Run("no NAT creates one", func(t *testing.T) { + t.Run("workloads exist no NAT creates one", func(t *testing.T) { mock := &mockEC2{} - workInst := makeTestInstance("i-work1", "pending", testVPC, testAZ, workTags, nil) + workInst := makeTestInstance("i-work1", "running", testVPC, testAZ, workTags, nil) mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - if len(params.InstanceIds) > 0 && params.InstanceIds[0] == "i-work1" { + if len(params.InstanceIds) > 0 { return describeResponse(workInst), nil } - return describeResponse(), nil + // Filter queries + for _, f := range params.Filters { + if aws.ToString(f.Name) == "tag:nat-zero:managed" { + return describeResponse(), nil // no NAT + } + } + return describeResponse(workInst), nil // workloads query + } + mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + return &ec2.DescribeAddressesOutput{}, nil } mock.DescribeLaunchTemplatesFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplatesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplatesOutput, error) { return &ec2.DescribeLaunchTemplatesOutput{ @@ -305,29 +145,26 @@ func TestHandlerWorkloadScaleUp(t *testing.T) { if 
mock.callCount("RunInstances") != 1 { t.Error("expected RunInstances to be called") } - // EIP is NOT managed inline anymore — no AllocateAddress expected - if mock.callCount("AllocateAddress") != 0 { - t.Error("expected no AllocateAddress (EIP managed via EventBridge)") - } }) - t.Run("stopped NAT starts it", func(t *testing.T) { + t.Run("workloads exist stopped NAT starts it", func(t *testing.T) { mock := &mockEC2{} - workInst := makeTestInstance("i-work1", "pending", testVPC, testAZ, workTags, nil) + workInst := makeTestInstance("i-work1", "running", testVPC, testAZ, workTags, nil) natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} natInst := makeTestInstance("i-nat1", "stopped", testVPC, testAZ, natTags, nil) mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - if len(params.InstanceIds) > 0 && params.InstanceIds[0] == "i-work1" { + if len(params.InstanceIds) > 0 { return describeResponse(workInst), nil } - if params.Filters != nil { - return describeResponse(natInst), nil + for _, f := range params.Filters { + if aws.ToString(f.Name) == "tag:nat-zero:managed" { + return describeResponse(natInst), nil + } } - // waitForState for NAT → return stopped - return describeResponse(natInst), nil + return describeResponse(workInst), nil } - mock.StartInstancesFn = func(ctx context.Context, params *ec2.StartInstancesInput, optFns ...func(*ec2.Options)) (*ec2.StartInstancesOutput, error) { - return &ec2.StartInstancesOutput{}, nil + mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + return &ec2.DescribeAddressesOutput{}, nil } h := newTestHandler(mock) err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "pending"}) @@ -337,365 +174,455 @@ func TestHandlerWorkloadScaleUp(t *testing.T) { if 
mock.callCount("StartInstances") != 1 { t.Error("expected StartInstances to be called") } - // EIP is NOT managed inline anymore - if mock.callCount("AllocateAddress") != 0 { - t.Error("expected no AllocateAddress (EIP managed via EventBridge)") - } }) - t.Run("running NAT is noop", func(t *testing.T) { + t.Run("workloads exist running NAT is noop", func(t *testing.T) { mock := &mockEC2{} - workInst := makeTestInstance("i-work1", "pending", testVPC, testAZ, workTags, nil) + workInst := makeTestInstance("i-work1", "running", testVPC, testAZ, workTags, nil) natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} - natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil) - var callIdx int32 + eni := makeENI("eni-pub1", 0, "10.0.1.10", &ec2types.InstanceNetworkInterfaceAssociation{PublicIp: aws.String("1.2.3.4")}) + natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, []ec2types.InstanceNetworkInterface{eni}) mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - idx := atomic.AddInt32(&callIdx, 1) - if idx == 1 { + if len(params.InstanceIds) > 0 { return describeResponse(workInst), nil } - return describeResponse(natInst), nil + for _, f := range params.Filters { + if aws.ToString(f.Name) == "tag:nat-zero:managed" { + return describeResponse(natInst), nil + } + } + return describeResponse(workInst), nil + } + mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + return &ec2.DescribeAddressesOutput{ + Addresses: []ec2types.Address{{AllocationId: aws.String("eipalloc-1")}}, + }, nil } h := newTestHandler(mock) - err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "pending"}) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", 
State: "running"}) if err != nil { t.Fatalf("unexpected error: %v", err) } if mock.callCount("RunInstances") != 0 { - t.Error("expected RunInstances NOT to be called") + t.Error("expected no RunInstances") } if mock.callCount("StartInstances") != 0 { - t.Error("expected StartInstances NOT to be called") + t.Error("expected no StartInstances") + } + if mock.callCount("StopInstances") != 0 { + t.Error("expected no StopInstances") } }) - t.Run("terminated NAT creates new", func(t *testing.T) { + t.Run("workloads exist stopping NAT waits", func(t *testing.T) { mock := &mockEC2{} workInst := makeTestInstance("i-work1", "running", testVPC, testAZ, workTags, nil) natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} - natInst := makeTestInstance("i-nat1", "terminated", testVPC, testAZ, natTags, nil) - var callIdx int32 + natInst := makeTestInstance("i-nat1", "stopping", testVPC, testAZ, natTags, nil) mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - idx := atomic.AddInt32(&callIdx, 1) - if idx == 1 { + if len(params.InstanceIds) > 0 { return describeResponse(workInst), nil } - return describeResponse(natInst), nil - } - mock.DescribeLaunchTemplatesFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplatesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplatesOutput, error) { - return &ec2.DescribeLaunchTemplatesOutput{ - LaunchTemplates: []ec2types.LaunchTemplate{{LaunchTemplateId: aws.String("lt-123")}}, - }, nil - } - mock.DescribeLaunchTemplateVersionsFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplateVersionsInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplateVersionsOutput, error) { - return &ec2.DescribeLaunchTemplateVersionsOutput{ - LaunchTemplateVersions: []ec2types.LaunchTemplateVersion{{ - LaunchTemplateId: aws.String("lt-123"), VersionNumber: aws.Int64(1), - }}, - }, 
nil - } - mock.DescribeImagesFn = func(ctx context.Context, params *ec2.DescribeImagesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeImagesOutput, error) { - return &ec2.DescribeImagesOutput{Images: []ec2types.Image{}}, nil + for _, f := range params.Filters { + if aws.ToString(f.Name) == "tag:nat-zero:managed" { + return describeResponse(natInst), nil + } + } + return describeResponse(workInst), nil } - mock.RunInstancesFn = func(ctx context.Context, params *ec2.RunInstancesInput, optFns ...func(*ec2.Options)) (*ec2.RunInstancesOutput, error) { - return &ec2.RunInstancesOutput{ - Instances: []ec2types.Instance{{InstanceId: aws.String("i-new")}}, - }, nil + mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + return &ec2.DescribeAddressesOutput{}, nil } h := newTestHandler(mock) - err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "running"}) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "pending"}) if err != nil { t.Fatalf("unexpected error: %v", err) } - if mock.callCount("RunInstances") != 1 { - t.Error("expected RunInstances to be called") + // No action — wait for next event when NAT reaches stopped + if mock.callCount("StartInstances") != 0 { + t.Error("expected no StartInstances (NAT is stopping)") + } + if mock.callCount("RunInstances") != 0 { + t.Error("expected no RunInstances (NAT is stopping)") } }) +} + +// --- Reconcile: scale-down --- - t.Run("shutting-down NAT creates new", func(t *testing.T) { +func TestReconcileScaleDown(t *testing.T) { + natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} + + t.Run("no workloads stops running NAT", func(t *testing.T) { mock := &mockEC2{} - workInst := makeTestInstance("i-work1", "pending", testVPC, testAZ, workTags, nil) - natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: 
aws.String("true")}}
-		natInst := makeTestInstance("i-nat1", "shutting-down", testVPC, testAZ, natTags, nil)
-		var callIdx int32
+		// Trigger is a workload that's shutting down (resolveAZ finds it)
+		workInst := makeTestInstance("i-work1", "shutting-down", testVPC, testAZ, nil, nil)
+		natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil)
 		mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) {
-			idx := atomic.AddInt32(&callIdx, 1)
-			if idx == 1 {
+			if len(params.InstanceIds) > 0 {
 				return describeResponse(workInst), nil
 			}
-			return describeResponse(natInst), nil
-		}
-		mock.DescribeLaunchTemplatesFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplatesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplatesOutput, error) {
-			return &ec2.DescribeLaunchTemplatesOutput{
-				LaunchTemplates: []ec2types.LaunchTemplate{{LaunchTemplateId: aws.String("lt-123")}},
-			}, nil
-		}
-		mock.DescribeLaunchTemplateVersionsFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplateVersionsInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplateVersionsOutput, error) {
-			return &ec2.DescribeLaunchTemplateVersionsOutput{
-				LaunchTemplateVersions: []ec2types.LaunchTemplateVersion{{
-					LaunchTemplateId: aws.String("lt-123"), VersionNumber: aws.Int64(1),
-				}},
-			}, nil
-		}
-		mock.DescribeImagesFn = func(ctx context.Context, params *ec2.DescribeImagesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeImagesOutput, error) {
-			return &ec2.DescribeImagesOutput{Images: []ec2types.Image{}}, nil
+			for _, f := range params.Filters {
+				if aws.ToString(f.Name) == "tag:nat-zero:managed" {
+					return describeResponse(natInst), nil
+				}
+			}
+			// workloads query: nothing pending/running
+			return describeResponse(), nil
 		}
-		mock.RunInstancesFn = func(ctx context.Context, params *ec2.RunInstancesInput, optFns ...func(*ec2.Options)) (*ec2.RunInstancesOutput, error) {
-			return &ec2.RunInstancesOutput{
-				Instances: []ec2types.Instance{{InstanceId: aws.String("i-new")}},
-			}, nil
+		mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) {
+			return &ec2.DescribeAddressesOutput{}, nil
 		}
 		h := newTestHandler(mock)
-		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "pending"})
+		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "shutting-down"})
 		if err != nil {
 			t.Fatalf("unexpected error: %v", err)
 		}
-		if mock.callCount("RunInstances") != 1 {
-			t.Error("expected RunInstances to be called")
+		if mock.callCount("StopInstances") != 1 {
+			t.Errorf("expected StopInstances=1, got %d", mock.callCount("StopInstances"))
 		}
 	})
-}
-
-// --- Workload scale-down ---
-
-func TestHandlerWorkloadScaleDown(t *testing.T) {
-	workTags := []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}}
-	natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}}
-
-	t.Run("no NAT returns early", func(t *testing.T) {
+	t.Run("no workloads stopped NAT is noop", func(t *testing.T) {
 		mock := &mockEC2{}
-		workInst := makeTestInstance("i-work1", "stopping", testVPC, testAZ, workTags, nil)
-		var callIdx int32
+		workInst := makeTestInstance("i-work1", "terminated", testVPC, testAZ, nil, nil)
+		natInst := makeTestInstance("i-nat1", "stopped", testVPC, testAZ, natTags, nil)
 		mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) {
-			idx := atomic.AddInt32(&callIdx, 1)
-			if idx == 1 {
+			if len(params.InstanceIds) > 0 {
 				return describeResponse(workInst), nil
 			}
+			for _, f := range params.Filters {
+				if aws.ToString(f.Name) == "tag:nat-zero:managed" {
+					return describeResponse(natInst), nil
+				}
+			}
 			return describeResponse(), nil
 		}
+		mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) {
+			return &ec2.DescribeAddressesOutput{}, nil
+		}
 		h := newTestHandler(mock)
-		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "stopping"})
+		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "terminated"})
 		if err != nil {
 			t.Fatalf("unexpected error: %v", err)
 		}
 		if mock.callCount("StopInstances") != 0 {
-			t.Error("expected StopInstances NOT to be called")
+			t.Error("expected no StopInstances (NAT already stopped)")
 		}
 	})
 
-	t.Run("siblings exist keeps NAT", func(t *testing.T) {
+	t.Run("workloads exist keeps NAT running", func(t *testing.T) {
 		mock := &mockEC2{}
-		workInst := makeTestInstance("i-work1", "stopping", testVPC, testAZ, workTags, nil)
-		natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil)
+		workTags := []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}}
+		triggerInst := makeTestInstance("i-work1", "stopping", testVPC, testAZ, workTags, nil)
 		sibInst := makeTestInstance("i-sib1", "running", testVPC, testAZ, workTags, nil)
-		var callIdx int32
+		natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil)
 		mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) {
-			idx := atomic.AddInt32(&callIdx, 1)
-			if idx == 1 {
-				return describeResponse(workInst), nil
+			if len(params.InstanceIds) > 0 {
+				return describeResponse(triggerInst), nil
 			}
-			if params.Filters != nil {
-				for _, f := range params.Filters {
-					if aws.ToString(f.Name) == "tag:nat-zero:managed" {
-						return describeResponse(natInst), nil
-					}
+			for _, f := range params.Filters {
+				if aws.ToString(f.Name) == "tag:nat-zero:managed" {
+					return describeResponse(natInst), nil
 				}
 			}
 			return describeResponse(sibInst), nil
 		}
+		mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) {
+			return &ec2.DescribeAddressesOutput{
+				Addresses: []ec2types.Address{{AllocationId: aws.String("eipalloc-1")}},
+			}, nil
+		}
 		h := newTestHandler(mock)
 		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "stopping"})
 		if err != nil {
 			t.Fatalf("unexpected error: %v", err)
 		}
 		if mock.callCount("StopInstances") != 0 {
-			t.Error("expected StopInstances NOT to be called")
+			t.Error("expected no StopInstances (siblings exist)")
 		}
 	})
 
-	t.Run("no siblings stops running NAT", func(t *testing.T) {
+	t.Run("no workloads no NAT is noop", func(t *testing.T) {
 		mock := &mockEC2{}
-		workInst := makeTestInstance("i-work1", "terminated", testVPC, testAZ, workTags, nil)
-		natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil)
-		var callIdx int32
+		workInst := makeTestInstance("i-work1", "terminated", testVPC, testAZ, nil, nil)
 		mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) {
-			idx := atomic.AddInt32(&callIdx, 1)
-			if idx == 1 {
+			if len(params.InstanceIds) > 0 {
 				return describeResponse(workInst), nil
 			}
-			if params.Filters != nil {
-				for _, f := range params.Filters {
-					if aws.ToString(f.Name) == "tag:nat-zero:managed" {
-						return describeResponse(natInst), nil
-					}
-				}
-			}
 			return describeResponse(), nil
 		}
+		mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) {
+			return &ec2.DescribeAddressesOutput{}, nil
+		}
 		h := newTestHandler(mock)
 		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "terminated"})
 		if err != nil {
 			t.Fatalf("unexpected error: %v", err)
 		}
-		if mock.callCount("StopInstances") != 1 {
-			t.Errorf("expected StopInstances to be called once, got %d", mock.callCount("StopInstances"))
+		if mock.callCount("StopInstances") != 0 {
+			t.Error("expected no StopInstances")
+		}
+	})
+}
+
+// --- Reconcile: EIP convergence ---
+
+func TestReconcileEIP(t *testing.T) {
+	natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}}
+	workTags := []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}}
+
+	t.Run("running NAT no EIP allocates one", func(t *testing.T) {
+		mock := &mockEC2{}
+		workInst := makeTestInstance("i-work1", "running", testVPC, testAZ, workTags, nil)
+		eni := makeENI("eni-pub1", 0, "10.0.1.10", nil)
+		natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, []ec2types.InstanceNetworkInterface{eni})
+		mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) {
+			if len(params.InstanceIds) > 0 {
+				return describeResponse(workInst), nil
+			}
+			for _, f := range params.Filters {
+				if aws.ToString(f.Name) == "tag:nat-zero:managed" {
+					return describeResponse(natInst), nil
+				}
+			}
+			return describeResponse(workInst), nil
 		}
-		// EIP is NOT released inline anymore — detachEIP happens via EventBridge
-		if mock.callCount("DisassociateAddress") != 0 {
-			t.Error("expected no DisassociateAddress (EIP managed via EventBridge)")
+		mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) {
+			return &ec2.DescribeAddressesOutput{}, nil
+		}
+		mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) {
+			return &ec2.AllocateAddressOutput{AllocationId: aws.String("eipalloc-1"), PublicIp: aws.String("1.2.3.4")}, nil
+		}
+		mock.AssociateAddressFn = func(ctx context.Context, params *ec2.AssociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AssociateAddressOutput, error) {
+			return &ec2.AssociateAddressOutput{}, nil
+		}
+		h := newTestHandler(mock)
+		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "running"})
+		if err != nil {
+			t.Fatalf("unexpected error: %v", err)
+		}
+		if mock.callCount("AllocateAddress") != 1 {
+			t.Error("expected AllocateAddress")
+		}
+		if mock.callCount("AssociateAddress") != 1 {
+			t.Error("expected AssociateAddress")
 		}
 	})
 
-	t.Run("no siblings NAT already stopped is noop", func(t *testing.T) {
+	t.Run("NAT not running releases EIPs", func(t *testing.T) {
 		mock := &mockEC2{}
-		workInst := makeTestInstance("i-work1", "stopping", testVPC, testAZ, workTags, nil)
 		natInst := makeTestInstance("i-nat1", "stopped", testVPC, testAZ, natTags, nil)
-		var callIdx int32
+		// NAT stopped event — resolveAZ finds the NAT itself (it's in our VPC, no ignore tag)
 		mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) {
-			idx := atomic.AddInt32(&callIdx, 1)
-			if idx == 1 {
-				return describeResponse(workInst), nil
+			if len(params.InstanceIds) > 0 {
+				return describeResponse(natInst), nil
 			}
-			if params.Filters != nil {
-				for _, f := range params.Filters {
-					if aws.ToString(f.Name) == "tag:nat-zero:managed" {
-						return describeResponse(natInst), nil
-					}
+			for _, f := range params.Filters {
+				if aws.ToString(f.Name) == "tag:nat-zero:managed" {
+					return describeResponse(natInst), nil
 				}
 			}
-			return describeResponse(), nil
+			return describeResponse(), nil // no workloads
+		}
+		mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) {
+			return &ec2.DescribeAddressesOutput{
+				Addresses: []ec2types.Address{{
+					AllocationId:  aws.String("eipalloc-1"),
+					AssociationId: aws.String("eipassoc-1"),
+				}},
+			}, nil
 		}
 		h := newTestHandler(mock)
-		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "stopping"})
+		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-nat1", State: "stopped"})
 		if err != nil {
 			t.Fatalf("unexpected error: %v", err)
 		}
-		if mock.callCount("StopInstances") != 0 {
-			t.Error("expected StopInstances NOT to be called")
+		if mock.callCount("DisassociateAddress") != 1 {
+			t.Errorf("expected DisassociateAddress=1, got %d", mock.callCount("DisassociateAddress"))
+		}
+		if mock.callCount("ReleaseAddress") != 1 {
+			t.Errorf("expected ReleaseAddress=1, got %d", mock.callCount("ReleaseAddress"))
 		}
 	})
 
-	t.Run("persistent siblings keeps NAT after retries", func(t *testing.T) {
+	t.Run("multiple EIPs releases extras", func(t *testing.T) {
 		mock := &mockEC2{}
-		workInst := makeTestInstance("i-work1", "stopping", testVPC, testAZ, workTags, nil)
-		natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil)
-		sibInst := makeTestInstance("i-sib1", "running", testVPC, testAZ, workTags, nil)
-		var callIdx int32
+		workInst := makeTestInstance("i-work1", "running", testVPC, testAZ, workTags, nil)
+		eni := makeENI("eni-pub1", 0, "10.0.1.10", &ec2types.InstanceNetworkInterfaceAssociation{PublicIp: aws.String("1.2.3.4")})
+		natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, []ec2types.InstanceNetworkInterface{eni})
 		mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) {
-			idx := atomic.AddInt32(&callIdx, 1)
-			if idx == 1 {
+			if len(params.InstanceIds) > 0 {
 				return describeResponse(workInst), nil
 			}
-			if params.Filters != nil {
-				for _, f := range params.Filters {
-					if aws.ToString(f.Name) == "tag:nat-zero:managed" {
-						return describeResponse(natInst), nil
-					}
+			for _, f := range params.Filters {
+				if aws.ToString(f.Name) == "tag:nat-zero:managed" {
+					return describeResponse(natInst), nil
 				}
 			}
-			// All findSiblings calls return a sibling
-			return describeResponse(sibInst), nil
+			return describeResponse(workInst), nil
+		}
+		mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) {
+			return &ec2.DescribeAddressesOutput{
+				Addresses: []ec2types.Address{
+					{AllocationId: aws.String("eipalloc-1")},
+					{AllocationId: aws.String("eipalloc-2")},
+				},
+			}, nil
 		}
 		h := newTestHandler(mock)
-		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "stopping"})
+		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "running"})
 		if err != nil {
 			t.Fatalf("unexpected error: %v", err)
 		}
-		if mock.callCount("StopInstances") != 0 {
-			t.Error("expected StopInstances NOT to be called")
+		// Only the extra EIP should be released (eips[1:])
+		if mock.callCount("ReleaseAddress") != 1 {
+			t.Errorf("expected ReleaseAddress=1 (extra EIP), got %d", mock.callCount("ReleaseAddress"))
 		}
 	})
+}
+
+// --- Reconcile: config version ---
 
-	t.Run("trigger instance excluded from siblings", func(t *testing.T) {
+func TestReconcileConfigVersion(t *testing.T) {
+	workTags := []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}}
+	natTags := []ec2types.Tag{
+		{Key: aws.String("nat-zero:managed"), Value: aws.String("true")},
+		{Key: aws.String("ConfigVersion"), Value: aws.String("old456")},
+	}
+
+	t.Run("outdated config triggers terminate", func(t *testing.T) {
 		mock := &mockEC2{}
-		// The trigger workload still shows as "running" due to EC2 eventual consistency
 		workInst := makeTestInstance("i-work1", "running", testVPC, testAZ, workTags, nil)
 		natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil)
-		var callIdx int32
 		mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) {
-			idx := atomic.AddInt32(&callIdx, 1)
-			if idx == 1 {
-				// classify: the trigger instance
+			if len(params.InstanceIds) > 0 {
 				return describeResponse(workInst), nil
 			}
-			if params.Filters != nil {
-				for _, f := range params.Filters {
-					if aws.ToString(f.Name) == "tag:nat-zero:managed" {
-						return describeResponse(natInst), nil
-					}
+			for _, f := range params.Filters {
+				if aws.ToString(f.Name) == "tag:nat-zero:managed" {
+					return describeResponse(natInst), nil
 				}
 			}
-			// findSiblings: the trigger instance shows as running (eventual consistency)
-			// but should be excluded by its ID
 			return describeResponse(workInst), nil
 		}
+		mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) {
+			return &ec2.DescribeAddressesOutput{}, nil
+		}
 		h := newTestHandler(mock)
-		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "shutting-down"})
+		h.ConfigVersion = "abc123"
+		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "running"})
 		if err != nil {
 			t.Fatalf("unexpected error: %v", err)
 		}
-		if mock.callCount("StopInstances") != 1 {
-			t.Errorf("expected StopInstances (trigger excluded from siblings), got %d", mock.callCount("StopInstances"))
+		if mock.callCount("TerminateInstances") != 1 {
+			t.Error("expected TerminateInstances (outdated config)")
+		}
+		// No immediate replacement — next event creates new
+		if mock.callCount("RunInstances") != 0 {
+			t.Error("expected no RunInstances (replacement deferred to next event)")
 		}
 	})
 
-	t.Run("pending NAT no siblings stops", func(t *testing.T) {
+	t.Run("current config is noop", func(t *testing.T) {
 		mock := &mockEC2{}
-		workInst := makeTestInstance("i-work1", "stopped", testVPC, testAZ, workTags, nil)
-		natInst := makeTestInstance("i-nat1", "pending", testVPC, testAZ, natTags, nil)
-		var callIdx int32
+		workInst := makeTestInstance("i-work1", "running", testVPC, testAZ, workTags, nil)
+		currentTags := []ec2types.Tag{
+			{Key: aws.String("nat-zero:managed"), Value: aws.String("true")},
+			{Key: aws.String("ConfigVersion"), Value: aws.String("abc123")},
+		}
+		eni := makeENI("eni-pub1", 0, "10.0.1.10", &ec2types.InstanceNetworkInterfaceAssociation{PublicIp: aws.String("1.2.3.4")})
+		natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, currentTags, []ec2types.InstanceNetworkInterface{eni})
 		mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) {
-			idx := atomic.AddInt32(&callIdx, 1)
-			if idx == 1 {
+			if len(params.InstanceIds) > 0 {
 				return describeResponse(workInst), nil
 			}
-			if params.Filters != nil {
-				for _, f := range params.Filters {
-					if aws.ToString(f.Name) == "tag:nat-zero:managed" {
-						return describeResponse(natInst), nil
-					}
+			for _, f := range params.Filters {
+				if aws.ToString(f.Name) == "tag:nat-zero:managed" {
+					return describeResponse(natInst), nil
 				}
 			}
-			return describeResponse(), nil
+			return describeResponse(workInst), nil
+		}
+		mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) {
+			return &ec2.DescribeAddressesOutput{
+				Addresses: []ec2types.Address{{AllocationId: aws.String("eipalloc-1")}},
+			}, nil
 		}
 		h := newTestHandler(mock)
-		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "stopped"})
+		h.ConfigVersion = "abc123"
+		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "running"})
 		if err != nil {
 			t.Fatalf("unexpected error: %v", err)
 		}
-		if mock.callCount("StopInstances") != 1 {
-			t.Errorf("expected StopInstances to be called once, got %d", mock.callCount("StopInstances"))
+		if mock.callCount("TerminateInstances") != 0 {
+			t.Error("expected no TerminateInstances")
+		}
+		if mock.callCount("RunInstances") != 0 {
+			t.Error("expected no RunInstances")
 		}
 	})
 }
 
-// --- Config version replacement ---
+// --- Reconcile: NAT event triggers reconcile ---
 
-func TestHandlerConfigVersion(t *testing.T) {
+func TestReconcileNATEvent(t *testing.T) {
+	natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}}
 	workTags := []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}}
-	natTags := []ec2types.Tag{
-		{Key: aws.String("nat-zero:managed"), Value: aws.String("true")},
-		{Key: aws.String("ConfigVersion"), Value: aws.String("old456")},
-	}
 
-	t.Run("outdated config triggers replace", func(t *testing.T) {
+	t.Run("NAT running event with workloads attaches EIP", func(t *testing.T) {
 		mock := &mockEC2{}
 		workInst := makeTestInstance("i-work1", "running", testVPC, testAZ, workTags, nil)
-		natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil)
-		var callIdx int32
+		eni := makeENI("eni-pub1", 0, "10.0.1.10", nil)
+		natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, []ec2types.InstanceNetworkInterface{eni})
 		mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) {
-			idx := atomic.AddInt32(&callIdx, 1)
-			if idx == 1 {
-				return describeResponse(workInst), nil
+			if len(params.InstanceIds) > 0 {
+				return describeResponse(natInst), nil // resolveAZ on NAT
 			}
-			if params.Filters != nil {
-				for _, f := range params.Filters {
-					if aws.ToString(f.Name) == "tag:nat-zero:managed" {
-						return describeResponse(natInst), nil
-					}
+			for _, f := range params.Filters {
+				if aws.ToString(f.Name) == "tag:nat-zero:managed" {
+					return describeResponse(natInst), nil
 				}
 			}
-			return describeResponse(makeTestInstance("i-nat1", "terminated", testVPC, testAZ, natTags, nil)), nil
+			return describeResponse(workInst), nil // workloads
+		}
+		mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) {
+			return &ec2.DescribeAddressesOutput{}, nil
+		}
+		mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) {
+			return &ec2.AllocateAddressOutput{AllocationId: aws.String("eipalloc-1"), PublicIp: aws.String("1.2.3.4")}, nil
+		}
+		mock.AssociateAddressFn = func(ctx context.Context, params *ec2.AssociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AssociateAddressOutput, error) {
+			return &ec2.AssociateAddressOutput{}, nil
+		}
+		h := newTestHandler(mock)
+		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-nat1", State: "running"})
+		if err != nil {
+			t.Fatalf("unexpected error: %v", err)
+		}
+		if mock.callCount("AllocateAddress") != 1 {
+			t.Error("expected AllocateAddress for NAT running event")
+		}
+	})
+
+	t.Run("NAT terminated event with workloads creates new", func(t *testing.T) {
+		mock := &mockEC2{}
+		workInst := makeTestInstance("i-work1", "running", testVPC, testAZ, workTags, nil)
+		natInst := makeTestInstance("i-nat1", "terminated", testVPC, testAZ, natTags, nil)
+		mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) {
+			if len(params.InstanceIds) > 0 {
+				return describeResponse(natInst), nil // resolveAZ
+			}
+			for _, f := range params.Filters {
+				if aws.ToString(f.Name) == "tag:nat-zero:managed" {
+					// findNATs: terminated NATs are filtered by state
+					return describeResponse(), nil
+				}
+			}
+			return describeResponse(workInst), nil
+		}
+		mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) {
+			return &ec2.DescribeAddressesOutput{}, nil
 		}
 		mock.DescribeLaunchTemplatesFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplatesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplatesOutput, error) {
 			return &ec2.DescribeLaunchTemplatesOutput{
@@ -718,83 +645,97 @@ func TestHandlerConfigVersion(t *testing.T) {
 			}, nil
 		}
 		h := newTestHandler(mock)
-		h.ConfigVersion = "abc123"
-		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "running"})
+		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-nat1", State: "terminated"})
 		if err != nil {
 			t.Fatalf("unexpected error: %v", err)
 		}
-		if mock.callCount("TerminateInstances") != 1 {
-			t.Error("expected TerminateInstances to be called (replace)")
-		}
 		if mock.callCount("RunInstances") != 1 {
-			t.Error("expected RunInstances to be called (create replacement)")
+			t.Error("expected RunInstances for terminated NAT with active workloads")
 		}
 	})
+}
+
+// --- Sweep all AZs ---
 
-	t.Run("missing config tag skips replace", func(t *testing.T) {
-		// When the ConfigVersion tag is absent (e.g. EC2 eventual consistency
-		// delay on a just-created instance, or an older NAT), there is nothing
-		// to compare against so isCurrentConfig returns true and no replacement
-		// happens.
+func TestSweepAllAZs(t *testing.T) {
+	t.Run("sweeps configured AZs", func(t *testing.T) {
 		mock := &mockEC2{}
-		workInst := makeTestInstance("i-work1", "running", testVPC, testAZ, workTags, nil)
-		noVersionTags := []ec2types.Tag{
-			{Key: aws.String("nat-zero:managed"), Value: aws.String("true")},
-			// No ConfigVersion tag
-		}
-		natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, noVersionTags, nil)
-		var callIdx int32
+		natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}}
+		natInst := makeTestInstance("i-nat1", "running", testVPC, "us-east-1a", natTags, nil)
+
 		mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) {
-			idx := atomic.AddInt32(&callIdx, 1)
-			if idx == 1 {
-				return describeResponse(workInst), nil
+			if len(params.InstanceIds) > 0 {
+				// resolveAZ: instance gone
+				return describeResponse(), nil
 			}
-			return describeResponse(natInst), nil
+			for _, f := range params.Filters {
+				if aws.ToString(f.Name) == "tag:nat-zero:managed" {
+					return describeResponse(natInst), nil
+				}
+			}
+			// workloads: none
+			return describeResponse(), nil
+		}
+		mock.DescribeLaunchTemplatesFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplatesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplatesOutput, error) {
+			return &ec2.DescribeLaunchTemplatesOutput{
+				LaunchTemplates: []ec2types.LaunchTemplate{{
+					LaunchTemplateId: aws.String("lt-1a"),
+					Tags: []ec2types.Tag{
+						{Key: aws.String("AvailabilityZone"), Value: aws.String("us-east-1a")},
+						{Key: aws.String("VpcId"), Value: aws.String(testVPC)},
+					},
+				}},
+			}, nil
+		}
+		mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) {
+			return &ec2.DescribeAddressesOutput{}, nil
 		}
 		h := newTestHandler(mock)
-		h.ConfigVersion = "abc123"
-		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "running"})
+		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-gone", State: "terminated"})
 		if err != nil {
 			t.Fatalf("unexpected error: %v", err)
 		}
-		if mock.callCount("TerminateInstances") != 0 {
-			t.Error("expected TerminateInstances NOT to be called when tag is missing")
-		}
-		if mock.callCount("RunInstances") != 0 {
-			t.Error("expected RunInstances NOT to be called when tag is missing")
+		// Should stop the idle NAT in us-east-1a
+		if mock.callCount("StopInstances") != 1 {
+			t.Errorf("expected StopInstances=1 (sweep), got %d", mock.callCount("StopInstances"))
 		}
 	})
+}
 
-	t.Run("current config is noop", func(t *testing.T) {
+// --- Duplicate NAT ---
+
+func TestReconcileDuplicateNATs(t *testing.T) {
+	t.Run("deduplicates NATs", func(t *testing.T) {
 		mock := &mockEC2{}
-		workInst := makeTestInstance("i-work1", "pending", testVPC, testAZ, workTags, nil)
-		currentTags := []ec2types.Tag{
-			{Key: aws.String("nat-zero:managed"), Value: aws.String("true")},
-			{Key: aws.String("ConfigVersion"), Value: aws.String("abc123")},
-		}
-		natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, currentTags, nil)
-		var callIdx int32
+		workTags := []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}}
+		natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}}
+		workInst := makeTestInstance("i-work1", "running", testVPC, testAZ, workTags, nil)
+		eni := makeENI("eni-pub1", 0, "10.0.1.10", &ec2types.InstanceNetworkInterfaceAssociation{PublicIp: aws.String("1.2.3.4")})
+		nat1 := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, []ec2types.InstanceNetworkInterface{eni})
+		nat2 := makeTestInstance("i-nat2", "running", testVPC, testAZ, natTags, nil)
 		mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) {
-			idx := atomic.AddInt32(&callIdx, 1)
-			if idx == 1 {
+			if len(params.InstanceIds) > 0 {
 				return describeResponse(workInst), nil
 			}
-			return describeResponse(natInst), nil
+			for _, f := range params.Filters {
+				if aws.ToString(f.Name) == "tag:nat-zero:managed" {
+					return describeResponse(nat1, nat2), nil
+				}
+			}
+			return describeResponse(workInst), nil
+		}
+		mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) {
+			return &ec2.DescribeAddressesOutput{
+				Addresses: []ec2types.Address{{AllocationId: aws.String("eipalloc-1")}},
+			}, nil
 		}
 		h := newTestHandler(mock)
-		h.ConfigVersion = "abc123"
-		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "pending"})
+		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "running"})
 		if err != nil {
 			t.Fatalf("unexpected error: %v", err)
 		}
-		if mock.callCount("RunInstances") != 0 {
-			t.Error("expected RunInstances NOT to be called")
-		}
-		if mock.callCount("StartInstances") != 0 {
-			t.Error("expected StartInstances NOT to be called")
-		}
-		if mock.callCount("TerminateInstances") != 0 {
-			t.Error("expected TerminateInstances NOT to be called")
+		if mock.callCount("TerminateInstances") != 1 {
+			t.Errorf("expected TerminateInstances=1 (duplicate), got %d", mock.callCount("TerminateInstances"))
 		}
 	})
 }
diff --git a/cmd/lambda/mock_test.go b/cmd/lambda/mock_test.go
index 0a20583..8fadd16 100644
--- a/cmd/lambda/mock_test.go
+++ b/cmd/lambda/mock_test.go
@@ -3,7 +3,6 @@ package main
 import (
 	"context"
 	"sync"
-	"time"
 
 	"github.com/aws/aws-sdk-go-v2/aws"
 	"github.com/aws/aws-sdk-go-v2/service/ec2"
@@ -224,6 +223,5 @@ func newTestHandler(mock *mockEC2) *Handler {
 		AMIOwner:      "568608671756",
 		AMIPattern:    "fck-nat-al2023-*-arm64-*",
 		ConfigVersion: "",
-		SleepFunc:     func(d time.Duration) {}, // no-op sleep
 	}
 }
diff --git a/cmd/lambda/race_test.go b/cmd/lambda/race_test.go
deleted file mode 100644
index f81b8e1..0000000
--- a/cmd/lambda/race_test.go
+++ /dev/null
@@ -1,821 +0,0 @@
-package main
-
-import (
-	"context"
-	"fmt"
-	"sync/atomic"
-	"testing"
-
-	"github.com/aws/aws-sdk-go-v2/aws"
-	"github.com/aws/aws-sdk-go-v2/service/ec2"
-	ec2types "github.com/aws/aws-sdk-go-v2/service/ec2/types"
-	"github.com/aws/smithy-go"
-)
-
-// =============================================================================
-// Race Condition Tests
-//
-// Each TestRace_R* function documents and verifies the behavior of a specific
-// race condition identified in the nat-zero Lambda. Race conditions arise from:
-// - Multiple concurrent Lambda invocations from overlapping EventBridge events
-// - EC2 API eventual consistency (state changes not immediately visible)
-//
-// See docs/ARCHITECTURE.md "Race Conditions" section for the full catalog.
-// =============================================================================
-
-// TestRace_R1_StaleSiblingEventualConsistency verifies the retry logic in
-// maybeStopNAT when EC2 eventual consistency causes a dying workload to still
-// appear as "running" in DescribeInstances.
-//
-// Race scenario:
-// - Workload i-dying fires shutting-down event
-// - Lambda calls findSiblings, but EC2 API still returns i-dying as "running"
-// - Without mitigation, the NAT would never be stopped
-//
-// Mitigation: maybeStopNAT excludes the trigger instance ID from siblings AND
-// retries up to 3 times with 2s sleep between attempts.
-func TestRace_R1_StaleSiblingEventualConsistency(t *testing.T) {
-	t.Run("trigger excluded from siblings on first attempt", func(t *testing.T) {
-		mock := &mockEC2{}
-		workTags := []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}}
-		natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}}
-		workInst := makeTestInstance("i-dying", "running", testVPC, testAZ, workTags, nil)
-		natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil)
-
-		var callIdx int32
-		mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) {
-			idx := atomic.AddInt32(&callIdx, 1)
-			if idx == 1 {
-				// classify: returns the workload instance
-				return describeResponse(workInst), nil
-			}
-			if params.Filters != nil {
-				for _, f := range params.Filters {
-					if aws.ToString(f.Name) == "tag:nat-zero:managed" {
-						return describeResponse(natInst), nil
-					}
-				}
-			}
-			// findSiblings: trigger still appears as running (eventual consistency)
-			// but should be excluded by excludeID
-			return describeResponse(workInst), nil
-		}
-		h := newTestHandler(mock)
-		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-dying", State: "shutting-down"})
-		if err != nil {
-			t.Fatalf("unexpected error: %v", err)
-		}
-		if mock.callCount("StopInstances") != 1 {
-			t.Errorf("expected StopInstances=1 (trigger excluded), got %d", mock.callCount("StopInstances"))
-		}
-	})
-
-	t.Run("other stale sibling clears on retry", func(t *testing.T) {
-		mock := &mockEC2{}
-		workTags := []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}}
-		natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}}
-		workInst := makeTestInstance("i-dying", "stopping", testVPC, testAZ, workTags, nil)
-		natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil)
-		staleInst := makeTestInstance("i-stale", "running", testVPC, testAZ, workTags, nil)
-
-		var callIdx int32
-		mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) {
-			idx := atomic.AddInt32(&callIdx, 1)
-			if idx == 1 {
-				return describeResponse(workInst), nil
-			}
-			if params.Filters != nil {
-				for _, f := range params.Filters {
-					if aws.ToString(f.Name) == "tag:nat-zero:managed" {
-						return describeResponse(natInst), nil
-					}
-				}
-			}
-			// First findSiblings call: stale sibling still running
-			// Subsequent calls: sibling gone (EC2 caught up)
-			if idx <= 3 {
-				return describeResponse(staleInst), nil
-			}
-			return describeResponse(), nil
-		}
-		h := newTestHandler(mock)
-		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-dying", State: "stopping"})
-		if err != nil {
-			t.Fatalf("unexpected error: %v", err)
-		}
-		if mock.callCount("StopInstances") != 1 {
-			t.Errorf("expected StopInstances=1 after retry succeeds, got %d", mock.callCount("StopInstances"))
-		}
-	})
-}
-
-// TestRace_R2_TerminatedInstanceGoneFromAPI verifies the sweepIdleNATs fallback
-// when classify returns ignore=true because the terminated instance has already
-// been purged from the EC2 API.
-//
-// Race scenario:
-// - Workload terminates and EventBridge fires "terminated" event
-// - By the time Lambda calls DescribeInstances, the instance is gone
-// - classify returns ignore=true, normal scale-down path is skipped
-//
-// Mitigation: handler detects isTerminating(state) + ignore and calls
-// sweepIdleNATs to check all NATs in the VPC for idle ones.
-func TestRace_R2_TerminatedInstanceGoneFromAPI(t *testing.T) {
-	t.Run("sweep stops idle NAT when trigger gone", func(t *testing.T) {
-		// Already covered by handler_test.go "terminated event sweeps idle NATs"
-		// This variant ensures the sweep mechanism works end-to-end.
-		mock := &mockEC2{}
-		natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}}
-		natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil)
-		var callIdx int32
-		mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) {
-			idx := atomic.AddInt32(&callIdx, 1)
-			if idx == 1 {
-				// classify: instance gone
-				return describeResponse(), nil
-			}
-			if params.Filters != nil {
-				for _, f := range params.Filters {
-					if aws.ToString(f.Name) == "tag:nat-zero:managed" {
-						return describeResponse(natInst), nil
-					}
-				}
-			}
-			// findSiblings: no siblings
-			return describeResponse(), nil
-		}
-		h := newTestHandler(mock)
-		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-gone", State: "terminated"})
-		if err != nil {
-			t.Fatalf("unexpected error: %v", err)
-		}
-		if mock.callCount("StopInstances") != 1 {
-			t.Errorf("expected sweep to stop idle NAT, got StopInstances=%d", mock.callCount("StopInstances"))
-		}
-	})
-
-	t.Run("sweep handles multiple NATs across AZs", func(t *testing.T) {
-		mock := &mockEC2{}
-		natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}}
-		workTags := []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}}
-		natAZ1 := makeTestInstance("i-nat-az1", "running", testVPC, "us-east-1a", natTags, nil)
-		natAZ2 := makeTestInstance("i-nat-az2", "running", testVPC, "us-east-1b", natTags, nil)
-		sibAZ2 := makeTestInstance("i-sib-az2", "running", testVPC, "us-east-1b", workTags, nil)
-
-		var callIdx int32
-		mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) {
-			idx := atomic.AddInt32(&callIdx, 1)
-			if idx == 1 {
-				// classify: instance gone
-				return describeResponse(), nil
-			}
-			if params.Filters != nil {
-				for _, f := range params.Filters {
-					if aws.ToString(f.Name) == "tag:nat-zero:managed" {
-						// sweep: both NATs found
-						return describeResponse(natAZ1, natAZ2), nil
-					}
-				}
-				// findSiblings: check AZ filter
-				for _, f := range params.Filters {
-					if aws.ToString(f.Name) == "availability-zone" {
-						if f.Values[0] == "us-east-1b" {
-							return describeResponse(sibAZ2), nil
-						}
-						return describeResponse(), nil
-					}
-				}
-			}
-			return describeResponse(), nil
-		}
-		h := newTestHandler(mock)
-		err := h.HandleRequest(context.Background(), Event{InstanceID: "i-gone", State: "shutting-down"})
-		if err != nil {
-			t.Fatalf("unexpected error: %v", err)
-		}
-		// Only NAT in AZ1 should be stopped (AZ2 has a sibling)
-		if mock.callCount("StopInstances") != 1 {
-			t.Errorf("expected 1 StopInstances (only idle NAT), got %d", mock.callCount("StopInstances"))
-		}
-	})
-}
-
-// TestRace_R3_RetryExhaustion verifies the accepted risk when EC2 eventual
-// consistency takes longer than the retry budget (3 attempts x 2s = 6s).
-//
-// Race scenario:
-// - A sibling workload is shutting down but EC2 API never reflects the change
-// within the retry window
-// - findSiblings persistently returns a stale sibling on all 3 attempts
-//
-// Accepted risk: NAT stays running. The next scale-down event or sweepIdleNATs
-// will eventually catch it.
-func TestRace_R3_RetryExhaustion(t *testing.T) {
-	mock := &mockEC2{}
-	workTags := []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}}
-	natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}}
-	workInst := makeTestInstance("i-dying", "stopping", testVPC, testAZ, workTags, nil)
-	natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil)
-	staleInst := makeTestInstance("i-stale", "running", testVPC, testAZ, workTags, nil)
-
-	var callIdx int32
-	mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) {
-		idx := atomic.AddInt32(&callIdx, 1)
-		if idx == 1 {
-			return describeResponse(workInst), nil
-		}
-		if params.Filters != nil {
-			for _, f := range params.Filters {
-				if aws.ToString(f.Name) == "tag:nat-zero:managed" {
-					return describeResponse(natInst), nil
-				}
-			}
-		}
-		// All 3 findSiblings attempts: stale sibling persists
-		return describeResponse(staleInst), nil
-	}
-	h := newTestHandler(mock)
-	err := h.HandleRequest(context.Background(), Event{InstanceID: "i-dying", State: "stopping"})
-	if err != nil {
-		t.Fatalf("unexpected error: %v", err)
-	}
-	// Accepted risk: NAT kept alive because stale sibling never cleared
-	if mock.callCount("StopInstances") != 0 {
-		t.Error("expected StopInstances NOT called (retry exhaustion, accepted risk)")
-	}
-}
-
-// TestRace_R4_DuplicateNATCreation verifies the reactive deduplication in
-// findNAT when two concurrent Lambda invocations both create a NAT instance.
-//
-// Race scenario:
-// - Two workload pending events arrive simultaneously
-// - Both Lambda invocations call findNAT → nil, both call createNAT
-// - Two NAT instances now exist in the same AZ
-//
-// Mitigation: findNAT detects multiple NATs, keeps the first running one,
-// and terminates the extras via TerminateInstances.
-func TestRace_R4_DuplicateNATCreation(t *testing.T) { - t.Run("two running NATs deduplicates to one", func(t *testing.T) { - mock := &mockEC2{} - nat1 := makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, nil) - nat2 := makeTestInstance("i-nat2", "running", testVPC, testAZ, nil, nil) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(nat1, nat2), nil - } - h := newTestHandler(mock) - result := h.findNAT(context.Background(), testAZ, testVPC) - if result == nil { - t.Fatal("expected a NAT to be returned") - } - if result.InstanceID != "i-nat1" { - t.Errorf("expected first running NAT i-nat1, got %s", result.InstanceID) - } - if mock.callCount("TerminateInstances") != 1 { - t.Errorf("expected 1 TerminateInstances (extra NAT), got %d", mock.callCount("TerminateInstances")) - } - }) - - t.Run("running NAT preferred over stopped", func(t *testing.T) { - mock := &mockEC2{} - stopped := makeTestInstance("i-stopped", "stopped", testVPC, testAZ, nil, nil) - running := makeTestInstance("i-running", "running", testVPC, testAZ, nil, nil) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(stopped, running), nil - } - h := newTestHandler(mock) - result := h.findNAT(context.Background(), testAZ, testVPC) - if result == nil || result.InstanceID != "i-running" { - t.Errorf("expected running NAT to be kept, got %v", result) - } - if mock.callCount("TerminateInstances") != 1 { - t.Errorf("expected 1 TerminateInstances, got %d", mock.callCount("TerminateInstances")) - } - }) - - t.Run("three NATs terminates two extras", func(t *testing.T) { - mock := &mockEC2{} - nat1 := makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, nil) - nat2 := makeTestInstance("i-nat2", "running", testVPC, testAZ, nil, nil) 
- nat3 := makeTestInstance("i-nat3", "pending", testVPC, testAZ, nil, nil) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(nat1, nat2, nat3), nil - } - h := newTestHandler(mock) - result := h.findNAT(context.Background(), testAZ, testVPC) - if result == nil || result.InstanceID != "i-nat1" { - t.Errorf("expected first running NAT kept, got %v", result) - } - if mock.callCount("TerminateInstances") != 2 { - t.Errorf("expected 2 TerminateInstances (two extras), got %d", mock.callCount("TerminateInstances")) - } - }) -} - -// TestRace_R5_StartStopOverlap verifies behavior when a scale-up event fires -// while a concurrent scale-down is stopping the NAT. -// -// Race scenario: -// - Scale-down Lambda invocation calls StopInstances on the NAT -// - New workload pending event fires, Lambda sees NAT in "stopping" state -// - ensureNAT sees isStopping → calls startNAT -// - startNAT waits for "stopped" then calls StartInstances -// -// Accepted risk: Brief delay while NAT transitions stopping→stopped→starting. 
-func TestRace_R5_StartStopOverlap(t *testing.T) { - t.Run("stopping NAT waits then starts", func(t *testing.T) { - mock := &mockEC2{} - workTags := []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}} - natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} - workInst := makeTestInstance("i-work1", "pending", testVPC, testAZ, workTags, nil) - stoppingNAT := makeTestInstance("i-nat1", "stopping", testVPC, testAZ, natTags, nil) - stoppedNAT := makeTestInstance("i-nat1", "stopped", testVPC, testAZ, natTags, nil) - - var callIdx int32 - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - idx := atomic.AddInt32(&callIdx, 1) - if idx == 1 { - // classify: workload instance - return describeResponse(workInst), nil - } - if params.Filters != nil { - // findNAT: NAT is stopping - return describeResponse(stoppingNAT), nil - } - // waitForState in startNAT: first call returns stopping, second returns stopped - if idx <= 3 { - return describeResponse(stoppingNAT), nil - } - return describeResponse(stoppedNAT), nil - } - mock.StartInstancesFn = func(ctx context.Context, params *ec2.StartInstancesInput, optFns ...func(*ec2.Options)) (*ec2.StartInstancesOutput, error) { - return &ec2.StartInstancesOutput{}, nil - } - h := newTestHandler(mock) - err := h.HandleRequest(context.Background(), Event{InstanceID: "i-work1", State: "pending"}) - if err != nil { - t.Fatalf("unexpected error: %v", err) - } - if mock.callCount("StartInstances") != 1 { - t.Errorf("expected StartInstances=1 (NAT restarted after stop), got %d", mock.callCount("StartInstances")) - } - }) -} - -// TestRace_R6_DoubleEIPAllocation verifies the race-detection re-check in -// attachEIP when two concurrent Lambda invocations (from pending + running -// events) both try to allocate an EIP for the same NAT. 
-// -// Race scenario: -// - NAT pending event → Lambda A calls attachEIP, allocates EIP-A -// - NAT running event → Lambda B calls attachEIP, allocates EIP-B -// - Both try to associate to the same ENI -// -// Mitigation: After AllocateAddress, attachEIP re-checks the ENI via -// DescribeNetworkInterfaces. If another EIP is already associated, it releases -// the duplicate allocation. -func TestRace_R6_DoubleEIPAllocation(t *testing.T) { - t.Run("re-check detects existing EIP and releases duplicate", func(t *testing.T) { - mock := &mockEC2{} - eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil - } - mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { - return &ec2.AllocateAddressOutput{AllocationId: aws.String("eipalloc-dup"), PublicIp: aws.String("2.2.2.2")}, nil - } - // Re-check: another invocation already attached an EIP - mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { - return &ec2.DescribeNetworkInterfacesOutput{ - NetworkInterfaces: []ec2types.NetworkInterface{{ - NetworkInterfaceId: aws.String("eni-pub1"), - Association: &ec2types.NetworkInterfaceAssociation{ - PublicIp: aws.String("1.1.1.1"), // already attached by other invocation - }, - }}, - }, nil - } - h := newTestHandler(mock) - h.attachEIP(context.Background(), "i-nat1", testAZ) - - if mock.callCount("AssociateAddress") != 0 { - t.Error("expected no AssociateAddress (race detected)") - } - if mock.callCount("ReleaseAddress") != 1 { - t.Errorf("expected ReleaseAddress=1 (duplicate 
released), got %d", mock.callCount("ReleaseAddress")) - } - }) - - t.Run("associate fails also releases duplicate", func(t *testing.T) { - mock := &mockEC2{} - eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil - } - mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { - return &ec2.AllocateAddressOutput{AllocationId: aws.String("eipalloc-dup"), PublicIp: aws.String("2.2.2.2")}, nil - } - // Re-check: no EIP yet (race window) - mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { - return &ec2.DescribeNetworkInterfacesOutput{ - NetworkInterfaces: []ec2types.NetworkInterface{{ - NetworkInterfaceId: aws.String("eni-pub1"), - }}, - }, nil - } - // Associate fails (e.g. other invocation won the race) - mock.AssociateAddressFn = func(ctx context.Context, params *ec2.AssociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AssociateAddressOutput, error) { - return nil, fmt.Errorf("Resource.AlreadyAssociated: EIP already associated") - } - h := newTestHandler(mock) - h.attachEIP(context.Background(), "i-nat1", testAZ) - - if mock.callCount("ReleaseAddress") != 1 { - t.Errorf("expected ReleaseAddress=1 (orphaned alloc released), got %d", mock.callCount("ReleaseAddress")) - } - }) -} - -// TestRace_R7_AssociateFailsAfterRecheck verifies that when the re-check shows -// no EIP but AssociateAddress still fails (another invocation raced between -// re-check and associate), the allocated EIP is properly released. 
-// -// Race scenario: -// - Lambda A: AllocateAddress → re-check ENI → no EIP → AssociateAddress -// - Lambda B: between A's re-check and associate, B associates its own EIP -// - Lambda A: AssociateAddress fails -// -// Mitigation: attachEIP releases the allocated EIP on AssociateAddress failure. -func TestRace_R7_AssociateFailsAfterRecheck(t *testing.T) { - mock := &mockEC2{} - eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil - } - mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { - return &ec2.AllocateAddressOutput{AllocationId: aws.String("eipalloc-orphan"), PublicIp: aws.String("3.3.3.3")}, nil - } - // Re-check: no EIP (race window still open) - mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { - return &ec2.DescribeNetworkInterfacesOutput{ - NetworkInterfaces: []ec2types.NetworkInterface{{ - NetworkInterfaceId: aws.String("eni-pub1"), - }}, - }, nil - } - // Associate fails: another invocation raced us - mock.AssociateAddressFn = func(ctx context.Context, params *ec2.AssociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AssociateAddressOutput, error) { - return nil, fmt.Errorf("InvalidParameterValue: EIP already in use") - } - h := newTestHandler(mock) - h.attachEIP(context.Background(), "i-nat1", testAZ) - - if mock.callCount("ReleaseAddress") != 1 { - t.Errorf("expected ReleaseAddress=1 (orphaned allocation), got %d", mock.callCount("ReleaseAddress")) - } -} - -// apiError implements smithy.APIError for test use. 
-type apiError struct { - code string - message string -} - -func (e *apiError) Error() string { return e.message } -func (e *apiError) ErrorCode() string { return e.code } -func (e *apiError) ErrorMessage() string { return e.message } -func (e *apiError) ErrorFault() smithy.ErrorFault { return smithy.FaultServer } - -// Ensure apiError satisfies the smithy.APIError interface. -var _ smithy.APIError = (*apiError)(nil) - -// TestRace_R8_DisassociateAlreadyRemoved verifies that detachEIP handles the -// case where EC2 auto-disassociates the EIP when the instance stops, before -// Lambda's detachEIP runs. -// -// Race scenario: -// - NAT instance stops, EC2 auto-disassociates the EIP from the ENI -// - Lambda's detachEIP fires, gets stale association data from DescribeNetworkInterfaces -// - DisassociateAddress returns InvalidAssociationID.NotFound -// -// Mitigation: detachEIP catches InvalidAssociationID.NotFound and still proceeds -// to release the EIP allocation. -func TestRace_R8_DisassociateAlreadyRemoved(t *testing.T) { - mock := &mockEC2{} - eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(makeTestInstance("i-nat1", "stopped", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil - } - // ENI still shows stale association data - mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { - return &ec2.DescribeNetworkInterfacesOutput{ - NetworkInterfaces: []ec2types.NetworkInterface{{ - NetworkInterfaceId: aws.String("eni-pub1"), - Association: &ec2types.NetworkInterfaceAssociation{ - AssociationId: aws.String("eipassoc-stale"), - AllocationId: aws.String("eipalloc-1"), - PublicIp: aws.String("1.2.3.4"), - }, - }}, - }, nil - } - // 
Disassociate fails: EC2 already removed it - mock.DisassociateAddressFn = func(ctx context.Context, params *ec2.DisassociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.DisassociateAddressOutput, error) { - return nil, &apiError{code: "InvalidAssociationID.NotFound", message: "Association not found"} - } - mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { - return &ec2.DescribeAddressesOutput{Addresses: []ec2types.Address{}}, nil - } - h := newTestHandler(mock) - h.detachEIP(context.Background(), "i-nat1", testAZ) - - if mock.callCount("DisassociateAddress") != 1 { - t.Errorf("expected DisassociateAddress=1 (attempted), got %d", mock.callCount("DisassociateAddress")) - } - // Critical: ReleaseAddress must still be called despite disassociate "failure" - if mock.callCount("ReleaseAddress") != 1 { - t.Errorf("expected ReleaseAddress=1 (EIP freed despite NotFound), got %d", mock.callCount("ReleaseAddress")) - } -} - -// TestRace_R9_DisassociateNonNotFoundError verifies the current behavior when -// DisassociateAddress fails with a non-NotFound error (e.g. throttling). -// -// Race scenario: -// - Lambda calls DisassociateAddress but gets throttled -// - detachEIP returns early without releasing the EIP allocation -// -// UNMITIGATED in this invocation: the EIP is leaked until the orphan sweep in -// a subsequent detachEIP invocation cleans it up. 
-func TestRace_R9_DisassociateNonNotFoundError(t *testing.T) { - t.Run("throttle error skips release (documents gap)", func(t *testing.T) { - mock := &mockEC2{} - eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(makeTestInstance("i-nat1", "stopped", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil - } - mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { - return &ec2.DescribeNetworkInterfacesOutput{ - NetworkInterfaces: []ec2types.NetworkInterface{{ - NetworkInterfaceId: aws.String("eni-pub1"), - Association: &ec2types.NetworkInterfaceAssociation{ - AssociationId: aws.String("eipassoc-1"), - AllocationId: aws.String("eipalloc-1"), - PublicIp: aws.String("1.2.3.4"), - }, - }}, - }, nil - } - // DisassociateAddress fails with throttle (not NotFound) - mock.DisassociateAddressFn = func(ctx context.Context, params *ec2.DisassociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.DisassociateAddressOutput, error) { - return nil, fmt.Errorf("Throttling: Rate exceeded") - } - h := newTestHandler(mock) - h.detachEIP(context.Background(), "i-nat1", testAZ) - - // Current behavior: returns early, ReleaseAddress NOT called (the gap) - if mock.callCount("ReleaseAddress") != 0 { - t.Error("expected ReleaseAddress=0 (current behavior: early return on non-NotFound error)") - } - // Orphan sweep also skipped because we returned early - if mock.callCount("DescribeAddresses") != 0 { - t.Error("expected DescribeAddresses=0 (orphan sweep skipped due to early return)") - } - }) - - t.Run("orphan sweep cleans up on next successful detach", func(t *testing.T) { - mock := &mockEC2{} - eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) - 
mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(makeTestInstance("i-nat1", "stopped", testVPC, testAZ, nil, []ec2types.InstanceNetworkInterface{eni})), nil - } - // No current association (already gone) - mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { - return &ec2.DescribeNetworkInterfacesOutput{ - NetworkInterfaces: []ec2types.NetworkInterface{{ - NetworkInterfaceId: aws.String("eni-pub1"), - }}, - }, nil - } - // Orphan sweep finds the leaked EIP from previous failed detach - mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { - return &ec2.DescribeAddressesOutput{ - Addresses: []ec2types.Address{{ - AllocationId: aws.String("eipalloc-leaked"), - }}, - }, nil - } - h := newTestHandler(mock) - h.detachEIP(context.Background(), "i-nat1", testAZ) - - // Orphan sweep cleans up the leaked EIP - if mock.callCount("ReleaseAddress") != 1 { - t.Errorf("expected ReleaseAddress=1 (orphan sweep), got %d", mock.callCount("ReleaseAddress")) - } - }) -} - -// TestRace_R10_ENIAvailabilityTimeout verifies the behavior when ENIs never -// reach "available" status during replaceNAT, e.g. due to EC2 API delays. -// -// Race scenario: -// - replaceNAT terminates old NAT and waits for ENIs to become "available" -// - DescribeNetworkInterfaces keeps returning "in-use" (EC2 delay) -// - Wait loop exhausts all 60 iterations -// -// Accepted risk: createNAT proceeds anyway. The launch template may fail to -// attach the ENI, but the next workload event will retry. 
-func TestRace_R10_ENIAvailabilityTimeout(t *testing.T) { - mock := &mockEC2{} - natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} - - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - // All getInstance calls return terminated (waitForTermination succeeds immediately) - return describeResponse(makeTestInstance("i-old", "terminated", testVPC, testAZ, natTags, nil)), nil - } - // ENI never becomes available (always in-use) - mock.DescribeNetworkInterfacesFn = func(ctx context.Context, params *ec2.DescribeNetworkInterfacesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeNetworkInterfacesOutput, error) { - return &ec2.DescribeNetworkInterfacesOutput{ - NetworkInterfaces: []ec2types.NetworkInterface{ - {NetworkInterfaceId: aws.String("eni-1"), Status: ec2types.NetworkInterfaceStatusInUse}, - }, - }, nil - } - mock.DescribeLaunchTemplatesFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplatesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplatesOutput, error) { - return &ec2.DescribeLaunchTemplatesOutput{ - LaunchTemplates: []ec2types.LaunchTemplate{{LaunchTemplateId: aws.String("lt-123")}}, - }, nil - } - mock.DescribeLaunchTemplateVersionsFn = func(ctx context.Context, params *ec2.DescribeLaunchTemplateVersionsInput, optFns ...func(*ec2.Options)) (*ec2.DescribeLaunchTemplateVersionsOutput, error) { - return &ec2.DescribeLaunchTemplateVersionsOutput{ - LaunchTemplateVersions: []ec2types.LaunchTemplateVersion{{ - LaunchTemplateId: aws.String("lt-123"), VersionNumber: aws.Int64(1), - }}, - }, nil - } - mock.DescribeImagesFn = func(ctx context.Context, params *ec2.DescribeImagesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeImagesOutput, error) { - return &ec2.DescribeImagesOutput{Images: []ec2types.Image{}}, nil - } - mock.RunInstancesFn = func(ctx context.Context, params 
*ec2.RunInstancesInput, optFns ...func(*ec2.Options)) (*ec2.RunInstancesOutput, error) { - return &ec2.RunInstancesOutput{ - Instances: []ec2types.Instance{{InstanceId: aws.String("i-new")}}, - }, nil - } - - h := newTestHandler(mock) - eni := makeENI("eni-1", 0, "10.0.1.10", nil) - inst := &Instance{ - InstanceID: "i-old", - StateName: "running", - NetworkInterfaces: []ec2types.InstanceNetworkInterface{eni}, - } - result := h.replaceNAT(context.Background(), inst, testAZ, testVPC) - - // createNAT still called despite ENI timeout (accepted risk) - if result != "i-new" { - t.Errorf("expected createNAT to proceed despite ENI timeout, got %q", result) - } - if mock.callCount("RunInstances") != 1 { - t.Errorf("expected RunInstances=1 (createNAT proceeded), got %d", mock.callCount("RunInstances")) - } - // ENI wait should have polled repeatedly before giving up (exact count not asserted) - if mock.callCount("DescribeNetworkInterfaces") < 2 { - t.Errorf("expected multiple DescribeNetworkInterfaces polls, got %d", mock.callCount("DescribeNetworkInterfaces")) - } -} - -// TestRace_R11_EIPOrphanOnNATTermination verifies that when a NAT instance is -// terminated (not stopped), orphan EIPs are cleaned up via sweepOrphanEIPs. -// -// Race scenario: -// - NAT is terminated by replaceNAT, spot reclaim, or manual action -// - No stopping/stopped EventBridge events fire, so detachEIP never runs -// - The EIP allocation leaks (still allocated, no longer associated) -// -// Mitigation: handler detects isTerminating(state) for NAT events and calls -// sweepOrphanEIPs to release any EIPs tagged for that AZ. 
-func TestRace_R11_EIPOrphanOnNATTermination(t *testing.T) { - t.Run("shutting-down NAT sweeps orphan EIPs", func(t *testing.T) { - mock := &mockEC2{} - natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} - natInst := makeTestInstance("i-nat1", "shutting-down", testVPC, testAZ, natTags, nil) - - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(natInst), nil - } - // Orphan EIP left behind from the now-terminating NAT - mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { - return &ec2.DescribeAddressesOutput{ - Addresses: []ec2types.Address{{ - AllocationId: aws.String("eipalloc-orphan"), - }}, - }, nil - } - h := newTestHandler(mock) - err := h.HandleRequest(context.Background(), Event{InstanceID: "i-nat1", State: "shutting-down"}) - if err != nil { - t.Fatalf("unexpected error: %v", err) - } - if mock.callCount("ReleaseAddress") != 1 { - t.Errorf("expected ReleaseAddress=1 (orphan EIP cleaned up), got %d", mock.callCount("ReleaseAddress")) - } - }) - - t.Run("terminated NAT sweeps orphan EIPs", func(t *testing.T) { - mock := &mockEC2{} - natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} - natInst := makeTestInstance("i-nat1", "terminated", testVPC, testAZ, natTags, nil) - - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(natInst), nil - } - mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { - return &ec2.DescribeAddressesOutput{ - Addresses: []ec2types.Address{{ - AllocationId: 
aws.String("eipalloc-orphan"), - AssociationId: aws.String("eipassoc-stale"), - }}, - }, nil - } - h := newTestHandler(mock) - err := h.HandleRequest(context.Background(), Event{InstanceID: "i-nat1", State: "terminated"}) - if err != nil { - t.Fatalf("unexpected error: %v", err) - } - // Should disassociate (stale) then release - if mock.callCount("DisassociateAddress") != 1 { - t.Errorf("expected DisassociateAddress=1, got %d", mock.callCount("DisassociateAddress")) - } - if mock.callCount("ReleaseAddress") != 1 { - t.Errorf("expected ReleaseAddress=1, got %d", mock.callCount("ReleaseAddress")) - } - }) - - t.Run("no orphan EIPs is noop", func(t *testing.T) { - mock := &mockEC2{} - natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} - natInst := makeTestInstance("i-nat1", "shutting-down", testVPC, testAZ, natTags, nil) - - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - return describeResponse(natInst), nil - } - mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { - return &ec2.DescribeAddressesOutput{Addresses: []ec2types.Address{}}, nil - } - h := newTestHandler(mock) - err := h.HandleRequest(context.Background(), Event{InstanceID: "i-nat1", State: "shutting-down"}) - if err != nil { - t.Fatalf("unexpected error: %v", err) - } - if mock.callCount("ReleaseAddress") != 0 { - t.Error("expected no ReleaseAddress when no orphans") - } - }) -} - -// TestRace_R12_SweepIdleNATsLacksRetry documents that sweepIdleNATs calls -// findSiblings once per NAT without the retry loop that maybeStopNAT uses. 
-// -// Race scenario: -// - sweepIdleNATs fires (R2 fallback: classify can't find trigger instance) -// - findSiblings returns a stale sibling due to EC2 eventual consistency -// - NAT is not stopped because it appears to have active workloads -// -// Accepted risk: sweepIdleNATs is itself a fallback for the rare case where -// both shutting-down and terminated events fail to classify. Adding retry -// here would compound Lambda execution time for a path that rarely fires. -// The next lifecycle event will eventually stop the NAT. -func TestRace_R12_SweepIdleNATsLacksRetry(t *testing.T) { - mock := &mockEC2{} - natTags := []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}} - workTags := []ec2types.Tag{{Key: aws.String("App"), Value: aws.String("web")}} - natInst := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, nil) - staleInst := makeTestInstance("i-stale", "running", testVPC, testAZ, workTags, nil) - - var callIdx int32 - mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { - idx := atomic.AddInt32(&callIdx, 1) - if idx == 1 { - // classify: instance gone - return describeResponse(), nil - } - if params.Filters != nil { - for _, f := range params.Filters { - if aws.ToString(f.Name) == "tag:nat-zero:managed" { - return describeResponse(natInst), nil - } - } - } - // findSiblings: stale sibling (no retry in sweep path) - return describeResponse(staleInst), nil - } - h := newTestHandler(mock) - err := h.HandleRequest(context.Background(), Event{InstanceID: "i-gone", State: "terminated"}) - if err != nil { - t.Fatalf("unexpected error: %v", err) - } - // Accepted risk: NAT not stopped because stale sibling found (no retry) - if mock.callCount("StopInstances") != 0 { - t.Error("expected StopInstances=0 (sweep has no retry, stale sibling blocks stop)") - } -} diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md 
index 7fb7182..ddc93ef 100644 --- a/docs/ARCHITECTURE.md +++ b/docs/ARCHITECTURE.md @@ -27,7 +27,8 @@ The module deploys an EventBridge rule that watches for EC2 state changes, and a │ ┌──────────────────┐ ┌──────────────────┐ │ │ │ EventBridge │───>│ Lambda Function │ │ │ │ EC2 State Change │ │ nat-zero │ │ - │ └──────────────────┘ └────────┬─────────┘ │ + │ └──────────────────┘ │ concurrency = 1 │ │ + │ └────────┬─────────┘ │ │ │ │ │ ┌──────────────┼──────────────┐ │ │ │ │ │ │ @@ -39,71 +40,112 @@ The module deploys an EventBridge rule that watches for EC2 state changes, and a └──────────────────────────────────────────────────────────────────┘ ``` -## Event Flow +## Reconciliation Model + +The Lambda uses a **reconciliation pattern** with **reserved concurrency of 1** (single writer). Every invocation performs the same observe-compare-act loop regardless of which event triggered it: + +1. **Resolve**: determine the AZ from the trigger instance (or sweep all AZs if the instance is gone) +2. **Observe**: query workloads, NAT instances, and EIPs for that AZ +3. **Decide**: compare actual state to desired state +4. **Act**: take at most ONE mutating action, then return + +The next event picks up where this one left off. No waiting, no polling, no retries. + +### Why Single Writer Eliminates Races + +With `reserved_concurrent_executions = 1`, only one Lambda invocation runs at a time. Events that arrive during execution are queued by the Lambda service and processed sequentially. 
This means: + +- No duplicate NAT creation (only one invocation can call `RunInstances`) +- No double EIP allocation (only one invocation can call `AllocateAddress`) +- No start/stop overlap (only one invocation can modify the NAT state) +- No need for re-check loops or retry logic + +### Reconciliation Logic + +``` +reconcile(az, vpc): + workloads = findWorkloads(az, vpc) # pending/running, excluding NAT + ignored + nats = findNATs(az, vpc) # pending/running/stopping/stopped + eips = findEIPs(az) # tagged for this AZ + + # Deduplicate NATs (safety net for pre-existing duplicates) + if len(nats) > 1: terminateDuplicateNATs(nats) + + needNAT = len(workloads) > 0 + + # --- NAT convergence (one action per invocation) --- + if needNAT: + if no NAT or NAT terminating: createNAT → return + if NAT has outdated config: terminateNAT → return + if NAT stopped: startNAT → return + if NAT stopping: return (wait for next event) + # NAT pending or running — good + else: + if NAT running or pending: stopNAT (Force) → return + # NAT stopping/stopped/nil — good + + # --- EIP convergence --- + if NAT running and no EIPs: allocateAndAttachEIP → return + if NAT not running and EIPs exist: releaseEIPs → return + if multiple EIPs: releaseExtras → return + + # Converged — no action needed +``` + +### Event Agnosticism + +The reconciler does NOT care whether the trigger event came from a NAT instance or a workload. There is no classify step that branches on instance type. + +- **Workload `pending` event** → resolveAZ → reconcile → creates/starts NAT if needed +- **NAT `running` event** → resolveAZ → reconcile → attaches EIP if needed +- **Workload `terminated` event** → resolveAZ → reconcile → stops NAT if no workloads +- **NAT `stopped` event** → resolveAZ → reconcile → releases EIP if present +- **Instance gone from API** → sweep all configured AZs → reconcile each -Every EC2 state change in the account fires an EventBridge event. 
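+The resolve step that feeds each reconcile can be sketched in the same style (simplified; the helper names here are illustrative, not the exact Go function names):
+
+```
+resolveAZ(event):
+  inst = describeByID(event.instanceID)
+  if inst is gone: sweepAllAZs()           # instance already purged from the EC2 API
+  else if inst.vpc != targetVPC or ignored(inst): return  # not ours — no action
+  else: reconcile(inst.az, inst.vpc)
+```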
The Lambda classifies each instance as: **ignore** (wrong VPC / ignore tag), **NAT** (has `nat-zero:managed=true` tag), or **workload** (everything else). +The event is just a signal that "something changed in this AZ." The reconciler always computes the correct answer from current state. + +## Event Flow ### Scale-up: Workload starts, NAT created ``` 1. Workload → pending - Lambda: classify → workload, starting - Action: findNAT → none → createNAT (RunInstances) + reconcile: workloads=1, NAT=nil → createNAT 2. NAT → pending - Lambda: classify → NAT, starting - Action: attachEIP → wait for running... (not yet, will retry on next event) + reconcile: workloads=1, NAT=pending, EIPs=0 → converged (NAT not yet running) 3. NAT → running - Lambda: classify → NAT, starting - Action: attachEIP → instance running → allocate EIP → associate to public ENI + reconcile: workloads=1, NAT=running, EIPs=0 → allocateAndAttachEIP Result: NAT has internet via EIP ✓ 4. Workload → running - Lambda: classify → workload, starting - Action: findNAT → found running NAT → no-op + reconcile: workloads=1, NAT=running, EIPs=1 → converged ✓ ``` ### Scale-down: Workload terminates, NAT stopped ``` 1. Workload → shutting-down - Lambda: classify → workload, terminating - Action: findNAT → found running NAT → findSiblings → none (3x retry) → stopNAT + reconcile: workloads=0, NAT=running → stopNAT (Force=true) 2. NAT → stopping - Lambda: classify → NAT, stopping - Action: detachEIP → wait for stopped... (not yet) + reconcile: workloads=0, NAT=stopping → converged (waiting for stopped) 3. NAT → stopped - Lambda: classify → NAT, stopping - Action: detachEIP → instance stopped → disassociate EIP → release EIP + reconcile: workloads=0, NAT=stopped, EIPs=1 → releaseEIPs Result: NAT idle, no EIP charge ✓ - -4. 
Workload → terminated - Lambda: classify → workload, terminating - Action: findNAT → found stopped NAT → NAT not in starting state → no-op ``` ### Restart: New workload starts, stopped NAT restarted ``` 1. New workload → pending - Lambda: classify → workload, starting - Action: findNAT → found stopped NAT → startNAT (wait stopped → StartInstances) + reconcile: workloads=1, NAT=stopped → startNAT -2. NAT → pending - Lambda: classify → NAT, starting - Action: attachEIP → wait for running... (not yet) - -3. NAT → running - Lambda: classify → NAT, starting - Action: attachEIP → instance running → allocate EIP → associate to public ENI +2. NAT → pending → running + reconcile: workloads=1, NAT=running, EIPs=0 → allocateAndAttachEIP Result: NAT has internet via EIP ✓ - -4. New workload → running - Lambda: classify → workload, starting - Action: findNAT → found running NAT → no-op ``` ### Terraform destroy @@ -114,101 +156,6 @@ Action: find all NAT instances → terminate → release all EIPs Result: clean state for ENI/SG destruction ✓ ``` -### Why this is safe from races - -- **EIP attach is idempotent**: `attachEIP` checks if the ENI already has an EIP before allocating. Multiple concurrent `running` events for the same NAT are harmless. -- **EIP detach is idempotent**: `detachEIP` checks if the ENI has an association before releasing. -- **NAT dedup**: `findNAT` terminates extras if multiple NATs exist in one AZ. -- **Workload handlers never touch EIPs**: Only NAT events manage EIPs. Workload events only start/stop/create NAT instances. - -## Scale-Up Sequence - -``` - Workload EventBridge Lambda EC2 API NAT Instance - Instance (per AZ) - │ │ │ │ │ - │ state:"pending"│ │ │ │ - ├───────────────>│ │ │ │ - │ │ invoke │ │ │ - │ ├───────────────>│ │ │ - │ │ │ │ │ - │ │ │ describe_instances(id) │ - │ │ ├───────────────>│ │ - │ │ │<───────────────┤ │ - │ │ │ │ │ - │ │ │ Check: VPC matches? Not ignored? Not NAT? 
- │ │ │ │ │ - │ │ │ describe_instances(NAT tag, AZ, VPC) - │ │ ├───────────────>│ │ - │ │ │<───────────────┤ │ - │ │ │ │ │ - │ ┌─────┴────────────────┴────────────────┴────────┐ │ - │ │ IF no NAT instance: │ │ - │ │ describe_launch_templates(AZ, VPC) │ │ - │ │ describe_images(fck-nat pattern) │ │ - │ │ run_instances(template, AMI) ──────────────>│──────>│ Created - │ │ │ │ - │ │ IF NAT stopped: │ │ - │ │ start_instances(nat_id) ───────────────────>│──────>│ Starting - │ │ │ │ - │ │ IF NAT already running: │ │ - │ │ No action needed │ │ - │ └─────┬────────────────┬────────────────┬────────┘ │ - │ │ │ │ │ - │ │ │ │ state:"running" - │ │ invoke │ │<───────────────┤ - │ ├───────────────>│ │ │ - │ │ │ allocate_address │ - │ │ │ associate_address │ - │ │ ├───────────────>│ │ - │ │ │ │──── EIP ──────>│ - │ │ │ │ NAT ready -``` - -## Scale-Down Sequence - -``` - Workload EventBridge Lambda EC2 API NAT Instance - Instance (per AZ) - │ │ │ │ │ - │state:"stopping"│ │ │ │ - ├───────────────>│ │ │ │ - │ │ invoke │ │ │ - │ ├───────────────>│ │ │ - │ │ │ │ │ - │ │ │ describe_instances(id) │ - │ │ ├───────────────>│ │ - │ │ │ Check: VPC, not ignored, not NAT - │ │ │ │ │ - │ │ ┌──────────┴──────────┐ │ │ - │ │ │ Retry loop (3x, 2s) │ │ │ - │ │ │ describe_instances │ │ │ - │ │ │ (AZ, VPC, running) ├───>│ │ - │ │ │ filter out NAT + │<───┤ │ - │ │ │ ignored instances │ │ │ - │ │ └──────────┬──────────┘ │ │ - │ │ │ │ │ - │ ┌─────┴────────────────┴────────────────┴────────┐ │ - │ │ IF no siblings remain: │ │ - │ │ stop_instances(nat_id) ─────────────────────>│──────>│ Stopping - │ │ │ │ - │ │ IF siblings still running: │ │ - │ │ Keep NAT running, no action │ │ - │ └─────┬────────────────┬────────────────┬────────┘ │ - │ │ │ │ │ - │ │ │ │ state:"stopped" - │ │ invoke │ │<───────────────┤ - │ ├───────────────>│ │ │ - │ │ │ disassociate_address │ - │ │ │ release_address │ - │ │ ├───────────────>│ │ - │ │ │ │ EIP released │ - │ │ │ │ │ - │ │ │ │ NAT stopped │ - │ │ │ │ Cost: ~$0.80/mo - │ │ 
│ │ (EBS only) │ -``` - ## Dual ENI Architecture Each NAT instance uses two Elastic Network Interfaces (ENIs) to separate public and private traffic. ENIs are pre-created by Terraform and attached via the launch template, so they persist across instance stop/start cycles. @@ -235,7 +182,7 @@ Each NAT instance uses two Elastic Network Interfaces (ENIs) to separate public Key design decisions: - **Pre-created ENIs**: ENIs are Terraform-managed and referenced in the launch template. They survive instance stop/start, preserving route table entries. - **source_dest_check=false**: Required on both ENIs for NAT to work (instance forwards packets not addressed to itself). -- **EIP lifecycle**: Elastic IPs are allocated when the NAT instance reaches "running" and released when it reaches "stopped", both via EventBridge events. This avoids charges for unused EIPs. +- **EIP lifecycle**: Elastic IPs are allocated when the NAT instance reaches "running" and released when it reaches "stopped", both managed by the reconciliation loop. This avoids charges for unused EIPs. ## Comparison with fck-nat @@ -254,14 +201,15 @@ nat-zero builds on top of fck-nat -- it uses the same AMI and the same iptables- │ ┌──────────────────────┐ │ │ v │ │ │ NAT Instance │ │ │ ┌────────────┐ │ │ │ Always running │ │ │ │ Lambda │ │ - │ │ │ │ │ │ Orchestr. 
│ │ - │ └──────────────────────┘ │ │ └──────┬─────┘ │ - │ │ │ │ │ - │ Cost: ~$7-8/mo │ │ v │ - │ (instance + EIP 24/7) │ │ ┌────────────────────┐ │ - │ Self-healing via ASG │ │ │ NAT Instance │ │ - │ No Lambda needed │ │ │ Started on demand │ │ - └────────────────────────────┘ │ │ Stopped when idle │ │ + │ │ │ │ │ │ Reconciler │ │ + │ └──────────────────────┘ │ │ │ (conc = 1) │ │ + │ │ │ └──────┬─────┘ │ + │ Cost: ~$7-8/mo │ │ │ │ + │ (instance + EIP 24/7) │ │ v │ + │ Self-healing via ASG │ │ ┌────────────────────┐ │ + │ No Lambda needed │ │ │ NAT Instance │ │ + └────────────────────────────┘ │ │ Started on demand │ │ + │ │ Stopped when idle │ │ │ └────────────────────┘ │ │ │ │ Cost: ~$0.80/mo (idle) │ @@ -281,139 +229,5 @@ Costs per AZ, per month. Includes the [AWS public IPv4 charge](https://aws.amazo | Scale-to-zero | No | Yes | | Self-healing | ASG replaces unhealthy | Lambda creates new on demand | | AMI | fck-nat AMI | fck-nat AMI (same) | -| Complexity | Low (ASG only) | Higher (Lambda + EventBridge) | +| Complexity | Low (ASG only) | Moderate (Lambda + EventBridge) | | Best for | Production 24/7 | Dev/staging, intermittent workloads | - -## Race Conditions - -Because multiple Lambda invocations can fire concurrently from overlapping EventBridge events, and because the EC2 API is eventually consistent, the Lambda must handle numerous race conditions. This section catalogs each identified race, its severity, and how (or whether) it is mitigated. 
- -### Race Condition Catalog - -| ID | Description | Trigger | Mitigation | Status | Test | -|----|-------------|---------|------------|--------|------| -| R1 | **Stale sibling from EC2 eventual consistency** — dying workload still shows as `running` in DescribeInstances | Scale-down event fires before EC2 API reflects the state change | `findSiblings` excludes trigger instance ID; `maybeStopNAT` retries 3x with 2s delay | MITIGATED | `TestRace_R1` | -| R2 | **Terminated instance gone from API** — `classify` returns `ignore=true`, scale-down event lost | Instance already purged from EC2 API by the time Lambda runs | Handler detects `isTerminating(state)` + `ignore` and calls `sweepIdleNATs` to check all NATs | MITIGATED | `TestRace_R2` | -| R3 | **Retry exhaustion** — EC2 consistency takes >6s (3x2s retries), false siblings persist | Unusually long EC2 API propagation delay | None — NAT stays running until next event or sweep catches it | ACCEPTED | `TestRace_R3` | -| R4 | **Duplicate NAT creation** — two concurrent workload events both see no NAT, both call `createNAT` | Two workloads start simultaneously in the same AZ | `findNAT` detects multiple NATs, keeps the first running one, terminates extras | MITIGATED | `TestRace_R4` | -| R5 | **Start/stop overlap** — scale-up starts NAT while concurrent scale-down stops it | Workload starts while last workload is terminating | `startNAT` waits for `stopped` state then starts; brief delay but correct | ACCEPTED | `TestRace_R5` | -| R6 | **Double EIP allocation** — concurrent pending+running events both allocate EIPs | Two EventBridge events for same NAT instance arrive concurrently | `attachEIP` re-checks ENI after `AllocateAddress`; releases duplicate if EIP already present | MITIGATED | `TestRace_R6` | -| R7 | **Associate fails after re-check** — another invocation associates between re-check and `AssociateAddress` | Very tight race window between DescribeNetworkInterfaces and AssociateAddress | `attachEIP` releases 
allocated EIP on `AssociateAddress` failure | MITIGATED | `TestRace_R7` | -| R8 | **Disassociate on already-removed association** — EC2 auto-disassociates EIP on stop before Lambda runs | EC2 instance stop completes and auto-removes EIP before `detachEIP` | `detachEIP` catches `InvalidAssociationID.NotFound` and still releases the allocation | MITIGATED | `TestRace_R8` | -| R9 | **Orphan EIP from non-NotFound error** — `DisassociateAddress` fails with throttle/other error | API throttling during EIP detach | `detachEIP` returns early without releasing; orphan sweep on next detach cleans up | UNMITIGATED | `TestRace_R9` | -| R10 | **ENI availability timeout** — ENI never reaches `available` after terminate | EC2 delay in releasing ENI from terminated instance | `replaceNAT` proceeds with `createNAT` after timeout; launch template may fail but next event retries | ACCEPTED | `TestRace_R10` | -| R11 | **EIP orphan on NAT termination** — NAT terminated without stop cycle, `detachEIP` never fires | `replaceNAT`, spot reclaim, manual termination | Handler detects `isTerminating(state)` for NAT events and calls `sweepOrphanEIPs` to release tagged EIPs | MITIGATED | `TestRace_R11` | -| R12 | **sweepIdleNATs lacks retry** — stale sibling blocks sweep from stopping idle NAT | EC2 eventual consistency during fallback sweep path | None — sweep is itself a rare fallback (R2); retry budget would compound Lambda execution time | ACCEPTED | `TestRace_R12` | - -### Why Event-Driven NAT Has Races - -Traditional NAT (e.g. fck-nat with ASG) runs a single instance continuously — no concurrency, no races. nat-zero trades that simplicity for cost savings by reacting to events. This means: - -1. **Multiple triggers per lifecycle**: A single workload going from `pending` → `running` fires two EventBridge events, each invoking a separate Lambda. A NAT instance similarly fires `pending` → `running` → `stopping` → `stopped`, each potentially overlapping with workload events. - -2. 
**EC2 eventual consistency**: When EventBridge fires a `shutting-down` event, `DescribeInstances` may still return the instance as `running` for several seconds. This is the root cause of R1, R2, and R3. - -3. **No distributed lock**: Lambda invocations run independently with no shared state. The EC2 API itself is the only coordination point, and it's eventually consistent. - -### Sequence Diagrams - -#### R1: Stale Sibling (Scale-Down Race) - -``` - EventBridge Lambda A EC2 API - │ │ │ - │ shutting-down │ │ - │ (i-work1) │ │ - ├───────────────────>│ │ - │ │ DescribeInstances │ - │ │ (findSiblings, │ - │ │ exclude=i-work1) │ - │ ├─────────────────────>│ - │ │ │ i-work1 still shows - │ │<─────────────────────┤ "running" (stale!) - │ │ │ BUT excluded by ID - │ │ │ - │ │ No siblings found │ - │ │ StopInstances(NAT) │ - │ ├─────────────────────>│ - │ │ │ -``` - -Without the `excludeID` parameter, i-work1 would count as a sibling and the NAT would never stop. The retry loop (R3) handles cases where a *different* workload is stale. 
- -#### R4: Duplicate NAT Creation (Scale-Up Race) - -``` - EventBridge Lambda A Lambda B EC2 API - │ │ │ │ - │ pending │ │ │ - │ (i-work1) │ │ │ - ├───────────────>│ │ │ - │ pending │ │ │ - │ (i-work2) │ │ │ - ├───────────────────────────────────────>│ │ - │ │ │ │ - │ │ findNAT → nil │ │ - │ ├──────────────────────────────────────────>│ - │ │ │ findNAT → nil │ - │ │ ├───────────────────>│ - │ │ │ │ - │ │ RunInstances │ │ - │ │ → i-nat1 │ │ - │ ├──────────────────────────────────────────>│ - │ │ │ RunInstances │ - │ │ │ → i-nat2 │ - │ │ ├───────────────────>│ - │ │ │ │ - │ ┌─────┴──────────────────────┴─────┐ │ - │ │ Later: any findNAT call sees │ │ - │ │ both i-nat1 and i-nat2 │ │ - │ │ → keeps first running NAT │ │ - │ │ → TerminateInstances(extra) │ │ - │ └──────────────────────────────────┘ │ -``` - -#### R6: Double EIP Allocation (Concurrent attachEIP) - -``` - EventBridge Lambda A Lambda B EC2 API - │ │ │ │ - │ pending │ │ │ - │ (NAT) │ │ │ - ├───────────────>│ │ │ - │ running │ │ │ - │ (NAT) │ │ │ - ├───────────────────────────────────────>│ │ - │ │ │ │ - │ │ Check ENI: no EIP │ │ - │ ├──────────────────────────────────────────>│ - │ │ │ Check ENI: no EIP │ - │ │ ├───────────────────>│ - │ │ │ │ - │ │ AllocateAddress │ │ - │ │ → eipalloc-A │ │ - │ ├──────────────────────────────────────────>│ - │ │ │ AllocateAddress │ - │ │ │ → eipalloc-B │ - │ │ ├───────────────────>│ - │ │ │ │ - │ │ Re-check ENI: │ │ - │ │ still no EIP │ │ - │ ├──────────────────────────────────────────>│ - │ │ │ │ - │ │ AssociateAddress │ │ - │ │ (eipalloc-A) │ │ - │ ├──────────────────────────────────────────>│ - │ │ │ │ - │ │ │ Re-check ENI: │ - │ │ │ EIP-A present! │ - │ │ ├───────────────────>│ - │ │ │ │ - │ │ │ Race detected! │ - │ │ │ ReleaseAddress │ - │ │ │ (eipalloc-B) │ - │ │ ├───────────────────>│ - │ │ │ │ -``` - -If Lambda B's re-check also misses EIP-A (very tight window), `AssociateAddress` will fail and Lambda B releases eipalloc-B in the error handler. 
The orphan sweep in `detachEIP` provides a final safety net. diff --git a/lambda.tf b/lambda.tf index e37bd35..96d311f 100644 --- a/lambda.tf +++ b/lambda.tf @@ -52,16 +52,17 @@ resource "null_resource" "build_lambda" { } resource "aws_lambda_function" "nat_zero" { - filename = "${path.module}/.build/lambda.zip" - function_name = "${var.name}-nat-zero" - handler = "bootstrap" - role = aws_iam_role.lambda_iam_role.arn - runtime = "provided.al2023" - source_code_hash = fileexists("${path.module}/.build/lambda.zip") ? filebase64sha256("${path.module}/.build/lambda.zip") : null - architectures = ["arm64"] - timeout = 300 - memory_size = var.lambda_memory_size - tags = local.common_tags + filename = "${path.module}/.build/lambda.zip" + function_name = "${var.name}-nat-zero" + handler = "bootstrap" + role = aws_iam_role.lambda_iam_role.arn + runtime = "provided.al2023" + source_code_hash = fileexists("${path.module}/.build/lambda.zip") ? filebase64sha256("${path.module}/.build/lambda.zip") : null + architectures = ["arm64"] + timeout = 30 + reserved_concurrent_executions = 1 + memory_size = var.lambda_memory_size + tags = local.common_tags environment { variables = { From cce49de4e70491c2a833f138e7f2b105615422a1 Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Wed, 25 Feb 2026 17:46:37 +1000 Subject: [PATCH 13/30] fix: correct stale NAT state from EC2 filter eventual consistency MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When the NAT "running" EventBridge event triggers reconcile, filter-based DescribeInstances may still return "pending" due to EC2 eventual consistency. The reconciler would skip EIP allocation (requires running state), log "converged", and no further events would re-trigger it — leaving the NAT running without an EIP indefinitely. 
Fix: pass the EventBridge event into reconcile() and, when the NAT appears "pending" from filters but the event says "running" for the same instance, re-query by instance ID (which returns authoritative state) before proceeding to EIP convergence. Co-Authored-By: Claude Opus 4.6 --- cmd/lambda/handler.go | 19 ++++++++++++---- cmd/lambda/handler_test.go | 44 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 59 insertions(+), 4 deletions(-) diff --git a/cmd/lambda/handler.go b/cmd/lambda/handler.go index a100b4a..2934c0b 100644 --- a/cmd/lambda/handler.go +++ b/cmd/lambda/handler.go @@ -47,7 +47,7 @@ func (h *Handler) handle(ctx context.Context, event Event) error { return nil } - h.reconcile(ctx, az, vpc) + h.reconcile(ctx, az, vpc, event) return nil } @@ -73,13 +73,13 @@ func (h *Handler) sweepAllAZs(ctx context.Context) { defer timed("sweep_all_azs")() azs := h.findConfiguredAZs(ctx) for _, az := range azs { - h.reconcile(ctx, az, h.TargetVPC) + h.reconcile(ctx, az, h.TargetVPC, Event{}) } } // reconcile observes the current state of workloads, NAT, and EIPs in an AZ, // then takes at most one mutating action to converge toward the desired state. -func (h *Handler) reconcile(ctx context.Context, az, vpc string) { +func (h *Handler) reconcile(ctx context.Context, az, vpc string, event Event) { defer timed("reconcile")() workloads := h.findWorkloads(ctx, az, vpc) @@ -119,7 +119,18 @@ func (h *Handler) reconcile(ctx context.Context, az, vpc string) { log.Printf("NAT %s is stopping, waiting for next event", nat.InstanceID) return } - // nat is pending or running — good + // nat is pending or running — good. + // If the NAT appears "pending" from filters but the EventBridge event + // says it's "running", re-query by instance ID for authoritative state. + // Filter-based DescribeInstances is subject to EC2 eventual consistency + // and may lag behind the actual state transition. 
+ if nat.StateName == "pending" && event.InstanceID == nat.InstanceID && event.State == "running" { + log.Printf("NAT %s shows pending in filters but event says running, re-querying", nat.InstanceID) + fresh := h.getInstance(ctx, nat.InstanceID) + if fresh != nil { + nat = fresh + } + } } else { if nat != nil && (nat.StateName == "running" || nat.StateName == "pending") { log.Printf("No workloads in %s, stopping NAT %s", az, nat.InstanceID) diff --git a/cmd/lambda/handler_test.go b/cmd/lambda/handler_test.go index 7dfce1b..0d659eb 100644 --- a/cmd/lambda/handler_test.go +++ b/cmd/lambda/handler_test.go @@ -605,6 +605,50 @@ func TestReconcileNATEvent(t *testing.T) { } }) + t.Run("NAT running event with stale pending filter attaches EIP", func(t *testing.T) { + // Simulates EC2 eventual consistency: EventBridge says "running" but + // filter-based DescribeInstances still returns "pending". The reconciler + // should re-query by instance ID and get the true "running" state. + mock := &mockEC2{} + workInst := makeTestInstance("i-work1", "running", testVPC, testAZ, workTags, nil) + eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) + natPending := makeTestInstance("i-nat1", "pending", testVPC, testAZ, natTags, []ec2types.InstanceNetworkInterface{eni}) + natRunning := makeTestInstance("i-nat1", "running", testVPC, testAZ, natTags, []ec2types.InstanceNetworkInterface{eni}) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + if len(params.InstanceIds) > 0 { + // By-ID queries return the true state + return describeResponse(natRunning), nil + } + for _, f := range params.Filters { + if aws.ToString(f.Name) == "tag:nat-zero:managed" { + // Filter query lags — still shows pending + return describeResponse(natPending), nil + } + } + return describeResponse(workInst), nil + } + mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns 
...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + return &ec2.DescribeAddressesOutput{}, nil + } + mock.AllocateAddressFn = func(ctx context.Context, params *ec2.AllocateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AllocateAddressOutput, error) { + return &ec2.AllocateAddressOutput{AllocationId: aws.String("eipalloc-1"), PublicIp: aws.String("1.2.3.4")}, nil + } + mock.AssociateAddressFn = func(ctx context.Context, params *ec2.AssociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.AssociateAddressOutput, error) { + return &ec2.AssociateAddressOutput{}, nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-nat1", State: "running"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("AllocateAddress") != 1 { + t.Errorf("expected AllocateAddress=1, got %d (stale pending should be corrected via by-ID query)", mock.callCount("AllocateAddress")) + } + if mock.callCount("AssociateAddress") != 1 { + t.Errorf("expected AssociateAddress=1, got %d", mock.callCount("AssociateAddress")) + } + }) + t.Run("NAT terminated event with workloads creates new", func(t *testing.T) { mock := &mockEC2{} workInst := makeTestInstance("i-work1", "running", testVPC, testAZ, workTags, nil) From 884510aaca63584df9571137887f3b6460a907b1 Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Wed, 25 Feb 2026 18:17:29 +1000 Subject: [PATCH 14/30] test: always dump Lambda CloudWatch logs before destroy The log group gets deleted during terraform destroy, so dump logs on every run (not just on failure) for observability into reconciler behavior. 
Co-Authored-By: Claude Opus 4.6 --- tests/integration/nat_zero_test.go | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/tests/integration/nat_zero_test.go b/tests/integration/nat_zero_test.go index e08350d..bf8429c 100644 --- a/tests/integration/nat_zero_test.go +++ b/tests/integration/nat_zero_test.go @@ -152,13 +152,11 @@ func TestNatZero(t *testing.T) { } }() - // Dump Lambda CloudWatch logs on failure for diagnostics. + // Dump Lambda CloudWatch logs before destroy for diagnostics. cwClient := cloudwatchlogs.New(sess) logGroup := fmt.Sprintf("/aws/lambda/%s", lambdaName) defer func() { - if t.Failed() { - dumpLambdaLogs(t, cwClient, logGroup) - } + dumpLambdaLogs(t, cwClient, logGroup) }() amiID := getLatestAL2023AMI(t, ec2Client) From 98722a2b3559539e5bba84a961875b6dfa981733 Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Wed, 25 Feb 2026 18:40:28 +1000 Subject: [PATCH 15/30] fix: distinguish waiting from converged in reconcile log, expand log dump - Log "waiting" instead of "converged" when NAT is still pending, since the system hasn't actually converged (EIP can't be attached yet). - Read CloudWatch logs from head with a 500-event limit instead of 50-event tail, so the full Lambda lifecycle is captured in test output. 
Co-Authored-By: Claude Opus 4.6 --- cmd/lambda/handler.go | 9 +++++++-- tests/integration/nat_zero_test.go | 4 ++-- 2 files changed, 9 insertions(+), 4 deletions(-) diff --git a/cmd/lambda/handler.go b/cmd/lambda/handler.go index 2934c0b..0c064b5 100644 --- a/cmd/lambda/handler.go +++ b/cmd/lambda/handler.go @@ -158,8 +158,13 @@ func (h *Handler) reconcile(ctx context.Context, az, vpc string, event Event) { return } - log.Printf("Reconcile %s: converged (workloads=%d, nat=%s, eips=%d)", - az, len(workloads), natState(nat), len(eips)) + if nat != nil && nat.StateName == "pending" { + log.Printf("Reconcile %s: waiting (workloads=%d, nat=pending, eips=%d)", + az, len(workloads), len(eips)) + } else { + log.Printf("Reconcile %s: converged (workloads=%d, nat=%s, eips=%d)", + az, len(workloads), natState(nat), len(eips)) + } } func natState(nat *Instance) string { diff --git a/tests/integration/nat_zero_test.go b/tests/integration/nat_zero_test.go index bf8429c..9b81aa5 100644 --- a/tests/integration/nat_zero_test.go +++ b/tests/integration/nat_zero_test.go @@ -472,8 +472,8 @@ func dumpLambdaLogs(t *testing.T, client *cloudwatchlogs.CloudWatchLogs, logGrou events, err := client.GetLogEvents(&cloudwatchlogs.GetLogEventsInput{ LogGroupName: aws.String(logGroup), LogStreamName: stream.LogStreamName, - StartFromHead: aws.Bool(false), - Limit: aws.Int64(50), + StartFromHead: aws.Bool(true), + Limit: aws.Int64(500), }) if err != nil { t.Logf("Warning: could not read log events: %v", err) From 245904f215dc1d890cbd8cc074dafc1dc999235a Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Wed, 25 Feb 2026 19:34:08 +1000 Subject: [PATCH 16/30] fix: log waiting for nat=stopping, reduce Lambda memory to 128 MB - The !needNAT path logged "converged" when NAT was still stopping, which is a transient state. Now logs "waiting" and returns early, matching the pending fix on the needNAT side. 
- Lambda uses 29-30 MB at peak; 128 MB is more than sufficient and halves the per-invocation cost vs 256 MB. Co-Authored-By: Claude Opus 4.6 --- cmd/lambda/handler.go | 7 ++++++- variables.tf | 2 +- 2 files changed, 7 insertions(+), 2 deletions(-) diff --git a/cmd/lambda/handler.go b/cmd/lambda/handler.go index 0c064b5..d2b2e1f 100644 --- a/cmd/lambda/handler.go +++ b/cmd/lambda/handler.go @@ -137,7 +137,12 @@ func (h *Handler) reconcile(ctx context.Context, az, vpc string, event Event) { h.stopInstance(ctx, nat.InstanceID) return } - // nat is stopping/stopped/nil — good + if nat != nil && nat.StateName == "stopping" { + log.Printf("Reconcile %s: waiting (workloads=0, nat=stopping, eips=%d)", + az, len(eips)) + return + } + // nat is stopped/nil — good } // --- EIP convergence --- diff --git a/variables.tf b/variables.tf index 74e9d65..1e09963 100644 --- a/variables.tf +++ b/variables.tf @@ -113,7 +113,7 @@ variable "ignore_tag_value" { variable "lambda_memory_size" { type = number - default = 256 + default = 128 description = "Memory allocated to the Lambda function in MB (also scales CPU proportionally)" validation { From b41f3d6eeab2e0ab497c2202b6fd9d7050be1c38 Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Wed, 25 Feb 2026 20:22:57 +1000 Subject: [PATCH 17/30] docs: update terraform-docs for lambda_memory_size default change Co-Authored-By: Claude Opus 4.6 --- README.md | 2 +- docs/REFERENCE.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 5a62f85..d5b6470 100644 --- a/README.md +++ b/README.md @@ -191,7 +191,7 @@ No modules. | [ignore\_tag\_value](#input\_ignore\_tag\_value) | Tag value used to mark instances the Lambda should ignore | `string` | `"true"` | no | | [instance\_type](#input\_instance\_type) | Instance type for the NAT instance | `string` | `"t4g.nano"` | no | | [lambda\_binary\_url](#input\_lambda\_binary\_url) | URL to the pre-compiled Go Lambda zip. Updated automatically by CI. 
| `string` | `"https://github.com/MachineDotDev/nat-zero/releases/download/nat-zero-lambda-latest/lambda.zip"` | no | -| [lambda\_memory\_size](#input\_lambda\_memory\_size) | Memory allocated to the Lambda function in MB (also scales CPU proportionally) | `number` | `256` | no | +| [lambda\_memory\_size](#input\_lambda\_memory\_size) | Memory allocated to the Lambda function in MB (also scales CPU proportionally) | `number` | `128` | no | | [log\_retention\_days](#input\_log\_retention\_days) | CloudWatch log retention in days (only used when enable\_logging is true) | `number` | `14` | no | | [market\_type](#input\_market\_type) | Whether to use spot or on-demand instances | `string` | `"on-demand"` | no | | [name](#input\_name) | Name prefix for all resources created by this module | `string` | n/a | yes | diff --git a/docs/REFERENCE.md b/docs/REFERENCE.md index 30791e4..4683876 100644 --- a/docs/REFERENCE.md +++ b/docs/REFERENCE.md @@ -59,7 +59,7 @@ No modules. | [ignore\_tag\_value](#input\_ignore\_tag\_value) | Tag value used to mark instances the Lambda should ignore | `string` | `"true"` | no | | [instance\_type](#input\_instance\_type) | Instance type for the NAT instance | `string` | `"t4g.nano"` | no | | [lambda\_binary\_url](#input\_lambda\_binary\_url) | URL to the pre-compiled Go Lambda zip. Updated automatically by CI. 
| `string` | `"https://github.com/MachineDotDev/nat-zero/releases/download/nat-zero-lambda-latest/lambda.zip"` | no | -| [lambda\_memory\_size](#input\_lambda\_memory\_size) | Memory allocated to the Lambda function in MB (also scales CPU proportionally) | `number` | `256` | no | +| [lambda\_memory\_size](#input\_lambda\_memory\_size) | Memory allocated to the Lambda function in MB (also scales CPU proportionally) | `number` | `128` | no | | [log\_retention\_days](#input\_log\_retention\_days) | CloudWatch log retention in days (only used when enable\_logging is true) | `number` | `14` | no | | [market\_type](#input\_market\_type) | Whether to use spot or on-demand instances | `string` | `"on-demand"` | no | | [name](#input\_name) | Name prefix for all resources created by this module | `string` | n/a | yes | From 87440a8ca4814d94f128bb62c9b5a7300f28fb63 Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Thu, 26 Feb 2026 05:27:09 +1000 Subject: [PATCH 18/30] fix: wait for NAT fully terminated before terraform destroy The cleanup phase checked only pending/running states to verify NAT termination, but the instance could still be in shutting-down state with ENIs attached. Terraform destroy would then try to detach ENIs, which the CI IAM role lacks permission for. Now waits for the instance to fully disappear from all non-terminated states. Co-Authored-By: Claude Opus 4.6 --- tests/integration/nat_zero_test.go | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/tests/integration/nat_zero_test.go b/tests/integration/nat_zero_test.go index 9b81aa5..a96fc19 100644 --- a/tests/integration/nat_zero_test.go +++ b/tests/integration/nat_zero_test.go @@ -387,10 +387,14 @@ func TestNatZero(t *testing.T) { // Verify NAT instances are terminated. 
t.Log("Verifying NAT instances terminated...") natTermStart := time.Now() - retry.DoWithRetry(t, "NAT terminated", 20, 5*time.Second, func() (string, error) { - nats := findNATInstances(t, ec2Client, vpcID) + retry.DoWithRetry(t, "NAT terminated", 40, 5*time.Second, func() (string, error) { + // Wait for fully terminated (not just absent from pending/running) + // so ENIs are released before terraform destroy tries to delete them. + nats := findNATInstancesInState(t, ec2Client, vpcID, + []string{"pending", "running", "shutting-down", "stopping", "stopped"}) if len(nats) > 0 { - return "", fmt.Errorf("still %d running NAT instances", len(nats)) + return "", fmt.Errorf("still %d NAT instances (%s)", + len(nats), aws.StringValue(nats[0].State.Name)) } return "OK", nil }) From 3cad0ae0bde5efc9015b6893c3da6fe1a36bf7b5 Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Thu, 26 Feb 2026 05:43:23 +1000 Subject: [PATCH 19/30] fix: wait for NAT instance termination in cleanup before returning The Lambda's cleanupAll (invoked by terraform destroy) called TerminateInstances but returned immediately without waiting. Since the module's ENIs use delete_on_termination=false, they remain attached until the instance fully terminates. If Terraform proceeds to delete ENIs while the instance is still shutting down, it needs ec2:DetachNetworkInterface which users may not have. Now polls DescribeInstances until all terminated instances disappear, guaranteeing ENIs are detached before the Lambda returns and Terraform continues its destroy plan. 
Co-Authored-By: Claude Opus 4.6 --- cmd/lambda/ec2ops.go | 36 ++++++++++++++++++++++++++++++++++-- cmd/lambda/ec2ops_test.go | 9 +++++++++ 2 files changed, 43 insertions(+), 2 deletions(-) diff --git a/cmd/lambda/ec2ops.go b/cmd/lambda/ec2ops.go index c93b1d5..9e149fb 100644 --- a/cmd/lambda/ec2ops.go +++ b/cmd/lambda/ec2ops.go @@ -7,6 +7,7 @@ import ( "log" "sort" "strings" + "time" "github.com/aws/aws-sdk-go-v2/aws" "github.com/aws/aws-sdk-go-v2/service/ec2" @@ -442,6 +443,7 @@ func (h *Handler) cleanupAll(ctx context.Context) { h.EC2.TerminateInstances(ctx, &ec2.TerminateInstancesInput{ InstanceIds: instanceIDs, }) + h.waitForTermination(ctx, instanceIDs) } // Release EIPs. @@ -471,10 +473,40 @@ func (h *Handler) cleanupAll(ctx context.Context) { } } } +} - if len(instanceIDs) > 0 { - log.Println("NAT instance termination initiated") +// waitForTermination polls until all instances reach the terminated state, +// ensuring ENIs are fully detached before returning. This is critical for +// terraform destroy: the module's pre-created ENIs (delete_on_termination=false) +// remain attached until the instance is fully terminated. If cleanupAll returns +// before termination completes, Terraform may try to delete still-attached ENIs. 
+func (h *Handler) waitForTermination(ctx context.Context, instanceIDs []string) { + defer timed("wait_for_termination")() + for attempt := 0; attempt < 60; attempt++ { + time.Sleep(2 * time.Second) + resp, err := h.EC2.DescribeInstances(ctx, &ec2.DescribeInstancesInput{ + InstanceIds: instanceIDs, + Filters: []ec2types.Filter{ + {Name: aws.String("instance-state-name"), Values: []string{ + "pending", "running", "shutting-down", "stopping", "stopped", + }}, + }, + }) + if err != nil { + log.Printf("Error polling termination status: %v", err) + return + } + remaining := 0 + for _, r := range resp.Reservations { + remaining += len(r.Instances) + } + if remaining == 0 { + log.Printf("All %d NAT instances terminated", len(instanceIDs)) + return + } + log.Printf("Waiting for %d instance(s) to terminate...", remaining) } + log.Printf("Timed out waiting for instance termination") } // isErrCode returns true if the error (or any wrapped error) has the given diff --git a/cmd/lambda/ec2ops_test.go b/cmd/lambda/ec2ops_test.go index f69bb0f..8d6f40c 100644 --- a/cmd/lambda/ec2ops_test.go +++ b/cmd/lambda/ec2ops_test.go @@ -513,9 +513,18 @@ func TestCleanupAll(t *testing.T) { mock := &mockEC2{} nat1 := makeTestInstance("i-nat1", "running", testVPC, testAZ, nil, nil) nat2 := makeTestInstance("i-nat2", "stopped", testVPC, testAZ, nil, nil) + terminated := false mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + if terminated { + // After termination, instances are gone (waitForTermination polls this) + return describeResponse(), nil + } return describeResponse(nat1, nat2), nil } + mock.TerminateInstancesFn = func(ctx context.Context, params *ec2.TerminateInstancesInput, optFns ...func(*ec2.Options)) (*ec2.TerminateInstancesOutput, error) { + terminated = true + return &ec2.TerminateInstancesOutput{}, nil + } mock.DescribeAddressesFn = func(ctx context.Context, params 
*ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { return &ec2.DescribeAddressesOutput{ Addresses: []ec2types.Address{{ From 5395766a1bd33ba31d766e365318757989de0555 Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Thu, 26 Feb 2026 10:02:59 +1000 Subject: [PATCH 20/30] docs: simplify all documentation to match reconciliation pattern Rewrite docs with KISS philosophy: clear decision matrix tables, accurate performance numbers from real CloudWatch data, remove outdated event-driven/reactive language and Python comparison history. Net reduction of ~388 lines. Co-Authored-By: Claude Opus 4.5 --- README.md | 131 ++++++++------------ docs/ARCHITECTURE.md | 277 +++++++++++++++---------------------------- docs/INDEX.md | 135 ++------------------- docs/PERFORMANCE.md | 181 ++++++++++------------------ docs/TESTING.md | 186 +++++++---------------------- mkdocs.yml | 2 +- 6 files changed, 262 insertions(+), 650 deletions(-) diff --git a/README.md b/README.md index d5b6470..5734ab9 100644 --- a/README.md +++ b/README.md @@ -2,68 +2,61 @@ **Scale-to-zero NAT instances for AWS.** Stop paying for NAT when nothing is running. -nat-zero is a Terraform module that brings event-driven, scale-to-zero NAT to your AWS VPCs. When a workload starts in a private subnet, a NAT instance spins up automatically. When the last workload stops, the NAT shuts down and its Elastic IP is released. You pay nothing while idle -- just ~\$0.80/mo for a stopped EBS volume. +nat-zero is a Terraform module that replaces always-on NAT with on-demand NAT instances. When a workload launches in a private subnet, a NAT instance starts automatically. When the last workload stops, the NAT shuts down and its Elastic IP is released. Idle cost: ~$0.80/month per AZ. -Built on [fck-nat](https://fck-nat.dev/) AMIs. Orchestrated by a Go Lambda with a 55 ms cold start. Proven by real integration tests that deploy infrastructure and verify connectivity end-to-end. 
+Built on [fck-nat](https://fck-nat.dev/) AMIs. Orchestrated by a single Go Lambda (~55 ms cold start, 29 MB memory). Integration-tested against real AWS infrastructure on every PR. ``` - CONTROL PLANE - ┌──────────────────────────────────────────────────┐ - │ EventBridge ──> Lambda (NAT Orchestrator) │ - │ │ start/stop instances │ - │ │ allocate/release EIPs │ - └────────────────────┼─────────────────────────────┘ - │ - ┌────────────┴────────────┐ - v v - AZ-A (active) AZ-B (idle) - ┌──────────────────┐ ┌──────────────────┐ - │ Workloads │ │ No workloads │ - │ ↓ route table │ │ No NAT instance │ - │ Private ENI │ │ No EIP │ - │ ↓ │ │ │ - │ NAT Instance │ │ Cost: ~$0.80/mo │ - │ ↓ │ │ (EBS only) │ - │ Public ENI + EIP │ │ │ - │ ↓ │ └──────────────────┘ + AZ-A (active) AZ-B (idle) + ┌──────────────────┐ ┌──────────────────┐ + │ Workloads │ │ No workloads │ + │ ↓ route table │ │ No NAT instance │ + │ Private ENI │ │ No EIP │ + │ ↓ │ │ │ + │ NAT Instance │ │ Cost: ~$0.80/mo │ + │ ↓ │ │ (EBS only) │ + │ Public ENI + EIP │ │ │ + │ ↓ │ └──────────────────┘ │ Internet Gateway │ └──────────────────┘ + ▲ + EventBridge → Lambda (reconciler, concurrency=1) ``` ## Why nat-zero? -AWS NAT Gateway costs a minimum of ~\$36/month per AZ -- even if nothing is using it. fck-nat brings that down to ~\$7-8/month, but the instance and its public IP still run 24/7. - -**nat-zero takes it further.** When your private subnets are idle, there's no NAT instance running and no Elastic IP allocated. Your cost drops to the price of a stopped 2 GB EBS volume: about 80 cents a month. 
- -This matters most for: - -- **Dev and staging environments** that sit idle nights and weekends -- **CI/CD runners** that spin up for minutes, then disappear for hours -- **Batch and cron workloads** that run periodically -- **Side projects** where every dollar counts - -### Cost comparison (per AZ, per month) - | State | nat-zero | fck-nat | NAT Gateway | |-------|----------|---------|-------------| -| **Idle** (no workloads) | **~\$0.80** | ~\$7-8 | ~\$36+ | -| **Active** (workloads running) | ~\$7-8 | ~\$7-8 | ~\$36+ | +| **Idle** (no workloads) | **~$0.80/mo** | ~$7-8 | ~$36+ | +| **Active** (workloads running) | ~$7-8 | ~$7-8 | ~$36+ | + +AWS NAT Gateway costs ~$36/month per AZ even when idle. fck-nat brings that to ~$7-8/month, but the instance and EIP run 24/7. nat-zero releases the Elastic IP when idle, avoiding the [$3.60/month public IPv4 charge](https://aws.amazon.com/blogs/aws/new-aws-public-ipv4-address-charge-public-ip-insights/). -The key: nat-zero **releases the Elastic IP when idle**, avoiding the [\$3.60/month public IPv4 charge](https://aws.amazon.com/blogs/aws/new-aws-public-ipv4-address-charge-public-ip-insights/) that fck-nat and NAT Gateway pay around the clock. +Best for dev/staging environments, CI/CD runners, batch jobs, and side projects where workloads run intermittently. ## How it works -An EventBridge rule watches for EC2 instance state changes in your VPC. A Lambda function reacts to each event: +An EventBridge rule captures EC2 instance state changes. A Lambda function (concurrency=1, single writer) runs a **reconciliation loop** on each event: + +1. **Observe** — query workloads, NAT instances, and EIPs in the AZ +2. **Decide** — compare actual state to desired state +3. **Act** — take at most one mutating action, then return + +The event is just a trigger — the reconciler always computes the correct action from current state. With `reserved_concurrent_executions=1`, events are processed sequentially, eliminating race conditions. 
-- **Workload starts** in a private subnet -- Lambda creates (or restarts) a NAT instance in that AZ and attaches an Elastic IP -- **Last workload stops** in an AZ -- Lambda stops the NAT instance and releases the Elastic IP -- **NAT instance reaches "running"** -- Lambda attaches an EIP to the public ENI -- **NAT instance reaches "stopped"** -- Lambda detaches and releases the EIP +| Workloads? | NAT State | Action | +|------------|-----------|--------| +| Yes | None / terminated | Create NAT | +| Yes | Stopped | Start NAT | +| Yes | Stopping | Wait | +| Yes | Running, no EIP | Attach EIP | +| No | Running / pending | Stop NAT | +| No | Stopped, has EIP | Release EIP | +| — | Multiple NATs | Terminate duplicates | -Each NAT instance uses two persistent ENIs (public + private) pre-created by Terraform. They survive stop/start cycles, so route tables stay intact and there's no need to reconfigure anything when a NAT comes back. +Each NAT uses two persistent ENIs (public + private) created by Terraform. They survive stop/start cycles, keeping route tables intact. -See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for detailed event flows and sequence diagrams. +See [Architecture](docs/ARCHITECTURE.md) for the full reconciliation model and event flow diagrams. ## Quick start @@ -80,54 +73,32 @@ module "nat_zero" { private_route_table_ids = module.vpc.private_route_table_ids private_subnets_cidr_blocks = module.vpc.private_subnets_cidr_blocks - tags = { - Environment = "dev" - } + tags = { Environment = "dev" } } ``` -See [docs/EXAMPLES.md](docs/EXAMPLES.md) for complete working configurations including spot instances, custom AMIs, and building from source. +See [Examples](docs/EXAMPLES.md) for spot instances, custom AMIs, and building from source. ## Performance -The orchestrator Lambda is written in Go and compiled to a native ARM64 binary. 
It was rewritten from Python to eliminate cold start overhead -- init latency dropped from 667 ms to 55 ms, a **90% improvement**. Peak memory usage went from 98 MB down to 30 MB. - | Scenario | Time to connectivity | |----------|---------------------| -| First workload in AZ (cold create) | ~15 seconds | +| First workload (cold create) | ~10.7 s | +| Restart from stopped | ~8.5 s | | NAT already running | Instant | -| Restart from stopped | ~12 seconds | - -See [docs/PERFORMANCE.md](docs/PERFORMANCE.md) for detailed Lambda execution timings, instance type guidance, and cost breakdowns. - -## Tested against real infrastructure - -nat-zero isn't just unit-tested -- it's integration-tested against real AWS infrastructure on every PR. The test suite uses [Terratest](https://terratest.gruntwork.io/) to deploy the full module, launch workloads, verify NAT creation and connectivity, exercise scale-down and restart, then tear everything down cleanly. - -See [docs/TESTING.md](docs/TESTING.md) for phase-by-phase documentation. - -## When to use this module - -| Use case | nat-zero | fck-nat | NAT Gateway | -|----------|----------|---------|-------------| -| Dev/staging with intermittent workloads | **Best fit** | Wasteful | Very wasteful | -| Production 24/7 workloads | Overkill | **Best fit** | Simplest | -| Cost-sensitive environments | **Best fit** | Good | Expensive | -| Simplicity priority | More moving parts | **Simpler** | Simplest | - -**Use nat-zero** when your private subnet workloads run intermittently and you want to pay nothing when idle. -**Use fck-nat** when workloads run 24/7 and you want simplicity with ASG self-healing. +The Lambda is a compiled Go ARM64 binary. Cold start: 55 ms. Typical invocation: 400-600 ms. Peak memory: 29 MB. The startup delay is dominated by EC2 instance boot, not the Lambda. -**Use NAT Gateway** when you prioritize managed simplicity and availability over cost. 
+See [Performance](docs/PERFORMANCE.md) for detailed timings and cost breakdowns. -## Important notes +## Notes -- **EventBridge scope**: The rule captures all EC2 state changes in the account. The Lambda filters by VPC ID, so it only acts on instances in your target VPC. -- **Startup delay**: The first workload in an idle AZ waits ~15 seconds for internet. Design startup scripts to retry outbound connections -- most package managers already do. -- **Dual ENI**: Each AZ gets persistent public + private ENIs that survive instance stop/start cycles. -- **Dead letter queue**: Failed Lambda invocations go to an SQS DLQ for debugging. -- **Clean destroy**: A cleanup action terminates Lambda-created NAT instances before Terraform removes ENIs, ensuring clean `terraform destroy`. +- **EventBridge scope**: Captures all EC2 state changes in the account; Lambda filters by VPC ID. +- **Startup delay**: First workload in an idle AZ waits ~10 seconds for internet. Design scripts to retry outbound connections. +- **Dual ENI**: Persistent public + private ENIs survive stop/start cycles. +- **DLQ**: Failed Lambda invocations go to an SQS dead letter queue. +- **Clean destroy**: A cleanup action terminates NAT instances before `terraform destroy` removes ENIs. +- **Config versioning**: Changing AMI or instance type auto-replaces NAT instances on next workload event. ## Requirements @@ -220,7 +191,7 @@ No modules. ## Contributing -Contributions are welcome! Please open an issue or submit a pull request. +Contributions welcome. Please open an issue or submit a pull request. ## License diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md index ddc93ef..1c87e31 100644 --- a/docs/ARCHITECTURE.md +++ b/docs/ARCHITECTURE.md @@ -1,233 +1,146 @@ # Architecture -## High-Level Overview +## Overview -nat-zero takes a fundamentally different approach to NAT on AWS. 
Instead of running infrastructure around the clock, it treats NAT as a **reactive service**: infrastructure that exists only when something needs it. - -The module deploys an EventBridge rule that watches for EC2 state changes, and a Go Lambda that orchestrates NAT instance lifecycles in response. No polling, no cron jobs, no always-on compute -- just event-driven reactions to what's actually happening in your VPC. +nat-zero uses a **reconciliation pattern** to manage NAT instance lifecycles. A single Lambda function (concurrency=1) observes the current state of an AZ and takes one action to converge toward desired state, then returns. The next event picks up where this one left off. ``` - DATA PLANE - ┌──────────────────────────────────────────────────────────────────┐ - │ │ - │ Private Subnet NAT Instance Public Subnet │ - │ ┌─────────────┐ ┌───────────────────┐ ┌───────────────┐ │ - │ │ Workload │ │ Linux Kernel │ │ Public ENI │ │ - │ │ Instance │───>│ iptables │───>│ (ens5) │──>│── Internet - │ │ │ │ MASQUERADE │ │ + EIP │ │ Gateway - │ └─────────────┘ └───────────────────┘ └───────────────┘ │ - │ │ Private ENI (ens6) │ - │ └──────────────────┘ │ - │ route 0.0.0.0/0 │ - └──────────────────────────────────────────────────────────────────┘ - - CONTROL PLANE - ┌──────────────────────────────────────────────────────────────────┐ - │ │ - │ ┌──────────────────┐ ┌──────────────────┐ │ - │ │ EventBridge │───>│ Lambda Function │ │ - │ │ EC2 State Change │ │ nat-zero │ │ - │ └──────────────────┘ │ concurrency = 1 │ │ - │ └────────┬─────────┘ │ - │ │ │ - │ ┌──────────────┼──────────────┐ │ - │ │ │ │ │ - │ v v v │ - │ start/stop allocate/ on failure │ - │ NAT instance release EIP ┌─────────┐ │ - │ │ SQS DLQ │ │ - │ └─────────┘ │ - └──────────────────────────────────────────────────────────────────┘ + EventBridge (EC2 state changes) + │ + ▼ + ┌─────────────────────────┐ + │ Lambda (concurrency=1) │ + │ │ + │ 1. Resolve AZ │ + │ 2. Observe state │ + │ 3. 
Take one action │ + │ 4. Return │ + └─────────────────────────┘ + │ + ┌────┴────┐ + ▼ ▼ + EC2 API EIP API + (NATs) (allocate/release) ``` -## Reconciliation Model +## Reconciliation Loop -The Lambda uses a **reconciliation pattern** with **reserved concurrency of 1** (single writer). Every invocation performs the same observe-compare-act loop regardless of which event triggered it: +Every invocation runs the same loop regardless of which event triggered it: -1. **Resolve**: determine the AZ from the trigger instance (or sweep all AZs if the instance is gone) -2. **Observe**: query workloads, NAT instances, and EIPs for that AZ -3. **Decide**: compare actual state to desired state -4. **Act**: take at most ONE mutating action, then return +``` +reconcile(az): + workloads = pending/running non-NAT instances in AZ + nats = non-terminated NAT instances in AZ + eips = EIPs tagged for this AZ + needNAT = len(workloads) > 0 -The next event picks up where this one left off. No waiting, no polling, no retries. + # One action per invocation, then return +``` -### Why Single Writer Eliminates Races +### Decision Matrix -With `reserved_concurrent_executions = 1`, only one Lambda invocation runs at a time. Events that arrive during execution are queued by the Lambda service and processed sequentially. This means: +| Workloads? 
| NAT State | EIP State | Action | +|:----------:|-----------|-----------|--------| +| Yes | None / shutting-down | — | **Create** NAT | +| Yes | Stopped | — | **Start** NAT | +| Yes | Stopping | — | Wait (no-op) | +| Yes | Outdated config | — | **Terminate** NAT (recreate on next event) | +| Yes | Running | No EIP | **Allocate + attach** EIP | +| Yes | Running | Has EIP | Converged | +| No | Running / pending | — | **Stop** NAT | +| No | Stopped | Has EIP | **Release** EIP | +| No | Stopped | No EIP | Converged | +| No | Stopping | — | Wait (no-op) | +| — | Multiple NATs | — | **Terminate** duplicates | +| — | — | Multiple EIPs | **Release** extras | -- No duplicate NAT creation (only one invocation can call `RunInstances`) -- No double EIP allocation (only one invocation can call `AllocateAddress`) -- No start/stop overlap (only one invocation can modify the NAT state) -- No need for re-check loops or retry logic +### Why Single Writer -### Reconciliation Logic +`reserved_concurrent_executions = 1` means only one Lambda runs at a time. Events that arrive during execution are queued and processed sequentially. 
This eliminates: -``` -reconcile(az, vpc): - workloads = findWorkloads(az, vpc) # pending/running, excluding NAT + ignored - nats = findNATs(az, vpc) # pending/running/stopping/stopped - eips = findEIPs(az) # tagged for this AZ - - # Deduplicate NATs (safety net for pre-existing duplicates) - if len(nats) > 1: terminateDuplicateNATs(nats) - - needNAT = len(workloads) > 0 - - # --- NAT convergence (one action per invocation) --- - if needNAT: - if no NAT or NAT terminating: createNAT → return - if NAT has outdated config: terminateNAT → return - if NAT stopped: startNAT → return - if NAT stopping: return (wait for next event) - # NAT pending or running — good - else: - if NAT running or pending: stopNAT (Force) → return - # NAT stopping/stopped/nil — good - - # --- EIP convergence --- - if NAT running and no EIPs: allocateAndAttachEIP → return - if NAT not running and EIPs exist: releaseEIPs → return - if multiple EIPs: releaseExtras → return - - # Converged — no action needed -``` +- Duplicate NAT creation +- Double EIP allocation +- Start/stop race conditions +- Need for distributed locking ### Event Agnosticism -The reconciler does NOT care whether the trigger event came from a NAT instance or a workload. There is no classify step that branches on instance type. - -- **Workload `pending` event** → resolveAZ → reconcile → creates/starts NAT if needed -- **NAT `running` event** → resolveAZ → reconcile → attaches EIP if needed -- **Workload `terminated` event** → resolveAZ → reconcile → stops NAT if no workloads -- **NAT `stopped` event** → resolveAZ → reconcile → releases EIP if present -- **Instance gone from API** → sweep all configured AZs → reconcile each +The reconciler does not care what type of instance triggered the event. It observes all workloads and NATs in the AZ, computes desired state, and acts. The event is just a signal that "something changed." -The event is just a signal that "something changed in this AZ." 
The reconciler always computes the correct answer from current state. +- Workload `pending` → reconcile → creates NAT if needed +- NAT `running` → reconcile → attaches EIP if needed +- Workload `terminated` → reconcile → stops NAT if no workloads +- NAT `stopped` → reconcile → releases EIP if present +- Instance gone from API → sweep all configured AZs -## Event Flow +## Event Flows -### Scale-up: Workload starts, NAT created +### Scale-up ``` -1. Workload → pending - reconcile: workloads=1, NAT=nil → createNAT +Workload launches (pending) + → reconcile: workloads=1, NAT=nil → createNAT -2. NAT → pending - reconcile: workloads=1, NAT=pending, EIPs=0 → converged (NAT not yet running) +NAT reaches running + → reconcile: workloads=1, NAT=running, EIP=0 → allocateAndAttachEIP -3. NAT → running - reconcile: workloads=1, NAT=running, EIPs=0 → allocateAndAttachEIP - Result: NAT has internet via EIP ✓ - -4. Workload → running - reconcile: workloads=1, NAT=running, EIPs=1 → converged ✓ +Next event + → reconcile: workloads=1, NAT=running, EIP=1 → converged ✓ ``` -### Scale-down: Workload terminates, NAT stopped +### Scale-down ``` -1. Workload → shutting-down - reconcile: workloads=0, NAT=running → stopNAT (Force=true) +Last workload terminates + → reconcile: workloads=0, NAT=running → stopNAT -2. NAT → stopping - reconcile: workloads=0, NAT=stopping → converged (waiting for stopped) +NAT reaches stopped + → reconcile: workloads=0, NAT=stopped, EIP=1 → releaseEIP -3. NAT → stopped - reconcile: workloads=0, NAT=stopped, EIPs=1 → releaseEIPs - Result: NAT idle, no EIP charge ✓ +Next event + → reconcile: workloads=0, NAT=stopped, EIP=0 → converged ✓ ``` -### Restart: New workload starts, stopped NAT restarted +### Restart ``` -1. New workload → pending - reconcile: workloads=1, NAT=stopped → startNAT +New workload launches, NAT is stopped + → reconcile: workloads=1, NAT=stopped → startNAT -2. 
NAT → pending → running - reconcile: workloads=1, NAT=running, EIPs=0 → allocateAndAttachEIP - Result: NAT has internet via EIP ✓ +NAT reaches running + → reconcile: workloads=1, NAT=running, EIP=0 → allocateAndAttachEIP + → converged ✓ ``` -### Terraform destroy +### Terraform Destroy ``` Terraform invokes Lambda with {action: "cleanup"} -Action: find all NAT instances → terminate → release all EIPs -Result: clean state for ENI/SG destruction ✓ + → terminate all NAT instances + → wait for full termination (ENI detachment) + → release all EIPs + → return (Terraform proceeds to delete ENIs/SGs) ``` ## Dual ENI Architecture -Each NAT instance uses two Elastic Network Interfaces (ENIs) to separate public and private traffic. ENIs are pre-created by Terraform and attached via the launch template, so they persist across instance stop/start cycles. +Each NAT instance uses two ENIs to separate public and private traffic: ``` - Private Subnet NAT Instance Public Subnet - ┌──────────────────┐ ┌──────────────────────┐ ┌──────────────────────┐ - │ │ │ │ │ │ - │ Route Table │ │ ┌──────────────┐ │ │ │ - │ 0.0.0.0/0 ──────┼───>│ │ iptables │ │ │ │ - │ │ │ │ │ │ │ │ │ - │ v │ │ │ MASQUERADE │ │ │ │ - │ ┌────────────┐ │ │ │ on ens5 │ │ │ ┌────────────────┐ │ - │ │ Private ENI│ │ │ │ │───┼───>│ │ Public ENI │ │ - │ │ (ens6) │──┼───>│ │ FORWARD │ │ │ │ (ens5) │──┼──> Internet - │ │ │ │ │ │ ens6 → ens5 │ │ │ │ + EIP │ │ Gateway - │ │ No pub IP │ │ │ │ │ │ │ │ │ │ - │ │ src_dst=off│ │ │ │ RELATED, │ │ │ │ src_dst=off │ │ - │ └────────────┘ │ │ │ ESTABLISHED │ │ │ └────────────────┘ │ - │ │ │ └──────────────┘ │ │ │ - └──────────────────┘ └──────────────────────┘ └──────────────────────┘ + Private Subnet NAT Instance Public Subnet + ┌──────────────┐ ┌──────────────────┐ ┌──────────────────┐ + │ Route Table │ │ │ │ │ + │ 0.0.0.0/0 ───┼──→│ Private ENI │ │ Public ENI │ + │ │ │ (ens6) │ │ (ens5) + EIP │──→ IGW + │ │ │ ↓ iptables ──┼──→│ │ + │ │ │ MASQUERADE │ │ src_dst_check=off│ + 
└──────────────┘ └──────────────────┘ └──────────────────┘ ``` -Key design decisions: -- **Pre-created ENIs**: ENIs are Terraform-managed and referenced in the launch template. They survive instance stop/start, preserving route table entries. -- **source_dest_check=false**: Required on both ENIs for NAT to work (instance forwards packets not addressed to itself). -- **EIP lifecycle**: Elastic IPs are allocated when the NAT instance reaches "running" and released when it reaches "stopped", both managed by the reconciliation loop. This avoids charges for unused EIPs. - -## Comparison with fck-nat - -nat-zero builds on top of fck-nat -- it uses the same AMI and the same iptables-based NAT approach. The difference is the orchestration layer: instead of an always-on ASG, nat-zero uses event-driven Lambda to start and stop NAT instances on demand. +- **Pre-created by Terraform**: ENIs persist across stop/start cycles, keeping route tables intact +- **source_dest_check=false**: Required on both ENIs for NAT forwarding +- **EIP lifecycle**: Allocated on NAT running, released on NAT stopped — no charge when idle -``` - fck-nat (Always-On) nat-zero (Scale-to-Zero) - ┌────────────────────────────┐ ┌────────────────────────────────┐ - │ │ │ │ - │ ┌──────────────────────┐ │ │ ┌────────────┐ │ - │ │ Auto Scaling Group │ │ │ │ EventBridge │ │ - │ │ min=1, max=1 │ │ │ │ EC2 state │ │ - │ └──────────┬───────────┘ │ │ │ changes │ │ - │ │ │ │ └──────┬─────┘ │ - │ v │ │ │ │ - │ ┌──────────────────────┐ │ │ v │ - │ │ NAT Instance │ │ │ ┌────────────┐ │ - │ │ Always running │ │ │ │ Lambda │ │ - │ │ │ │ │ │ Reconciler │ │ - │ └──────────────────────┘ │ │ │ (conc = 1) │ │ - │ │ │ └──────┬─────┘ │ - │ Cost: ~$7-8/mo │ │ │ │ - │ (instance + EIP 24/7) │ │ v │ - │ Self-healing via ASG │ │ ┌────────────────────┐ │ - │ No Lambda needed │ │ │ NAT Instance │ │ - └────────────────────────────┘ │ │ Started on demand │ │ - │ │ Stopped when idle │ │ - │ └────────────────────┘ │ - │ │ - │ Cost: 
~$0.80/mo (idle) │ - │ EIP released when stopped │ - │ Zero IPv4 charge when idle │ - └────────────────────────────────┘ -``` +## Config Versioning -Costs per AZ, per month. Includes the [AWS public IPv4 charge](https://aws.amazon.com/blogs/aws/new-aws-public-ipv4-address-charge-public-ip-insights/) (\$3.60/mo per public IP, effective Feb 2024). - -| Aspect | fck-nat | nat-zero | -|--------|---------|-------------------| -| Architecture | ASG with min=1 | Lambda + EventBridge | -| Idle cost | ~\$7-8/mo (instance + EIP 24/7) | ~\$0.80/mo (EBS only, no EIP) | -| Active cost | ~\$7-8/mo | ~\$7-8/mo (same) | -| Public IPv4 charge | \$3.60/mo always | \$0 when idle (EIP released) | -| Scale-to-zero | No | Yes | -| Self-healing | ASG replaces unhealthy | Lambda creates new on demand | -| AMI | fck-nat AMI | fck-nat AMI (same) | -| Complexity | Low (ASG only) | Moderate (Lambda + EventBridge) | -| Best for | Production 24/7 | Dev/staging, intermittent workloads | +The Lambda tags each NAT instance with a `ConfigVersion` hash derived from AMI, instance type, market type, and volume size. When a workload event arrives and the existing NAT has an outdated hash, the reconciler terminates it. The next event creates a replacement with the current config. diff --git a/docs/INDEX.md b/docs/INDEX.md index 1666240..4dc0102 100644 --- a/docs/INDEX.md +++ b/docs/INDEX.md @@ -2,68 +2,9 @@ **Scale-to-zero NAT instances for AWS.** Stop paying for NAT when nothing is running. -nat-zero is a Terraform module that brings event-driven, scale-to-zero NAT to your AWS VPCs. When a workload starts in a private subnet, a NAT instance spins up automatically. When the last workload stops, the NAT shuts down and its Elastic IP is released. You pay nothing while idle -- just ~\$0.80/mo for a stopped EBS volume. +nat-zero is a Terraform module that replaces always-on NAT with on-demand NAT instances. When a workload launches in a private subnet, a NAT instance starts automatically. 
When the last workload stops, the NAT shuts down and its Elastic IP is released. Idle cost: ~$0.80/month per AZ. -Built on [fck-nat](https://fck-nat.dev/) AMIs. Orchestrated by a Go Lambda with a 55 ms cold start. Proven by real integration tests that deploy infrastructure and verify connectivity end-to-end. - -``` - CONTROL PLANE - ┌──────────────────────────────────────────────────┐ - │ EventBridge ──> Lambda (NAT Orchestrator) │ - │ │ start/stop instances │ - │ │ allocate/release EIPs │ - └────────────────────┼─────────────────────────────┘ - │ - ┌────────────┴────────────┐ - v v - AZ-A (active) AZ-B (idle) - ┌──────────────────┐ ┌──────────────────┐ - │ Workloads │ │ No workloads │ - │ ↓ route table │ │ No NAT instance │ - │ Private ENI │ │ No EIP │ - │ ↓ │ │ │ - │ NAT Instance │ │ Cost: ~$0.80/mo │ - │ ↓ │ │ (EBS only) │ - │ Public ENI + EIP │ │ │ - │ ↓ │ └──────────────────┘ - │ Internet Gateway │ - └──────────────────┘ -``` - -## Why nat-zero? - -AWS NAT Gateway costs a minimum of ~\$36/month per AZ -- even if nothing is using it. fck-nat brings that down to ~\$7-8/month, but the instance and its public IP still run 24/7. - -**nat-zero takes it further.** When your private subnets are idle, there's no NAT instance running and no Elastic IP allocated. Your cost drops to the price of a stopped 2 GB EBS volume: about 80 cents a month. 
- -This matters most for: - -- **Dev and staging environments** that sit idle nights and weekends -- **CI/CD runners** that spin up for minutes, then disappear for hours -- **Batch and cron workloads** that run periodically -- **Side projects** where every dollar counts - -### Cost comparison (per AZ, per month) - -| State | nat-zero | fck-nat | NAT Gateway | -|-------|----------|---------|-------------| -| **Idle** (no workloads) | **~\$0.80** | ~\$7-8 | ~\$36+ | -| **Active** (workloads running) | ~\$7-8 | ~\$7-8 | ~\$36+ | - -The key: nat-zero **releases the Elastic IP when idle**, avoiding the [\$3.60/month public IPv4 charge](https://aws.amazon.com/blogs/aws/new-aws-public-ipv4-address-charge-public-ip-insights/) that fck-nat and NAT Gateway pay around the clock. - -## How it works - -An EventBridge rule watches for EC2 instance state changes in your VPC. A Lambda function reacts to each event: - -- **Workload starts** in a private subnet -- Lambda creates (or restarts) a NAT instance in that AZ and attaches an Elastic IP -- **Last workload stops** in an AZ -- Lambda stops the NAT instance and releases the Elastic IP -- **NAT instance reaches "running"** -- Lambda attaches an EIP to the public ENI -- **NAT instance reaches "stopped"** -- Lambda detaches and releases the EIP - -Each NAT instance uses two persistent ENIs (public + private) pre-created by Terraform. They survive stop/start cycles, so route tables stay intact and there's no need to reconfigure anything when a NAT comes back. - -See [Architecture](ARCHITECTURE.md) for detailed event flows and sequence diagrams. +Built on [fck-nat](https://fck-nat.dev/) AMIs. Orchestrated by a single Go Lambda (~55 ms cold start, 29 MB memory). Integration-tested against real AWS infrastructure on every PR. 
## Quick start @@ -79,70 +20,20 @@ module "nat_zero" { private_route_table_ids = module.vpc.private_route_table_ids private_subnets_cidr_blocks = module.vpc.private_subnets_cidr_blocks - - tags = { - Environment = "dev" - } } ``` -See [Examples](EXAMPLES.md) for complete working configurations including spot instances, custom AMIs, and building from source. +## Cost comparison (per AZ, per month) -## Performance - -The orchestrator Lambda is written in Go and compiled to a native ARM64 binary. It was rewritten from Python to eliminate cold start overhead -- init latency dropped from 667 ms to 55 ms, a **90% improvement**. Peak memory usage went from 98 MB down to 30 MB. - -| Scenario | Time to connectivity | -|----------|---------------------| -| First workload in AZ (cold create) | ~15 seconds | -| NAT already running | Instant | -| Restart from stopped | ~12 seconds | - -The ~15 second cold-create time is dominated by EC2 instance boot and fck-nat AMI configuration -- not the Lambda. Subsequent workloads in the same AZ get connectivity immediately since the route table already points to the running NAT. - -See [Performance](PERFORMANCE.md) for detailed Lambda execution timings, instance type guidance, and cost breakdowns. - -## Tested against real infrastructure - -nat-zero isn't just unit-tested -- it's integration-tested against real AWS infrastructure on every PR. The test suite uses [Terratest](https://terratest.gruntwork.io/) to: - -1. Deploy the full module (Lambda, EventBridge, ENIs, security groups, launch templates) -2. Launch a workload instance and verify NAT creation with EIP -3. Verify the workload's egress IP matches the NAT's Elastic IP -4. Terminate the workload and verify NAT scale-down and EIP release -5. Launch a new workload and verify NAT restart -6. Run the cleanup action and verify all resources are removed -7. Tear down everything with `terraform destroy` - -The full lifecycle takes about 5 minutes in CI. 
See [Testing](TESTING.md) for phase-by-phase documentation. - -## When to use this module - -| Use case | nat-zero | fck-nat | NAT Gateway | -|----------|----------|---------|-------------| -| Dev/staging with intermittent workloads | **Best fit** | Wasteful | Very wasteful | -| Production 24/7 workloads | Overkill | **Best fit** | Simplest | -| Cost-sensitive environments | **Best fit** | Good | Expensive | -| Simplicity priority | More moving parts | **Simpler** | Simplest | - -**Use nat-zero** when your private subnet workloads run intermittently and you want to pay nothing when idle. - -**Use fck-nat** when workloads run 24/7 and you want simplicity with ASG self-healing. - -**Use NAT Gateway** when you prioritize managed simplicity and availability over cost. - -## Important notes - -- **EventBridge scope**: The rule captures all EC2 state changes in the account. The Lambda filters by VPC ID, so it only acts on instances in your target VPC. -- **Startup delay**: The first workload in an idle AZ waits ~15 seconds for internet. Design startup scripts to retry outbound connections -- most package managers already do. -- **Dual ENI**: Each AZ gets persistent public + private ENIs that survive instance stop/start cycles. -- **Dead letter queue**: Failed Lambda invocations go to an SQS DLQ for debugging. -- **Clean destroy**: A cleanup action terminates Lambda-created NAT instances before Terraform removes ENIs, ensuring clean `terraform destroy`. - -## Contributing - -Contributions are welcome! Please open an issue or submit a pull request. 
+| State | nat-zero | fck-nat | NAT Gateway | +|-------|----------|---------|-------------| +| **Idle** (no workloads) | **~$0.80** | ~$7-8 | ~$36+ | +| **Active** (workloads running) | ~$7-8 | ~$7-8 | ~$36+ | -## License +## Learn more -MIT +- [Architecture](ARCHITECTURE.md) — reconciliation model, decision matrix, event flows +- [Performance](PERFORMANCE.md) — startup latency, Lambda execution times, cost breakdowns +- [Examples](EXAMPLES.md) — spot instances, custom AMIs, building from source +- [Terraform Reference](REFERENCE.md) — inputs, outputs, resources +- [Testing](TESTING.md) — integration test lifecycle and CI diff --git a/docs/PERFORMANCE.md b/docs/PERFORMANCE.md index 9280d50..f7a5816 100644 --- a/docs/PERFORMANCE.md +++ b/docs/PERFORMANCE.md @@ -1,159 +1,98 @@ # Performance and Cost -nat-zero's orchestrator Lambda was rewritten from Python 3.11 to Go, compiled to a native ARM64 binary running on the `provided.al2023` runtime. The result: **90% faster cold starts**, 69% less memory, and faster end-to-end execution. All measurements below are from real integration tests running in us-east-1 with `t4g.nano` instances. +All measurements from real integration tests in us-east-1 with `t4g.nano` instances and 128 MB Lambda memory. ## Startup Latency -**First workload in an AZ — NAT created from scratch: ~15 seconds to connectivity.** +| Scenario | Time to connectivity | +|----------|---------------------| +| First workload (cold create) | **~10.7 s** | +| Restart from stopped | **~8.5 s** | +| NAT already running | **Instant** | + +### Cold create breakdown ``` - 0.0 s Workload instance enters "running" state - 0.3 s EventBridge delivers workload event to Lambda - 0.4 s Lambda cold start completes (55-67 ms init) - 0.9 s Lambda classifies instance, checks for existing NAT - 2.3 s Lambda calls RunInstances — NAT instance is now "pending" - Lambda returns. EIP will be attached separately via EventBridge. 
- -~12 s NAT instance reaches "running" state - (fck-nat AMI boots, configures iptables, attaches ENIs) -~12.3 s EventBridge delivers NAT "running" event to Lambda -~12.5 s Lambda allocates EIP and associates to public ENI - -~15 s Workload can reach the internet via NAT + 0.0 s Workload enters "pending" + 0.3 s EventBridge delivers event + 0.4 s Lambda cold start (55 ms init) + 0.9 s Reconcile: observe state, decide to create NAT + 2.3 s RunInstances returns — NAT is "pending" + Lambda returns. + +~8.0 s NAT reaches "running" (EC2 boot + fck-nat config) +~8.3 s EventBridge delivers NAT "running" event +~8.9 s Lambda: allocate EIP + associate (~3 s) + +~10.7 s Workload can reach the internet ``` -The ~10 second gap between `RunInstances` and NAT reaching "running" is spent on EC2 placement (~2-3 s), OS boot (~3-4 s), and fck-nat network configuration (~2-3 s). This is consistent across all instance types tested — the bottleneck is EC2's instance lifecycle, not CPU or memory. - -### Restart from stopped state: ~12 seconds +The ~8 second gap is EC2 instance lifecycle (placement, OS boot, iptables config) — not the Lambda. -When a stopped NAT is restarted (new workload arrives after previous scale-down): +### Restart breakdown ``` - 0.0 s New workload enters "running" state - 0.3 s EventBridge delivers workload event to Lambda - 0.4 s Lambda classifies, finds stopped NAT → calls StartInstances - Lambda returns. + 0.0 s New workload enters "pending" + 0.4 s Lambda finds stopped NAT → StartInstances + Lambda returns. 
-~10 s NAT instance reaches "running" state (reboot from stopped) -~10.3 s EventBridge delivers NAT "running" event to Lambda -~10.5 s Lambda allocates EIP and associates to public ENI +~6.0 s NAT reaches "running" (faster than cold create) +~6.3 s Lambda: allocate EIP + associate -~12 s Workload can reach the internet via NAT +~8.5 s Workload can reach the internet ``` -Restart is ~3 seconds faster than cold create because `StartInstances` is faster than `RunInstances` and skips AMI/launch template resolution. - -### NAT already running: instant +Restart is ~2 seconds faster — `StartInstances` skips AMI resolution and launch template processing. -If a NAT is already running in the AZ (e.g. second workload starts), no action is needed. The route table already points to the NAT's private ENI, so connectivity is immediate. +## Lambda Execution -### Summary table +| Metric | Value | +|--------|-------| +| Cold start (Init Duration) | 55 ms | +| Typical invocation | 400-600 ms | +| EIP allocation + association | ~3 s | +| Peak memory | 29-30 MB | +| Lambda memory allocation | 128 MB | -| Scenario | Lambda Duration | Time to NAT Running + EIP | Time to Connectivity | -|----------|-----------------|--------------------------|---------------------| -| First workload (cold create) | ~2 s | ~12 s | **~15 s** | -| NAT already running | — | — | **0 s** | -| Restart from stopped | ~0.5 s | ~10 s | **~12 s** | -| Config outdated (replace) | ~60+ s | ~12 s | **~70 s** | +The Lambda is a compiled Go ARM64 binary on `provided.al2023`. No interpreter, no framework — just direct AWS SDK calls. ## Scale-Down Timing -When the last workload in an AZ stops or terminates: - ``` - 0.0 s Last workload enters "shutting-down" state - 0.3 s EventBridge delivers workload event to Lambda - 0.4 s Lambda classifies, finds NAT, checks for sibling workloads - 4.5 s No siblings after 3 retries (2 s apart) → calls StopInstances - Lambda returns. 
+ 0.0 s Last workload enters "shutting-down" + 0.3 s EventBridge delivers event + 0.5 s Lambda: reconcile → workloads=0, NAT running → stopNAT + Lambda returns. -~15 s NAT instance reaches "stopped" state -~15.3 s EventBridge delivers NAT "stopped" event to Lambda -~15.5 s Lambda disassociates and releases EIP +~10 s NAT reaches "stopped" +~10.3 s EventBridge delivers NAT "stopped" event +~10.5 s Lambda: release EIP -~16 s EIP released, no IPv4 charge +~11 s EIP released, no IPv4 charge ``` -The 3x retry with 2-second delays (~4 seconds total) is a safety margin to prevent flapping when instances are being replaced. The Lambda only checks for `pending` or `running` siblings — stopping or terminated instances don't count. - -## Lambda Execution - -The Lambda is a compiled Go binary on the `provided.al2023` runtime with 256 MB memory. - -| Metric | Duration | Notes | -|--------|----------|-------| -| Cold start (Init Duration) | 55-67 ms | Go binary; no interpreter overhead | -| classify (DescribeInstances) | 100-700 ms | Single API call; varies with API latency | -| findNAT (DescribeInstances) | 65-100 ms | Filter by tag + AZ + VPC | -| resolveAMI (DescribeImages) | 60-120 ms | Sorts by creation date | -| resolveLT (DescribeLaunchTemplates) | 70-100 ms | Filter by AZ + VPC tags | -| RunInstances | 1.2-1.6 s | AWS API latency | -| attachEIP (Allocate + Associate) | 150-300 ms | Includes idempotency check | -| detachEIP (Disassociate + Release) | 100-200 ms | Includes idempotency check | -| **Scale-up handler total** | **~2 s** | classify + findNAT + createNAT | -| **Scale-down handler total** | **~5 s** | classify + findNAT + 3x findSiblings + stopNAT | -| **attachEIP handler total** | **~0.5 s** | classify + waitForState + attachEIP | -| **detachEIP handler total** | **~0.5 s** | classify + waitForState + detachEIP | - -### Why Go? - -The original Lambda was written in Python 3.11. 
It worked, but Python's interpreter overhead meant a 667 ms cold start and 98 MB memory footprint -- meaningful for a function that might be invoked dozens of times during a busy scaling period. - -Rewriting in Go and compiling to a native binary eliminated the interpreter entirely: - -| Metric | Python 3.11 (128 MB) | Go (256 MB) | Improvement | -|--------|----------------------|-------------|-------------| -| Cold start | 667 ms | 55-67 ms | **~90% faster** | -| Handler total (scale-up) | 2,439 ms | ~2,000 ms | **~18% faster** | -| Max memory used | 98 MB | 30 MB | **69% less** | - -The Go binary is ~4 MB, boots in under 70 ms, and the entire scale-up path completes in about 2 seconds. For a Lambda that runs on every EC2 state change in your account, that matters. - -## What This Means for Your Workloads - -- **First workload takes ~15 seconds to get internet.** Design startup scripts to retry outbound connections (e.g. `apt update`, `pip install`, `curl`). Most package managers already retry. -- **Subsequent workloads are instant.** Once a NAT is running in an AZ, the route table already points to it. -- **Restart after idle is ~12 seconds.** If your workloads run sporadically (CI jobs, cron tasks), expect a ~12 second delay when the first job starts after an idle period. -- **Scale-down is conservative.** The Lambda waits 6 seconds (3 retries) before stopping a NAT, preventing flapping during instance replacements. -- **Instance type doesn't affect startup time.** The ~10 second EC2 boot time is the same for `t4g.nano` and `c7gn.medium`. - ## Cost -Per AZ, per month. All prices are us-east-1 on-demand. Includes the [AWS public IPv4 charge](https://aws.amazon.com/blogs/aws/new-aws-public-ipv4-address-charge-public-ip-insights/) (\$0.005/hr per public IP). - -### Idle vs active +Per AZ, per month. us-east-1 on-demand prices. Includes the [$3.60/month public IPv4 charge](https://aws.amazon.com/blogs/aws/new-aws-public-ipv4-address-charge-public-ip-insights/). 
| State | nat-zero | fck-nat | NAT Gateway | |-------|----------|---------|-------------| -| **Idle** (no workloads) | **~\$0.80** | ~\$7-8 | ~\$36+ | -| **Active** (workloads running) | ~\$7-8 | ~\$7-8 | ~\$36+ | - -**Idle breakdown**: EBS volume only (~\$0.80/mo for 2 GB gp3). No instance running, no EIP allocated. - -**Active breakdown**: t4g.nano instance (\$3.07/mo) + EIP (\$3.60/mo) + EBS (\$0.80/mo) = ~\$7.50/mo. - -The key difference: nat-zero **releases the EIP when idle**, saving the \$3.60/mo public IPv4 charge that fck-nat and NAT Gateway pay 24/7. - -### Instance type options - -| Instance Type | vCPUs | RAM | Network | \$/hour | \$/month (24x7) | \$/month (12hr/day) | -|---------------|-------|-----|---------|--------|---------------|-------------------| -| **t4g.nano** (default) | 2 | 0.5 GiB | Up to 5 Gbps | \$0.0042 | \$3.07 | \$1.53 | -| t4g.micro | 2 | 1 GiB | Up to 5 Gbps | \$0.0084 | \$6.13 | \$3.07 | -| t4g.small | 2 | 2 GiB | Up to 5 Gbps | \$0.0168 | \$12.26 | \$6.13 | -| c7gn.medium | 1 | 2 GiB | Up to 25 Gbps | \$0.0624 | \$45.55 | \$22.78 | +| **Idle** | **~$0.80** | ~$7-8 | ~$36+ | +| **Active** | ~$7-8 | ~$7-8 | ~$36+ | -Spot pricing typically offers 60-70% savings on t4g instances. Use `market_type = "spot"` to enable. +**Idle**: EBS volume only (~$0.80 for 2 GB gp3). No instance, no EIP. -### Choosing an instance type +**Active**: t4g.nano ($3.07) + EIP ($3.60) + EBS ($0.80) = ~$7.50. -**t4g.nano** (default) is right for most workloads: -- Handles typical dev/staging NAT traffic -- Burstable up to 5 Gbps with CPU credits -- \$3/month on-demand, ~\$1/month on spot +### Instance types -**t4g.micro / t4g.small** — consider if you need sustained throughput beyond t4g.nano's baseline or workloads transfer large volumes consistently. 
+| Type | Network | $/month (24x7) | $/month (12hr/day) | +|------|---------|:--------------:|:------------------:| +| **t4g.nano** (default) | Up to 5 Gbps | $3.07 | $1.53 | +| t4g.micro | Up to 5 Gbps | $6.13 | $3.07 | +| t4g.small | Up to 5 Gbps | $12.26 | $6.13 | +| c7gn.medium | Up to 25 Gbps | $45.55 | $22.78 | -**c7gn.medium** — consider if you need consistently high network throughput (up to 25 Gbps). At \$45/month it's still cheaper than NAT Gateway for most data transfer patterns. +Spot pricing typically offers 60-70% savings. Use `market_type = "spot"`. -Instance type does **not** affect startup time (~12 s regardless), only maximum sustained throughput and monthly cost. +**t4g.nano** handles typical dev/staging traffic. Instance type does not affect startup time — the bottleneck is EC2 lifecycle, not CPU. diff --git a/docs/TESTING.md b/docs/TESTING.md index 61d1f28..8ce0bec 100644 --- a/docs/TESTING.md +++ b/docs/TESTING.md @@ -1,169 +1,67 @@ -# Integration Tests +# Testing -nat-zero is tested against real AWS infrastructure, not mocks. The integration test suite deploys the full module into a live AWS account, launches actual EC2 workloads, verifies that NAT instances come up with working internet connectivity, exercises scale-down and restart, then tears everything down cleanly. +nat-zero is integration-tested against real AWS infrastructure on every PR. The test deploys the full module, exercises the complete NAT lifecycle, then tears everything down. -These tests run in CI on every PR (triggered by adding the `integration-test` label) and take about 5 minutes end-to-end. They use [Terratest](https://terratest.gruntwork.io/) (Go) and run against `us-east-1`. 
+## Running Tests -## Test Fixture - -The Terraform fixture at `tests/integration/fixture/main.tf` creates: - -- A **private subnet** (`172.31.128.0/24`) in the account's default VPC -- A **route table** and association for that subnet -- The **nat_zero module** (`name = "nat-test"`) wired to the private subnet and a default public subnet - -All module resources (Lambda, EventBridge, ENIs, security groups, launch templates, IAM roles) are created inside the module. - -## TestNatZero - -A single test that exercises the full NAT lifecycle in four phases using subtests, with one `terraform apply` / `destroy` cycle. Each phase records wall-clock timing for the [TIMING SUMMARY](#timing-summary) printed at the end of the test. - -### Setup - -1. **Create workload IAM profile** — An IAM role/profile (`nat-test-wl-tt-`) is created that allows the workload instance to call `ec2:CreateTags` on itself. This lets the user-data script tag the instance with its egress IP. The profile is deferred for deletion at the end of the test. - -2. **Terraform apply** — Runs `terraform init` and `terraform apply` on the fixture. This creates the private subnet, route table, and the entire nat_zero module (Lambda, EventBridge rule, ENIs, security groups, launch template, IAM roles). `terraform destroy` is deferred for cleanup. - -3. **Read Terraform outputs** — Captures `vpc_id`, `private_subnet_id`, and `lambda_function_name` from the Terraform state. - -4. **Register cleanup handlers** — Defers workload instance termination and a Lambda log dumper that prints CloudWatch logs if the test fails. - -### Phase 1: NATCreationAndConnectivity - -Verifies the scale-up path: workload starts, NAT comes up with an EIP, workload reaches the internet through the NAT. - -1. **Launch workload instance** — Launches a `t4g.nano` EC2 instance in the private subnet with a user-data script. 
The script retries `curl https://checkip.amazonaws.com` every 2 seconds until the NAT provides internet, then tags the instance with `EgressIP=`. - -2. **Invoke Lambda** — Calls the Lambda with `{"instance_id": "", "state": "running"}`, bypassing EventBridge for reliability. This triggers `createNAT` (RunInstances). - -3. **Wait for NAT with EIP** — Polls every 2 seconds for a NAT instance that is running with an EIP on its public ENI (device index 0). The EIP is attached by a separate Lambda invocation triggered by the NAT's "running" EventBridge event. - -4. **Validate NAT configuration** — Asserts: - - NAT has the `nat-zero:managed=true` tag - - NAT has dual ENIs at device index 0 (public) and 1 (private) - - A `0.0.0.0/0` route exists pointing to the NAT's private ENI - -5. **Verify workload connectivity** — Polls for the workload's `EgressIP` tag. Asserts the egress IP matches the NAT's EIP. - -### Phase 2: NATScaleDown - -Verifies the scale-down path: workload terminates, NAT stops, EIP is released. - -1. **Terminate workload** — Terminates the Phase 1 workload and waits for termination. - -2. **Invoke Lambda (scale-down)** — Calls the Lambda with `{"instance_id": "", "state": "terminated"}`. This triggers `maybeStopNAT` → 3x sibling check → `stopNAT` (StopInstances). - -3. **Wait for NAT stopped** — Polls until the NAT reaches `stopped` state. - -4. **Invoke Lambda (detach EIP)** — Calls the Lambda with `{"instance_id": "", "state": "stopped"}` to simulate the EventBridge event. This triggers `detachEIP` → DisassociateAddress + ReleaseAddress. - -5. **Verify EIP released** — Polls until no EIPs tagged `nat-zero:managed=true` remain. - -### Phase 3: NATRestart - -Verifies the restart path: new workload starts, stopped NAT is restarted with a new EIP, workload gets connectivity. - -1. **Launch new workload** — New `t4g.nano` in the private subnet. - -2. **Invoke Lambda (restart)** — Calls the Lambda with `{"instance_id": "", "state": "running"}`. 
This triggers `ensureNAT` → finds stopped NAT → `startNAT` (StartInstances). - -3. **Wait for NAT with EIP** — Polls until the NAT is running with a new EIP (attached via EventBridge). - -4. **Verify connectivity** — Polls for the new workload's `EgressIP` tag and confirms internet access. +```bash +# Unit tests (Lambda logic) +cd cmd/lambda && go test -v -race ./... -### Phase 4: CleanupAction +# Integration tests (requires AWS credentials) +cd tests/integration && go test -v -timeout 30m +``` -Verifies the destroy-time cleanup action works correctly. +Integration tests require AWS credentials with permissions to manage EC2, IAM, Lambda, EventBridge, and CloudWatch resources. -1. **Count EIPs** — Asserts at least one NAT EIP exists before cleanup. +## Integration Test Lifecycle -2. **Invoke cleanup** — Calls the Lambda with `{"action": "cleanup"}`. The Lambda terminates all NAT instances and releases all EIPs. +The test uses [Terratest](https://terratest.gruntwork.io/) with a single `terraform apply` / `destroy` cycle and four phases: -3. **Verify resources cleaned** — Polls until no running NAT instances and no NAT EIPs remain. +### Phase 1: NAT Creation and Connectivity -### Teardown (deferred, runs in LIFO order) +1. Deploy fixture (private subnet + nat-zero module in default VPC) +2. Launch workload instance in private subnet +3. Invoke Lambda → creates NAT instance +4. Wait for NAT running with EIP attached +5. Verify workload's egress IP matches NAT's EIP -1. Lambda log dump (only on failure) -2. Terminate test workload instances and wait -3. `terraform destroy` — removes all Terraform-managed resources -4. Delete workload IAM profile +### Phase 2: Scale-Down -## Timing Summary +1. Terminate workload +2. Invoke Lambda → stops NAT +3. Wait for NAT stopped +4. Invoke Lambda → releases EIP +5. 
Verify no EIPs remain -The test prints a timing summary at the end showing wall-clock duration of each phase: +### Phase 3: Restart -``` -=== TIMING SUMMARY === - PHASE DURATION - ------------------------------------------------------------ - IAM profile creation 1.234s - Terraform init+apply 45.678s - Launch workload instance 0.890s - Lambda invoke (scale-up) 2.345s - Wait for NAT running with EIP 14.567s - Wait for workload egress IP 25.890s - Terminate workload instance 30.123s - Lambda invoke (scale-down) 5.456s - Wait for NAT stopped 45.678s - Lambda invoke (detach EIP) 1.234s - Wait for EIP released 2.345s - Launch workload instance (restart) 0.890s - Lambda invoke (restart) 0.567s - Wait for NAT restarted with EIP 12.345s - Wait for workload egress IP (restart) 20.123s - Lambda invoke (cleanup) 45.678s - Wait for NAT terminated 5.678s - Wait for EIPs released 1.234s - Terraform destroy 60.123s - ------------------------------------------------------------ - TOTAL 5m15.678s -=== END TIMING SUMMARY === -``` +1. Launch new workload +2. Invoke Lambda → restarts stopped NAT +3. Wait for NAT running with new EIP +4. Verify connectivity -Key timings to watch: -- **Wait for NAT running with EIP**: How long from Lambda invocation to NAT with internet (cold create). Expect ~14 s. -- **Wait for NAT restarted with EIP**: Same metric for restart path. Expect ~12 s. -- **Lambda invoke (scale-down)**: Includes the 3x sibling retry (~4 s). Expect ~5 s. +### Phase 4: Cleanup Action -## TestNoOrphanedResources +1. Invoke Lambda with `{action: "cleanup"}` +2. Verify all NAT instances terminated and EIPs released -Runs after the main test. Searches for AWS resources with the `nat-test` prefix that were left behind by failed test runs. 
Checks for: +### Teardown -- Subnet with test CIDR (`172.31.128.0/24`) -- ENIs, security groups, and launch templates named `nat-test-*` -- EventBridge rules named `nat-test-*` -- Lambda function `nat-test-nat-zero` -- CloudWatch log group `/aws/lambda/nat-test-*` -- IAM roles and instance profiles prefixed `nat-test` -- EIPs tagged `nat-zero:managed=true` +`terraform destroy` removes all Terraform-managed resources. The cleanup action (Phase 4) ensures Lambda-created NAT instances are terminated first, so ENI deletion succeeds. -If any are found, the test fails and lists them for manual cleanup. +## CI -## Why the Cleanup Action Matters +Integration tests run in GitHub Actions when the `integration-test` label is added to a PR. They use OIDC to assume an AWS role in a dedicated test account. -NAT instances and EIPs are created by the Lambda at runtime, not by Terraform. During `terraform destroy`, Terraform doesn't know these exist. Without the cleanup action: +- Concurrency: one test at a time (`cancel-in-progress: false`) +- Timeout: 15 minutes +- Region: us-east-1 -1. `terraform destroy` tries to delete ENIs -2. ENIs are still attached to running NAT instances -3. Deletion fails, leaving the entire stack half-destroyed +## Orphan Detection -The `aws_lambda_invocation.cleanup` resource invokes the Lambda with `{"action": "cleanup"}` during destroy, which terminates instances and releases EIPs before Terraform tries to remove ENIs and security groups. +`TestNoOrphanedResources` runs after the main test and checks for leftover AWS resources with the `nat-test` prefix (subnets, ENIs, security groups, Lambda functions, IAM roles, EIPs). If any are found, it fails and lists them for manual cleanup. ## Config Version Replacement -The Lambda tracks a `CONFIG_VERSION` hash (derived from AMI, instance type, market type, and volume size). When a workload scales up and the existing NAT has an outdated `ConfigVersion` tag, the Lambda: - -1. 
Terminates the outdated NAT instance -2. Waits for the ENIs to become available -3. Creates a new NAT instance with the current config - -This ensures AMI or instance type changes propagate to NAT instances without manual intervention. - -## Running Locally - -```bash -cd nat-zero/tests/integration -go test -v -timeout 30m -``` - -Requires AWS credentials with permissions to create/destroy all resources in the fixture (EC2, IAM, Lambda, EventBridge, CloudWatch). +The Lambda tags NAT instances with a `ConfigVersion` hash (AMI + instance type + market type + volume size). When the config changes and a workload triggers reconciliation, the Lambda terminates the outdated NAT and creates a replacement. The integration test doesn't exercise this path directly, but it's covered by unit tests. diff --git a/mkdocs.yml b/mkdocs.yml index b008e3f..7947405 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -9,6 +9,6 @@ nav: - Home: INDEX.md - Architecture: ARCHITECTURE.md - Performance: PERFORMANCE.md + - Examples: EXAMPLES.md - Terraform Reference: REFERENCE.md - Testing: TESTING.md - - Examples: EXAMPLES.md From cfd47c9f693b5b521220728d674f76407915955f Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Thu, 26 Feb 2026 10:24:43 +1000 Subject: [PATCH 21/30] fix: increase Lambda timeout from 30s to 60s The cleanup action's waitForTermination can exceed 30s when polling for NAT instance termination during terraform destroy. This caused Sandbox.Timedout errors and left orphaned resources. Co-Authored-By: Claude Opus 4.5 --- lambda.tf | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lambda.tf b/lambda.tf index 96d311f..55ba82b 100644 --- a/lambda.tf +++ b/lambda.tf @@ -59,7 +59,7 @@ resource "aws_lambda_function" "nat_zero" { runtime = "provided.al2023" source_code_hash = fileexists("${path.module}/.build/lambda.zip") ? 
filebase64sha256("${path.module}/.build/lambda.zip") : null architectures = ["arm64"] - timeout = 30 + timeout = 60 reserved_concurrent_executions = 1 memory_size = var.lambda_memory_size tags = local.common_tags From 7f457a17456b70f827f815b598edebffd1839394 Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Thu, 26 Feb 2026 10:39:07 +1000 Subject: [PATCH 22/30] fix: terminate workloads before cleanup in integration test MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The test's Phase 4 (CleanupAction) invoked the cleanup Lambda while the Phase 3 workload was still running. When cleanup terminated the NAT, EventBridge delivered the terminated event, and the reconciler saw workloads=1 and created a new NAT. This zombie NAT then caused terraform destroy's cleanup invocation to timeout waiting for termination. Fix: terminate all test workloads before invoking cleanup, matching the production destroy ordering where Terraform deletes the EventBridge target (stopping new events) before running cleanup. Also increase Lambda timeout to 120s — the cleanup path with waitForTermination genuinely needs >30s (observed 32s in CI). Co-Authored-By: Claude Opus 4.5 --- lambda.tf | 2 +- tests/integration/nat_zero_test.go | 31 ++++++++++++++++++++++++++++++ 2 files changed, 32 insertions(+), 1 deletion(-) diff --git a/lambda.tf b/lambda.tf index 55ba82b..7baebe1 100644 --- a/lambda.tf +++ b/lambda.tf @@ -59,7 +59,7 @@ resource "aws_lambda_function" "nat_zero" { runtime = "provided.al2023" source_code_hash = fileexists("${path.module}/.build/lambda.zip") ? 
filebase64sha256("${path.module}/.build/lambda.zip") : null architectures = ["arm64"] - timeout = 60 + timeout = 120 reserved_concurrent_executions = 1 memory_size = var.lambda_memory_size tags = local.common_tags diff --git a/tests/integration/nat_zero_test.go b/tests/integration/nat_zero_test.go index a96fc19..ec42ecb 100644 --- a/tests/integration/nat_zero_test.go +++ b/tests/integration/nat_zero_test.go @@ -369,6 +369,37 @@ func TestNatZero(t *testing.T) { // ── Phase 4: Cleanup action ───────────────────────────────────────── t.Run("CleanupAction", func(t *testing.T) { + // Terminate all test workloads before cleanup to match production + // destroy ordering. In production, Terraform deletes the EventBridge + // target before invoking cleanup, so no new events fire. In the test, + // EventBridge is still active — if workloads are running when cleanup + // terminates the NAT, the terminated-event triggers reconcile which + // sees workloads and creates a new NAT. + termWlStart := time.Now() + wlOut, err := ec2Client.DescribeInstances(&ec2.DescribeInstancesInput{ + Filters: []*ec2.Filter{ + {Name: aws.String(fmt.Sprintf("tag:%s", testTagKey)), Values: []*string{aws.String(runID)}}, + {Name: aws.String("instance-state-name"), Values: []*string{ + aws.String("pending"), aws.String("running"), + aws.String("stopping"), aws.String("stopped"), + }}, + }, + }) + require.NoError(t, err) + var wlIDs []*string + for _, r := range wlOut.Reservations { + for _, i := range r.Instances { + wlIDs = append(wlIDs, i.InstanceId) + } + } + if len(wlIDs) > 0 { + t.Logf("Terminating %d workload(s) before cleanup", len(wlIDs)) + _, err := ec2Client.TerminateInstances(&ec2.TerminateInstancesInput{InstanceIds: wlIDs}) + require.NoError(t, err) + ec2Client.WaitUntilInstanceTerminated(&ec2.DescribeInstancesInput{InstanceIds: wlIDs}) + } + record("Terminate workloads before cleanup", time.Since(termWlStart)) + // Count EIPs tagged by the Lambda before cleanup. 
addrOut, err := ec2Client.DescribeAddresses(&ec2.DescribeAddressesInput{ Filters: []*ec2.Filter{ From d9bb51c0fdb91a6768a2422e2c461cd60c01e0ab Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Thu, 26 Feb 2026 10:52:23 +1000 Subject: [PATCH 23/30] fix: don't wait for full workload termination in cleanup test MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The previous fix waited for WaitUntilInstanceTerminated (90+ seconds), during which EventBridge drove normal scale-down — releasing the EIP before the test could assert it existed. Now we just wait for workloads to leave pending/running state (a few seconds), which is enough to prevent the reconciler from recreating NATs after cleanup. Removed the pre-cleanup EIP count assertion since it's not what Phase 4 tests — the point is verifying cleanup terminates NATs and releases EIPs, not that they exist beforehand. Co-Authored-By: Claude Opus 4.5 --- tests/integration/nat_zero_test.go | 51 ++++++++++++++++++++---------- 1 file changed, 35 insertions(+), 16 deletions(-) diff --git a/tests/integration/nat_zero_test.go b/tests/integration/nat_zero_test.go index ec42ecb..842578f 100644 --- a/tests/integration/nat_zero_test.go +++ b/tests/integration/nat_zero_test.go @@ -370,11 +370,10 @@ func TestNatZero(t *testing.T) { t.Run("CleanupAction", func(t *testing.T) { // Terminate all test workloads before cleanup to match production - // destroy ordering. In production, Terraform deletes the EventBridge - // target before invoking cleanup, so no new events fire. In the test, - // EventBridge is still active — if workloads are running when cleanup - // terminates the NAT, the terminated-event triggers reconcile which - // sees workloads and creates a new NAT. + // destroy ordering where Terraform deletes the EventBridge target + // (stopping new events) before invoking the cleanup Lambda. 
+ // Without this, EventBridge delivers NAT terminated events to the + // reconciler which sees running workloads and creates new NATs. termWlStart := time.Now() wlOut, err := ec2Client.DescribeInstances(&ec2.DescribeInstancesInput{ Filters: []*ec2.Filter{ @@ -396,20 +395,19 @@ func TestNatZero(t *testing.T) { t.Logf("Terminating %d workload(s) before cleanup", len(wlIDs)) _, err := ec2Client.TerminateInstances(&ec2.TerminateInstancesInput{InstanceIds: wlIDs}) require.NoError(t, err) - ec2Client.WaitUntilInstanceTerminated(&ec2.DescribeInstancesInput{InstanceIds: wlIDs}) + // Wait until workloads leave pending/running so the reconciler + // won't see them as active. Don't wait for full termination + // (which takes 90+ seconds) — shutting-down is sufficient. + retry.DoWithRetry(t, "workloads not active", 30, 2*time.Second, func() (string, error) { + active := findWorkloadsInState(t, ec2Client, vpcID, runID, []string{"pending", "running"}) + if len(active) > 0 { + return "", fmt.Errorf("still %d active workloads", len(active)) + } + return "OK", nil + }) } record("Terminate workloads before cleanup", time.Since(termWlStart)) - // Count EIPs tagged by the Lambda before cleanup. 
- addrOut, err := ec2Client.DescribeAddresses(&ec2.DescribeAddressesInput{ - Filters: []*ec2.Filter{ - {Name: aws.String(fmt.Sprintf("tag:%s", natTagKey)), - Values: []*string{aws.String(natTagValue)}}, - }, - }) - require.NoError(t, err) - require.Greater(t, len(addrOut.Addresses), 0, "should have at least one NAT EIP before cleanup") - t.Log("Invoking Lambda with cleanup action...") cleanupStart := time.Now() invokeLambda(t, lambdaClient, lambdaName, map[string]string{"action": "cleanup"}) @@ -631,6 +629,27 @@ func findNATInstancesInState(t *testing.T, c *ec2.EC2, vpcID string, states []st return res } +func findWorkloadsInState(t *testing.T, c *ec2.EC2, vpcID, runID string, states []string) []*ec2.Instance { + t.Helper() + stateValues := make([]*string, len(states)) + for i, s := range states { + stateValues[i] = aws.String(s) + } + out, err := c.DescribeInstances(&ec2.DescribeInstancesInput{ + Filters: []*ec2.Filter{ + {Name: aws.String("vpc-id"), Values: []*string{aws.String(vpcID)}}, + {Name: aws.String("instance-state-name"), Values: stateValues}, + {Name: aws.String(fmt.Sprintf("tag:%s", testTagKey)), Values: []*string{aws.String(runID)}}, + }, + }) + require.NoError(t, err) + var res []*ec2.Instance + for _, r := range out.Reservations { + res = append(res, r.Instances...) + } + return res +} + func launchWorkload(t *testing.T, c *ec2.EC2, subnet, ami, runID, profile, queueURL string) string { t.Helper() out, err := c.RunInstances(&ec2.RunInstancesInput{ From ba6a7b3ce07fb35eecd271050d16b3d9fa3551e6 Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Thu, 26 Feb 2026 11:32:14 +1000 Subject: [PATCH 24/30] fix: add EventBridge propagation delay after target creation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit AWS EventBridge rules/targets are eventually consistent — events that fire within seconds of target creation may be silently dropped. The Lambda permission also needs time to propagate. 
Previously, terraform apply could complete and workloads could launch before EventBridge was ready to deliver events to the Lambda. This caused the first workload's pending/running events to be lost, leaving the NAT uncreated until a later event (like workload termination) happened to trigger reconciliation. Fix: add a 15-second time_sleep after the EventBridge target and Lambda permission are created. Module outputs depend on this sleep, so terraform apply doesn't return until the event pipeline is ready. Also reduce Lambda timeout from 120s to 90s (cleanup path with waitForTermination needs ~35s in practice). Co-Authored-By: Claude Opus 4.5 --- eventbridge.tf | 12 ++++++++++++ lambda.tf | 2 +- outputs.tf | 3 +++ 3 files changed, 16 insertions(+), 1 deletion(-) diff --git a/eventbridge.tf b/eventbridge.tf index 2e11afa..dd68318 100644 --- a/eventbridge.tf +++ b/eventbridge.tf @@ -37,3 +37,15 @@ resource "aws_cloudwatch_event_target" "state_change_lambda_target" { EOF } } + +# Wait for EventBridge target and Lambda permission to propagate. +# AWS EventBridge rules/targets are eventually consistent — events that +# fire within seconds of target creation may be silently dropped. +# See: https://docs.aws.amazon.com/eventbridge/latest/APIReference/API_PutTargets.html +resource "time_sleep" "eventbridge_propagation" { + depends_on = [ + aws_cloudwatch_event_target.state_change_lambda_target, + aws_lambda_permission.allow_ec2_state_change_eventbridge, + ] + create_duration = "15s" +} diff --git a/lambda.tf b/lambda.tf index 7baebe1..3f219c1 100644 --- a/lambda.tf +++ b/lambda.tf @@ -59,7 +59,7 @@ resource "aws_lambda_function" "nat_zero" { runtime = "provided.al2023" source_code_hash = fileexists("${path.module}/.build/lambda.zip") ? 
filebase64sha256("${path.module}/.build/lambda.zip") : null architectures = ["arm64"] - timeout = 120 + timeout = 90 reserved_concurrent_executions = 1 memory_size = var.lambda_memory_size tags = local.common_tags diff --git a/outputs.tf b/outputs.tf index 4fade31..9a1872c 100644 --- a/outputs.tf +++ b/outputs.tf @@ -1,11 +1,13 @@ output "lambda_function_arn" { description = "ARN of the nat-zero Lambda function" value = aws_lambda_function.nat_zero.arn + depends_on = [time_sleep.eventbridge_propagation] } output "lambda_function_name" { description = "Name of the nat-zero Lambda function" value = aws_lambda_function.nat_zero.function_name + depends_on = [time_sleep.eventbridge_propagation] } output "nat_security_group_ids" { @@ -31,4 +33,5 @@ output "launch_template_ids" { output "eventbridge_rule_arn" { description = "ARN of the EventBridge rule capturing EC2 state changes" value = aws_cloudwatch_event_rule.ec2_state_change.arn + depends_on = [time_sleep.eventbridge_propagation] } From 353b8e6898516d6366597d481c1461098a913e92 Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Thu, 26 Feb 2026 12:22:54 +1000 Subject: [PATCH 25/30] fix: handle EC2 eventual consistency in NAT event processing MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When a NAT is created, EventBridge fires events immediately but EC2 API queries (both filter-based and by-ID) can return stale data for several seconds. This caused two failure modes: 1. findNATs() returned empty on NAT pending event → Lambda tried to create duplicate NAT → failed with "ENI in use" error 2. 
NAT showed "pending" in API when actually "running" → Lambda logged "waiting" and returned → EIP never attached Fix: - Pass trigger instance from resolveAZ() to reconcile() - If trigger is a NAT that findNATs() missed, add it to the list - If event says "running" but API says "pending", trust the event Co-Authored-By: Claude Opus 4.5 --- cmd/lambda/ec2ops_test.go | 30 ++++++++++---------- cmd/lambda/handler.go | 57 +++++++++++++++++++++++++------------- cmd/lambda/handler_test.go | 36 ++++++++++++++++++++++++ 3 files changed, 89 insertions(+), 34 deletions(-) diff --git a/cmd/lambda/ec2ops_test.go b/cmd/lambda/ec2ops_test.go index 8d6f40c..c20da4b 100644 --- a/cmd/lambda/ec2ops_test.go +++ b/cmd/lambda/ec2ops_test.go @@ -20,9 +20,9 @@ func TestResolveAZUnit(t *testing.T) { return describeResponse(), nil } h := newTestHandler(mock) - az, vpc := h.resolveAZ(context.Background(), "i-gone") - if az != "" || vpc != "" { - t.Errorf("expected ('', ''), got (%q, %q)", az, vpc) + inst, az, vpc := h.resolveAZ(context.Background(), "i-gone") + if inst != nil || az != "" || vpc != "" { + t.Errorf("expected (nil, '', ''), got (%v, %q, %q)", inst, az, vpc) } }) @@ -33,9 +33,9 @@ func TestResolveAZUnit(t *testing.T) { return describeResponse(inst), nil } h := newTestHandler(mock) - az, vpc := h.resolveAZ(context.Background(), "i-other") - if az != "" || vpc != "" { - t.Errorf("expected ('', ''), got (%q, %q)", az, vpc) + gotInst, az, vpc := h.resolveAZ(context.Background(), "i-other") + if gotInst != nil || az != "" || vpc != "" { + t.Errorf("expected (nil, '', ''), got (%v, %q, %q)", gotInst, az, vpc) } }) @@ -47,9 +47,9 @@ func TestResolveAZUnit(t *testing.T) { return describeResponse(inst), nil } h := newTestHandler(mock) - az, vpc := h.resolveAZ(context.Background(), "i-ign") - if az != "" || vpc != "" { - t.Errorf("expected ('', ''), got (%q, %q)", az, vpc) + gotInst, az, vpc := h.resolveAZ(context.Background(), "i-ign") + if gotInst != nil || az != "" || vpc != "" { + 
t.Errorf("expected (nil, '', ''), got (%v, %q, %q)", gotInst, az, vpc) } }) @@ -61,9 +61,9 @@ func TestResolveAZUnit(t *testing.T) { return describeResponse(inst), nil } h := newTestHandler(mock) - az, vpc := h.resolveAZ(context.Background(), "i-nat") - if az != testAZ || vpc != testVPC { - t.Errorf("expected (%q, %q), got (%q, %q)", testAZ, testVPC, az, vpc) + gotInst, az, vpc := h.resolveAZ(context.Background(), "i-nat") + if gotInst == nil || az != testAZ || vpc != testVPC { + t.Errorf("expected (inst, %q, %q), got (%v, %q, %q)", testAZ, testVPC, gotInst, az, vpc) } }) @@ -75,9 +75,9 @@ func TestResolveAZUnit(t *testing.T) { return describeResponse(inst), nil } h := newTestHandler(mock) - az, vpc := h.resolveAZ(context.Background(), "i-work") - if az != testAZ || vpc != testVPC { - t.Errorf("expected (%q, %q), got (%q, %q)", testAZ, testVPC, az, vpc) + gotInst, az, vpc := h.resolveAZ(context.Background(), "i-work") + if gotInst == nil || az != testAZ || vpc != testVPC { + t.Errorf("expected (inst, %q, %q), got (%v, %q, %q)", testAZ, testVPC, gotInst, az, vpc) } }) } diff --git a/cmd/lambda/handler.go b/cmd/lambda/handler.go index d2b2e1f..87aa633 100644 --- a/cmd/lambda/handler.go +++ b/cmd/lambda/handler.go @@ -40,32 +40,33 @@ func (h *Handler) handle(ctx context.Context, event Event) error { log.Printf("instance=%s state=%s", event.InstanceID, event.State) - az, vpc := h.resolveAZ(ctx, event.InstanceID) + triggerInst, az, vpc := h.resolveAZ(ctx, event.InstanceID) if az == "" { // Instance gone from API or wrong VPC/ignored — sweep all AZs. h.sweepAllAZs(ctx) return nil } - h.reconcile(ctx, az, vpc, event) + h.reconcile(ctx, az, vpc, event, triggerInst) return nil } // resolveAZ looks up the trigger instance to determine which AZ to reconcile. -// Returns ("", "") if the instance is gone, wrong VPC, or has the ignore tag. 
-func (h *Handler) resolveAZ(ctx context.Context, instanceID string) (az, vpc string) { +// Returns the instance itself (for use in reconcile) plus its AZ and VPC. +// Returns (nil, "", "") if the instance is gone, wrong VPC, or has the ignore tag. +func (h *Handler) resolveAZ(ctx context.Context, instanceID string) (*Instance, string, string) { defer timed("resolve_az")() inst := h.getInstance(ctx, instanceID) if inst == nil { - return "", "" + return nil, "", "" } if inst.VpcID != h.TargetVPC { - return "", "" + return nil, "", "" } if hasTag(inst.Tags, h.IgnoreTagKey, h.IgnoreTagValue) { - return "", "" + return nil, "", "" } - return inst.AZ, inst.VpcID + return inst, inst.AZ, inst.VpcID } // sweepAllAZs reconciles every AZ that has a launch template configured. @@ -73,19 +74,40 @@ func (h *Handler) sweepAllAZs(ctx context.Context) { defer timed("sweep_all_azs")() azs := h.findConfiguredAZs(ctx) for _, az := range azs { - h.reconcile(ctx, az, h.TargetVPC, Event{}) + h.reconcile(ctx, az, h.TargetVPC, Event{}, nil) } } // reconcile observes the current state of workloads, NAT, and EIPs in an AZ, // then takes at most one mutating action to converge toward the desired state. -func (h *Handler) reconcile(ctx context.Context, az, vpc string, event Event) { +// triggerInst is the instance that triggered this reconcile (from resolveAZ). +func (h *Handler) reconcile(ctx context.Context, az, vpc string, event Event, triggerInst *Instance) { defer timed("reconcile")() workloads := h.findWorkloads(ctx, az, vpc) nats := h.findNATs(ctx, az, vpc) eips := h.findEIPs(ctx, az) + // --- Handle EC2 eventual consistency for NAT instances --- + // If the trigger instance is a NAT that findNATs() missed (because tags + // haven't propagated yet), add it to the list. This prevents the Lambda + // from trying to create a duplicate NAT when processing a newly-created + // NAT's pending/running event. 
+ if triggerInst != nil && hasTag(triggerInst.Tags, h.NATTagKey, h.NATTagValue) { + found := false + for _, n := range nats { + if n.InstanceID == triggerInst.InstanceID { + found = true + break + } + } + if !found && (triggerInst.StateName == "pending" || triggerInst.StateName == "running" || + triggerInst.StateName == "stopping" || triggerInst.StateName == "stopped") { + log.Printf("Adding trigger NAT %s to nats list (eventual consistency)", triggerInst.InstanceID) + nats = append([]*Instance{triggerInst}, nats...) + } + } + needNAT := len(workloads) > 0 // --- Duplicate NAT cleanup (before anything else) --- @@ -120,16 +142,13 @@ func (h *Handler) reconcile(ctx context.Context, az, vpc string, event Event) { return } // nat is pending or running — good. - // If the NAT appears "pending" from filters but the EventBridge event - // says it's "running", re-query by instance ID for authoritative state. - // Filter-based DescribeInstances is subject to EC2 eventual consistency - // and may lag behind the actual state transition. + // If the NAT appears "pending" but the EventBridge event says "running", + // trust the event. EC2 API responses are eventually consistent and may + // lag behind the actual state transition. EventBridge events are + // authoritative for state changes. 
if nat.StateName == "pending" && event.InstanceID == nat.InstanceID && event.State == "running" { - log.Printf("NAT %s shows pending in filters but event says running, re-querying", nat.InstanceID) - fresh := h.getInstance(ctx, nat.InstanceID) - if fresh != nil { - nat = fresh - } + log.Printf("NAT %s shows pending but event says running, trusting event", nat.InstanceID) + nat.StateName = "running" } } else { if nat != nil && (nat.StateName == "running" || nat.StateName == "pending") { diff --git a/cmd/lambda/handler_test.go b/cmd/lambda/handler_test.go index 0d659eb..6d862e6 100644 --- a/cmd/lambda/handler_test.go +++ b/cmd/lambda/handler_test.go @@ -649,6 +649,42 @@ func TestReconcileNATEvent(t *testing.T) { } }) + t.Run("NAT pending event not found by filter uses triggerInst", func(t *testing.T) { + // Simulates EC2 eventual consistency: NAT was just created, its pending + // event fires, but findNATs() doesn't see it yet because tags haven't + // propagated. The reconciler should use the trigger instance directly + // to avoid trying to create a duplicate NAT. 
+ mock := &mockEC2{} + workInst := makeTestInstance("i-work1", "running", testVPC, testAZ, workTags, nil) + eni := makeENI("eni-pub1", 0, "10.0.1.10", nil) + natInst := makeTestInstance("i-nat1", "pending", testVPC, testAZ, natTags, []ec2types.InstanceNetworkInterface{eni}) + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + if len(params.InstanceIds) > 0 { + // By-ID query finds the NAT + return describeResponse(natInst), nil + } + for _, f := range params.Filters { + if aws.ToString(f.Name) == "tag:nat-zero:managed" { + // Filter query doesn't see it yet (tags not propagated) + return describeResponse(), nil + } + } + return describeResponse(workInst), nil + } + mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + return &ec2.DescribeAddressesOutput{}, nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-nat1", State: "pending"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + // Should NOT try to create a new NAT (would fail with ENI-in-use) + if mock.callCount("RunInstances") != 0 { + t.Error("should not call RunInstances when trigger NAT exists but filter doesn't see it") + } + }) + t.Run("NAT terminated event with workloads creates new", func(t *testing.T) { mock := &mockEC2{} workInst := makeTestInstance("i-work1", "running", testVPC, testAZ, workTags, nil) From ef4a5dac7451a8c031e1cafbc4fbba97269dca3e Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Thu, 26 Feb 2026 12:34:32 +1000 Subject: [PATCH 26/30] docs: add missing time_sleep.eventbridge_propagation to terraform-docs Co-Authored-By: Claude Opus 4.5 --- README.md | 1 + docs/REFERENCE.md | 1 + 2 files changed, 2 insertions(+) diff --git a/README.md b/README.md index 5734ab9..cf65825 100644 --- a/README.md +++ 
b/README.md @@ -145,6 +145,7 @@ No modules. | [aws_security_group.nat_security_group](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/security_group) | resource | | [null_resource.build_lambda](https://registry.terraform.io/providers/hashicorp/null/latest/docs/resources/resource) | resource | | [null_resource.download_lambda](https://registry.terraform.io/providers/hashicorp/null/latest/docs/resources/resource) | resource | +| [time_sleep.eventbridge_propagation](https://registry.terraform.io/providers/hashicorp/time/latest/docs/resources/sleep) | resource | | [time_sleep.lambda_ready](https://registry.terraform.io/providers/hashicorp/time/latest/docs/resources/sleep) | resource | ## Inputs diff --git a/docs/REFERENCE.md b/docs/REFERENCE.md index 4683876..914612b 100644 --- a/docs/REFERENCE.md +++ b/docs/REFERENCE.md @@ -42,6 +42,7 @@ No modules. | [aws_security_group.nat_security_group](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/security_group) | resource | | [null_resource.build_lambda](https://registry.terraform.io/providers/hashicorp/null/latest/docs/resources/resource) | resource | | [null_resource.download_lambda](https://registry.terraform.io/providers/hashicorp/null/latest/docs/resources/resource) | resource | +| [time_sleep.eventbridge_propagation](https://registry.terraform.io/providers/hashicorp/time/latest/docs/resources/sleep) | resource | | [time_sleep.lambda_ready](https://registry.terraform.io/providers/hashicorp/time/latest/docs/resources/sleep) | resource | ## Inputs From 5c85189e9cee84b28b0e04a61960488741bf76a7 Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Thu, 26 Feb 2026 13:14:55 +1000 Subject: [PATCH 27/30] fix: trust event state for stopped NAT (EC2 eventual consistency) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When EventBridge fires a "stopped" event for a NAT instance, the EC2 API may still return "stopping" due to 
eventual consistency. Previously the handler would wait for another event that never comes, leaving the EIP attached indefinitely. Now we trust the event state (like we already do for pending→running), allowing EIP release to proceed immediately when the NAT is truly stopped. Co-Authored-By: Claude Opus 4.5 --- cmd/lambda/handler.go | 12 +++++++--- cmd/lambda/handler_test.go | 45 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 54 insertions(+), 3 deletions(-) diff --git a/cmd/lambda/handler.go b/cmd/lambda/handler.go index 87aa633..bdc70b7 100644 --- a/cmd/lambda/handler.go +++ b/cmd/lambda/handler.go @@ -157,9 +157,15 @@ func (h *Handler) reconcile(ctx context.Context, az, vpc string, event Event, tr return } if nat != nil && nat.StateName == "stopping" { - log.Printf("Reconcile %s: waiting (workloads=0, nat=stopping, eips=%d)", - az, len(eips)) - return + // Trust event state - EC2 API may lag behind the actual transition + if event.InstanceID == nat.InstanceID && event.State == "stopped" { + log.Printf("NAT %s shows stopping but event says stopped, trusting event", nat.InstanceID) + nat.StateName = "stopped" + } else { + log.Printf("Reconcile %s: waiting (workloads=0, nat=stopping, eips=%d)", + az, len(eips)) + return + } } // nat is stopped/nil — good } diff --git a/cmd/lambda/handler_test.go b/cmd/lambda/handler_test.go index 6d862e6..e86e613 100644 --- a/cmd/lambda/handler_test.go +++ b/cmd/lambda/handler_test.go @@ -685,6 +685,51 @@ func TestReconcileNATEvent(t *testing.T) { } }) + t.Run("NAT stopped event with stale stopping filter releases EIP", func(t *testing.T) { + // Simulates EC2 eventual consistency: EventBridge says "stopped" but + // filter-based DescribeInstances still returns "stopping". The reconciler + // should trust the event state and release the EIP. 
+ mock := &mockEC2{} + eni := makeENI("eni-pub1", 0, "10.0.1.10", &ec2types.InstanceNetworkInterfaceAssociation{PublicIp: aws.String("1.2.3.4")}) + natStopping := makeTestInstance("i-nat1", "stopping", testVPC, testAZ, natTags, []ec2types.InstanceNetworkInterface{eni}) + eip := ec2types.Address{ + AllocationId: aws.String("eipalloc-1"), + PublicIp: aws.String("1.2.3.4"), + Tags: []ec2types.Tag{{Key: aws.String("nat-zero:managed"), Value: aws.String("true")}}, + } + mock.DescribeInstancesFn = func(ctx context.Context, params *ec2.DescribeInstancesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) { + if len(params.InstanceIds) > 0 { + // By-ID queries still show stopping (API lag) + return describeResponse(natStopping), nil + } + for _, f := range params.Filters { + if aws.ToString(f.Name) == "tag:nat-zero:managed" { + // Filter query lags — still shows stopping + return describeResponse(natStopping), nil + } + } + // No workloads + return describeResponse(), nil + } + mock.DescribeAddressesFn = func(ctx context.Context, params *ec2.DescribeAddressesInput, optFns ...func(*ec2.Options)) (*ec2.DescribeAddressesOutput, error) { + return &ec2.DescribeAddressesOutput{Addresses: []ec2types.Address{eip}}, nil + } + mock.DisassociateAddressFn = func(ctx context.Context, params *ec2.DisassociateAddressInput, optFns ...func(*ec2.Options)) (*ec2.DisassociateAddressOutput, error) { + return &ec2.DisassociateAddressOutput{}, nil + } + mock.ReleaseAddressFn = func(ctx context.Context, params *ec2.ReleaseAddressInput, optFns ...func(*ec2.Options)) (*ec2.ReleaseAddressOutput, error) { + return &ec2.ReleaseAddressOutput{}, nil + } + h := newTestHandler(mock) + err := h.HandleRequest(context.Background(), Event{InstanceID: "i-nat1", State: "stopped"}) + if err != nil { + t.Fatalf("unexpected error: %v", err) + } + if mock.callCount("ReleaseAddress") != 1 { + t.Errorf("expected ReleaseAddress=1, got %d (stale stopping should be corrected by trusting event)", 
mock.callCount("ReleaseAddress")) + } + }) + t.Run("NAT terminated event with workloads creates new", func(t *testing.T) { mock := &mockEC2{} workInst := makeTestInstance("i-work1", "running", testVPC, testAZ, workTags, nil) From a559715546a7343970a667f26c0db0cc891d859a Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Thu, 26 Feb 2026 14:04:40 +1000 Subject: [PATCH 28/30] fix: trust event state for trigger instance, increase EventBridge delay - Always override trigger instance state with event state (EC2 API eventual consistency). Simplified from two ad-hoc checks to one centralized override. - Increase EventBridge propagation delay from 15s to 30s to reduce dropped events during initial deployment. Co-Authored-By: Claude Opus 4.5 --- cmd/lambda/handler.go | 30 ++++++++++++------------------ eventbridge.tf | 2 +- 2 files changed, 13 insertions(+), 19 deletions(-) diff --git a/cmd/lambda/handler.go b/cmd/lambda/handler.go index bdc70b7..6866e1f 100644 --- a/cmd/lambda/handler.go +++ b/cmd/lambda/handler.go @@ -41,6 +41,10 @@ func (h *Handler) handle(ctx context.Context, event Event) error { log.Printf("instance=%s state=%s", event.InstanceID, event.State) triggerInst, az, vpc := h.resolveAZ(ctx, event.InstanceID) + // Trust event state over EC2 API (eventual consistency) + if triggerInst != nil { + triggerInst.StateName = event.State + } if az == "" { // Instance gone from API or wrong VPC/ignored — sweep all AZs. 
h.sweepAllAZs(ctx) @@ -118,6 +122,10 @@ func (h *Handler) reconcile(ctx context.Context, az, vpc string, event Event, tr var nat *Instance if len(nats) > 0 { nat = nats[0] + // Trust event state over EC2 API for trigger instance + if nat.InstanceID == event.InstanceID { + nat.StateName = event.State + } } // --- NAT convergence (one action per invocation) --- @@ -141,15 +149,7 @@ func (h *Handler) reconcile(ctx context.Context, az, vpc string, event Event, tr log.Printf("NAT %s is stopping, waiting for next event", nat.InstanceID) return } - // nat is pending or running — good. - // If the NAT appears "pending" but the EventBridge event says "running", - // trust the event. EC2 API responses are eventually consistent and may - // lag behind the actual state transition. EventBridge events are - // authoritative for state changes. - if nat.StateName == "pending" && event.InstanceID == nat.InstanceID && event.State == "running" { - log.Printf("NAT %s shows pending but event says running, trusting event", nat.InstanceID) - nat.StateName = "running" - } + // nat is pending or running — good } else { if nat != nil && (nat.StateName == "running" || nat.StateName == "pending") { log.Printf("No workloads in %s, stopping NAT %s", az, nat.InstanceID) @@ -157,15 +157,9 @@ func (h *Handler) reconcile(ctx context.Context, az, vpc string, event Event, tr return } if nat != nil && nat.StateName == "stopping" { - // Trust event state - EC2 API may lag behind the actual transition - if event.InstanceID == nat.InstanceID && event.State == "stopped" { - log.Printf("NAT %s shows stopping but event says stopped, trusting event", nat.InstanceID) - nat.StateName = "stopped" - } else { - log.Printf("Reconcile %s: waiting (workloads=0, nat=stopping, eips=%d)", - az, len(eips)) - return - } + log.Printf("Reconcile %s: waiting (workloads=0, nat=stopping, eips=%d)", + az, len(eips)) + return } // nat is stopped/nil — good } diff --git a/eventbridge.tf b/eventbridge.tf index dd68318..cff330f 
100644 --- a/eventbridge.tf +++ b/eventbridge.tf @@ -47,5 +47,5 @@ resource "time_sleep" "eventbridge_propagation" { aws_cloudwatch_event_target.state_change_lambda_target, aws_lambda_permission.allow_ec2_state_change_eventbridge, ] - create_duration = "15s" + create_duration = "30s" } From 83ed7d76329a1480efab37d4059b1ccc6aefbb89 Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Thu, 26 Feb 2026 14:43:07 +1000 Subject: [PATCH 29/30] fix: increase EventBridge propagation delay to 60s Co-Authored-By: Claude Opus 4.5 --- eventbridge.tf | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/eventbridge.tf b/eventbridge.tf index cff330f..adb7d03 100644 --- a/eventbridge.tf +++ b/eventbridge.tf @@ -47,5 +47,5 @@ resource "time_sleep" "eventbridge_propagation" { aws_cloudwatch_event_target.state_change_lambda_target, aws_lambda_permission.allow_ec2_state_change_eventbridge, ] - create_duration = "30s" + create_duration = "60s" } From 7635d20215cd4a1eb827934d66d5cd579779d07e Mon Sep 17 00:00:00 2001 From: Leonard O'Sullivan Date: Thu, 26 Feb 2026 15:57:07 +1000 Subject: [PATCH 30/30] docs: add pattern origins, reliability notes, and fix config versioning MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Add reconciliation pattern history (control theory → CFEngine → Borg → K8s) - Document EC2 API eventual consistency handling - Document EventBridge propagation delay (60s wait) - Document NAT Force stop behavior and Lambda timeout - Clarify config versioning two-event replacement process - Add IDE directories to .gitignore Co-Authored-By: Claude Opus 4.5 --- .gitignore | 5 ++++ docs/ARCHITECTURE.md | 55 +++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 59 insertions(+), 1 deletion(-) diff --git a/.gitignore b/.gitignore index cc9f64b..9476cc0 100644 --- a/.gitignore +++ b/.gitignore @@ -20,5 +20,10 @@ vendor/ # OS .DS_Store +# IDE +.idea/ +.vscode/ +*.swp + # AI .claude/ diff --git 
a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md index 1c87e31..85a3da9 100644 --- a/docs/ARCHITECTURE.md +++ b/docs/ARCHITECTURE.md @@ -4,6 +4,19 @@ nat-zero uses a **reconciliation pattern** to manage NAT instance lifecycles. A single Lambda function (concurrency=1) observes the current state of an AZ and takes one action to converge toward desired state, then returns. The next event picks up where this one left off. +### Pattern Origins + +The reconciliation loop pattern has deep roots: + +- **Control theory (1788+)**: Feedback loops comparing actual state to desired state, taking corrective action +- **CFEngine (1993)**: Mark Burgess introduced "convergence" to configuration management +- **Google Borg/Omega (2005+)**: Internal cluster managers used reconciliation controllers +- **Kubernetes (2014+)**: Popularized the pattern as "level-triggered" vs "edge-triggered" logic + +The key insight: **state is more useful than events**. Rather than tracking event sequences, we observe current state and compute the delta. This makes the system robust to missed events, crashes, and restarts. + +See: [Borg, Omega, and Kubernetes (ACM Queue)](https://queue.acm.org/detail.cfm?id=2898444), [Tim Hockin - Edge vs Level Triggered Logic](https://speakerdeck.com/thockin/edge-vs-level-triggered-logic) + ``` EventBridge (EC2 state changes) │ @@ -143,4 +156,44 @@ Each NAT instance uses two ENIs to separate public and private traffic: ## Config Versioning -The Lambda tags each NAT instance with a `ConfigVersion` hash derived from AMI, instance type, market type, and volume size. When a workload event arrives and the existing NAT has an outdated hash, the reconciler terminates it. The next event creates a replacement with the current config. +The Lambda tags each NAT instance with a `ConfigVersion` hash derived from AMI, instance type, market type, and volume size. 
+ +When the reconciler detects an outdated NAT, replacement takes two events (following the "one action per invocation" pattern): + +1. **Event 1**: Outdated config detected → terminate NAT → return +2. **Event 2**: NAT is now `shutting-down`/`terminated` → create new NAT with current config + +This avoids racing with ENI detachment and keeps error handling simple. + +## Reliability + +### EC2 API Eventual Consistency + +The EC2 API is eventually consistent. When EventBridge fires a state change event (e.g., `running`), the EC2 DescribeInstances API may still return the previous state (e.g., `pending`) for several seconds. + +nat-zero handles this by **trusting the event state** for the trigger instance: + +```go +// Trust event state over EC2 API (eventual consistency) +if triggerInst != nil { + triggerInst.StateName = event.State +} +``` + +This also applies to NAT instances that may not appear in filter-based queries immediately after creation (tag propagation delay). The reconciler adds the trigger instance to the NAT list if it's missing. + +### EventBridge Propagation Delay + +After Terraform creates the EventBridge rule and target, there's a propagation delay before events are reliably delivered. Events fired during this window may be silently dropped. + +nat-zero includes a 60-second `time_sleep` resource after target creation to mitigate this. Workloads launched immediately after `terraform apply` may still miss their initial events, but subsequent events will trigger reconciliation. + +### NAT Stop Behavior + +NAT instances are stopped with `Force=true` because they're stateless packet forwarders. There's no graceful shutdown needed — the routing table instantly fails over when the ENI becomes unreachable, and workloads retry their connections. + +### Lambda Timeout + +The Lambda has a 90-second timeout. Typical invocations complete in 400-600ms. 
The extended timeout accommodates: +- Cleanup operations during `terraform destroy` (terminate NATs, wait for ENI detachment, release EIPs) +- Slow EC2 API responses under load