Charon

Self-healing Kubernetes infrastructure that actually stays up: automated lifecycle management, distributed tracing, and VPN-only security that just works

Production-grade platform featuring lifecycle automation that repairs DNS records and VPN nodes before they cause outages, comprehensive observability with distributed tracing (Prometheus, Grafana, Loki, Thanos, Tempo), and category-based namespace isolation. Deploy AI workloads, run production services, sleep through the night. All infrastructure is sealed behind a Headscale VPN mesh and deployed with a single terraform apply.

Key Features:

  • ✅ Category-based namespace isolation (core, gitops, inference, infra, monitoring)
  • ✅ Headscale VPN mesh networking (100.64.0.0/10)
  • ✅ FreeIPA identity management (LDAP/Kerberos) for all services
  • ✅ Self-healing DNS with automatic cleanup and init container updates
  • ✅ Automated TLS certificates (Let's Encrypt via cert-manager)
  • ✅ Comprehensive observability (Prometheus, Grafana, Loki, Thanos, Tempo)
  • ✅ Distributed tracing with OpenTelemetry integration (Open-WebUI traces to Tempo)
  • ✅ Fixed and correlated dashboards (Kubernetes, Headscale, Open-WebUI metrics)
  • ✅ Failure-resilient deployment (no circular dependencies)
  • ✅ Services: Headscale, FreeIPA, Grafana, Prometheus, Tempo, Loki, Thanos, Open-WebUI, Ollama, Redmine, GitLab, ArgoCD, NetBox

Architecture

┌─────────────┐     ┌──────────────┐     ┌─────────────────────────────┐
│ VPN Client  │────▶│  Tailscale   │────▶│    Kubernetes Cluster       │
│ 100.64.0.0  │     │  VPN Mesh    │     │  (Category Namespaces)      │
└─────────────┘     └──────────────┘     └─────────────────────────────┘
                                                      │
                            ┌─────────────────────────┼─────────────────────────┐
                            │                         │                         │
                      ┌─────▼──────┐          ┌──────▼──────┐         ┌────────▼───────┐
                      │    core    │          │   gitops    │         │  inference     │
                      ├────────────┤          ├─────────────┤         ├────────────────┤
                      │ Headscale  │          │  Redmine    │         │  Open-WebUI    │
                      │  FreeIPA   │          │  (GitLab)   │         │    Ollama      │
                      └────────────┘          └─────────────┘         └────────────────┘
                            │
                      ┌─────▼──────┐          ┌──────────────┐
                      │monitoring  │          │    infra     │
                      ├────────────┤          ├──────────────┤
                      │ Prometheus │          │  (NetBox)    │
                      │  Grafana   │          │  (Vault)     │
                      │   Loki     │          │              │
                      │  Thanos    │          └──────────────┘
                      └────────────┘

All services: nginx-tls (443) ──▶ Application (8080) + Tailscale (VPN)

All services use the 3-Container StatefulSet Pattern (nginx-tls + application + Tailscale sidecar).

Full Architecture Documentation →

Quick Start

Prerequisites: Kubernetes cluster, Terraform, kubectl, Cloudflare account with API token

Complete Prerequisites Guide →

Deploy in 5 Minutes

# Clone and configure
git clone https://github.com/vegcom/charon.git
cd charon

# Set up credentials (the quoted 'EOF' keeps any $ or backticks in values literal)
cat > .env << 'EOF'
CLOUDFLARE_API_TOKEN="your-cloudflare-api-token"
REDMINE_DB_HOST="postgres.example.com"
REDMINE_DB_PORT="5432"
REDMINE_DB_USER="redmine_user"
REDMINE_DB_NAME="redmine_production"
REDMINE_DB_PASSWORD="secure-password-here"
EOF

# Configure Terraform
cd terraform
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your settings

# Deploy everything
source ../.env
export TF_VAR_cloudflare_api_token="$CLOUDFLARE_API_TOKEN"
terraform init
terraform apply
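Before running terraform apply, it can help to confirm that .env defines every variable written in the step above. The following is a minimal sketch, assuming the simple KEY="value" format shown; `parse_env`, `missing_keys`, and `REQUIRED_KEYS` are illustrative names and not part of the repository.

```python
# Keys taken from the .env example above; adjust to match your own setup.
REQUIRED_KEYS = (
    "CLOUDFLARE_API_TOKEN",
    "REDMINE_DB_HOST",
    "REDMINE_DB_PORT",
    "REDMINE_DB_USER",
    "REDMINE_DB_NAME",
    "REDMINE_DB_PASSWORD",
)


def parse_env(text):
    """Parse KEY="value" lines; blank lines and # comments are skipped."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def missing_keys(env, required=REQUIRED_KEYS):
    """Return required keys that are absent or empty, in declaration order."""
    return [key for key in required if not env.get(key)]
```

For example, `missing_keys(parse_env(open(".env").read()))` returns an empty list when every key is set. This handles only the plain format used here, not shell features such as `export` prefixes or variable expansion.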

What deploys automatically:

  • cert-manager + Let's Encrypt
  • Headscale VPN with user/key management
  • FreeIPA identity management
  • All services with TLS certificates and LDAP auth
  • DNS records (self-healing with init container updates)
  • Prometheus, Grafana, Loki, Thanos, Tempo observability stack
  • Distributed tracing with OpenTelemetry integration
  • VPN mesh network
  • Custom Docker images built in-cluster (NetBox plugins, lifecycle automation)

Detailed Quick Start Guide →

Connect to VPN

# Generate pre-auth key
kubectl exec -n core headscale-0 -- headscale preauthkeys create \
  --user default --reusable --expiration 90d

# Connect your device
tailscale up --login-server https://vpn.example.com --authkey <key>

# Access services (VPN required; `open` is the macOS command, use xdg-open on Linux)
open https://grafana.example.com  # Monitoring dashboards
open https://ipa.example.com      # Identity management
open https://redmine.example.com  # Project management
open https://ai.example.com       # AI chat interface

VPN Enrollment Guide →

Key Features

Category-Based Namespaces (NEW)

Services organized by function with strict RBAC boundaries:

  • core - Infrastructure everyone depends on (Headscale VPN, FreeIPA identity)
  • gitops - Development tools (Redmine, GitLab)
  • inference - AI/LLM workloads (Open-WebUI, Ollama)
  • infra - Operations tooling (NetBox, Vault)
  • monitoring - Observability stack (Prometheus, Grafana, Loki, Thanos, Alertmanager)

Benefits:

  • Clear security boundaries
  • Independent resource quotas
  • Simplified RBAC management
  • Logical service grouping

FreeIPA Identity Management

Centralized LDAP/Kerberos authentication for all services:

  • Single sign-on across all applications
  • User and group management
  • LDAPS (port 636) for secure authentication
  • Automated LDAP configuration via scripts
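The distinguished names used elsewhere in this README (for example uid=admin,cn=users,cn=accounts,dc=example,dc=org) follow the usual FreeIPA layout and can be derived from the domain. A minimal sketch, with hypothetical helper names; it does no DN escaping, so it suits only simple user IDs and domains.

```python
LDAPS_PORT = 636  # secure LDAP port named above


def domain_to_suffix(domain):
    """Turn example.org into dc=example,dc=org."""
    return ",".join(f"dc={label}" for label in domain.strip(".").split("."))


def users_base_dn(domain):
    """Base DN under which FreeIPA stores user accounts."""
    return f"cn=users,cn=accounts,{domain_to_suffix(domain)}"


def user_dn(uid, domain):
    """Bind DN for a single user (no escaping of special characters)."""
    return f"uid={uid},{users_base_dn(domain)}"


def ldaps_url(host, port=LDAPS_PORT):
    """URL form of the LDAPS endpoint."""
    return f"ldaps://{host}:{port}"
```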

LDAP Integration Guide →

Comprehensive Observability

Full monitoring, logging, distributed tracing, and long-term storage:

  • Prometheus - Current metrics collection (35+ targets across 10 jobs)
  • Grafana - Dashboards and visualization with Tempo correlations
  • Loki - Log aggregation (short-term; emptyDir storage, so logs do not survive pod rescheduling)
  • Thanos - Long-term metrics storage (2 × 50Gi PVCs with the Retain reclaim policy)
  • Tempo - Distributed tracing with OpenTelemetry integration
  • Promtail - Log collection from all pods

Monitoring Guide →

Self-Healing DNS

Automated DNS management with multiple update strategies:

  • Fallback IPs (node IPs) prevent deployment failures
  • Init container updates on pod startup
  • Async updates when pods connect to VPN
  • Automatic cleanup of stale records
  • Per-service RBAC with strict permissions
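The strategies above amount to a reconciliation loop: prefer the VPN address, fall back to the node IP, and remove records nothing owns any more. The sketch below illustrates that idea under stated assumptions; it is not the repository's implementation, makes no Cloudflare API calls, and `pick_target` and `plan_dns_changes` are hypothetical names.

```python
def pick_target(vpn_ip, node_ip):
    """Prefer the pod's VPN address; fall back to the node IP until it has one."""
    return vpn_ip or node_ip


def plan_dns_changes(desired, existing):
    """Compare desired name -> IP records against the records that exist now."""
    return {
        # Names that have no record yet.
        "create": {n: ip for n, ip in desired.items() if n not in existing},
        # Names whose record points at a different address.
        "update": {
            n: ip for n, ip in desired.items()
            if n in existing and existing[n] != ip
        },
        # Stale names that no service claims any more.
        "delete": sorted(n for n in existing if n not in desired),
    }
```

A record created with a fallback node IP would show up under "update" on a later pass, once the pod has joined the VPN and reports a mesh address.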

DNS Management Guide →

Resilient Deployment

Architecture designed to handle failures gracefully:

  • DNS records are created with fallback IPs first
  • Services deploy independently
  • System works even when pods can't start
  • No circular dependency failures
  • Self-healing when pods recover
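One way to check the "no circular dependency" property is to topologically sort the dependency graph and let the sort fail on a cycle. The sketch below uses Python's standard graphlib (3.9+); the `EXAMPLE` graph and `deploy_order` are hypothetical and do not reflect the project's actual Terraform dependencies.

```python
from graphlib import CycleError, TopologicalSorter


def deploy_order(dependencies):
    """Return an order in which every component follows its dependencies.

    Raises graphlib.CycleError if the graph contains a circular dependency.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical graph: each key depends on the components in its set.
EXAMPLE = {
    "dns-fallback-records": set(),
    "cert-manager": set(),
    "headscale": {"dns-fallback-records", "cert-manager"},
    "freeipa": {"dns-fallback-records", "cert-manager"},
    "grafana": {"headscale", "freeipa"},
}
```

Calling `deploy_order({"a": {"b"}, "b": {"a"}})` raises `CycleError`, which is the failure mode this architecture is designed to avoid.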

Dependency Patterns →

3-Container Pattern

Standardized StatefulSet architecture for all services:

  1. nginx-tls - HTTPS termination and reverse proxy
  2. Application - Main service (localhost only)
  3. Tailscale - VPN sidecar

Benefits: Security isolation, stability, reliability, consistency
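The three roles can be sketched as a pod spec fragment built in Python. This is illustrative only: the image names are placeholders, the real manifests live in the Terraform code, and `three_container_pod` and `nginx_upstream` are hypothetical helpers.

```python
def three_container_pod(app_name, app_image):
    """Pod spec fragment: nginx-tls in front, the app on localhost, Tailscale beside it."""
    return {
        "containers": [
            {
                # Terminates HTTPS on 443 and proxies to the app over localhost.
                "name": "nginx-tls",
                "image": "nginx:stable",  # placeholder image
                "ports": [{"name": "https", "containerPort": 443}],
            },
            {
                # The app is expected to bind to 127.0.0.1, so it declares no port.
                "name": app_name,
                "image": app_image,
            },
            {
                # VPN sidecar that joins the pod to the mesh.
                "name": "tailscale",
                "image": "tailscale/tailscale:stable",  # placeholder image
            },
        ]
    }


def nginx_upstream(app_port=8080):
    """The proxy_pass target nginx-tls would use for the local application."""
    return f"http://127.0.0.1:{app_port}"
```

Containers in one pod share a network namespace, which is what lets nginx-tls reach the application on 127.0.0.1:8080 without exposing that port outside the pod.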

StatefulSet Pattern Details →

Documentation

📖 Getting Started

🏗️ Architecture

🛠️ Services

📚 Guides

⚙️ Operations

💻 Development

📋 Reference

Complete Documentation Index →

Management

# View VPN devices
kubectl exec -n core headscale-0 -- headscale nodes list

# Backup Redmine database
REDMINE_DB_HOST=... REDMINE_DB_PORT=... python3 scripts/redmine/backup_restore_db.py backup

# Restore database
REDMINE_DB_HOST=... python3 scripts/redmine/backup_restore_db.py restore --file backup.sql

# Configure LDAP for Redmine
python3 scripts/redmine/configure_ldap.py \
  --ldap-host freeipa.core.svc.cluster.local \
  --ldap-port 636 \
  --bind-dn "uid=admin,cn=users,cn=accounts,dc=example,dc=org" \
  --bind-password "password" \
  --base-dn "cn=users,cn=accounts,dc=example,dc=org"

Operations Guide →

Troubleshooting

Pods not starting? Check storage class and PVCs

kubectl get pvc -A
kubectl describe pod <pod-name> -n <namespace>

DNS not resolving? Verify Cloudflare credentials

dig vpn.example.com

Certificates failing? Check cert-manager and Cloudflare token permissions

kubectl get certificate -A
kubectl logs -n cert-manager -l app=cert-manager

VPN issues? Verify Headscale is running and external ingress accessible

kubectl logs -n core headscale-0
curl -I https://vpn.example.com/health

LDAP auth not working? Check FreeIPA connectivity and credentials

kubectl exec -n core freeipa-0 -- ldapsearch -x -b "cn=users,cn=accounts,dc=example,dc=org"

Full Troubleshooting Guide →

Security

  • Category-based namespace isolation with strict RBAC
  • FreeIPA LDAP/Kerberos for all service authentication
  • API tokens managed via Terraform (stored in .env file, never committed)
  • Automated TLS certificates (Let's Encrypt)
  • VPN-only service access (100.64.0.0/10 range)
  • Per-service RBAC with minimal permissions
  • Volume encryption at rest
  • Cross-namespace RBAC documented and scoped
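The VPN-only rule depends on recognising addresses inside 100.64.0.0/10, which spans 100.64.0.0 through 100.127.255.255. A minimal sketch of that membership test using Python's standard ipaddress module; `is_vpn_address` is a hypothetical helper, not something the project ships.

```python
import ipaddress

# Address range used by the VPN mesh, as stated above.
VPN_RANGE = ipaddress.ip_network("100.64.0.0/10")


def is_vpn_address(addr):
    """Return True if addr is a valid IP address inside the VPN range."""
    try:
        return ipaddress.ip_address(addr) in VPN_RANGE
    except ValueError:
        # Not a parseable IP address at all.
        return False
```

A check like this can help when debugging whether a client or a DNS record is using a mesh address or a fallback node IP.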

Security Architecture →

Contributing

Contributions welcome! See Contributing Guide for:

  • Development environment setup
  • Code standards (Terraform, Python)
  • Testing procedures
  • Pull request process
  • Branch naming conventions (feat/, fix/, docs/, etc.)

Key rule: All infrastructure changes go through Terraform and follow the Dependency Patterns

CONTRIBUTING.md | Code Standards

Tear Down

cd terraform
terraform destroy  # Removes everything except external databases

Warning: This deletes all services and their data! External databases (like Redmine's Akamai PostgreSQL) are not affected.

Author

Maintained by @vegcom

License

MIT License


Quick Links: Documentation | Quick Start | Troubleshooting | Contributing
