Complete guide to Lynkr's architecture, request flow, and core capabilities.
┌─────────────────┐
│ Claude Code CLI │ or Cursor IDE
└────────┬────────┘
│ Anthropic/OpenAI Format
↓
┌─────────────────┐
│ Lynkr Proxy │
│ Port: 8081 │
│ │
│ • Format Conv. │
│ • Token Optim. │
│ • Provider Route│
│ • Tool Calling │
│ • Caching │
└────────┬────────┘
│
├──→ Databricks (Claude 4.5)
├──→ AWS Bedrock (100+ models)
├──→ OpenRouter (100+ models)
├──→ Moonshot AI (Kimi K2)
├──→ Ollama (local, free)
├──→ llama.cpp (local, free)
├──→ Azure OpenAI (GPT-4o, o1)
├──→ OpenAI (GPT-4o, o3)
└──→ Azure Anthropic (Claude)
Entry Points:
- `/v1/messages` - Anthropic format (Claude Code CLI)
- `/v1/chat/completions` - OpenAI format (Cursor IDE)
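A minimal call against the Anthropic-format entry point, as a sketch assuming Lynkr is listening locally on its default port 8081 (Node 18+ for built-in `fetch`; the model name is illustrative):

```js
const res = await fetch("http://localhost:8081/v1/messages", {
  method: "POST",
  headers: { "content-type": "application/json" },
  body: JSON.stringify({
    model: "claude-3-5-sonnet",                      // routed/rewritten by Lynkr
    max_tokens: 256,
    messages: [{ role: "user", content: "Hello" }],
  }),
});
console.log(await res.json());
```

Cursor IDE and other OpenAI-compatible clients use the same base URL with `/v1/chat/completions` instead.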
Middleware Stack:
- Load shedding (reject if overloaded)
- Request logging (with correlation ID)
- Validation (schema check)
- Metrics collection
- Route to orchestrator
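As a sketch, the stack wires up in this order in an Express-style app; the middleware module names here are hypothetical:

```js
import express from "express";
// Hypothetical imports mirroring the middleware files listed later on this page.
import { loadShedding, requestLogging, validation, metrics } from "./middleware/index.js";
import { orchestrate } from "./orchestrator.js";

const app = express();
app.use(express.json());
app.use(loadShedding);   // reject immediately if the server is overloaded
app.use(requestLogging); // attach a correlation ID and log the request
app.use(validation);     // schema-check the body before any provider work
app.use(metrics);        // record request rate, latency, and errors
app.post("/v1/messages", orchestrate); // hand off to the orchestrator
app.listen(8081);
```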
4-Tier Intelligent Routing:
Lynkr uses a multi-phase complexity analysis to route each request to the optimal model tier:
| Tier | Score | Typical Requests | Routes To |
|---|---|---|---|
| SIMPLE | 0-25 | Greetings, simple Q&A | Cheap/local models (Ollama, llama.cpp) |
| MEDIUM | 26-50 | Code reading, simple edits | Mid-range models (GPT-4o, Claude Sonnet) |
| COMPLEX | 51-75 | Multi-file changes, debugging | Capable models (o1-mini, Claude Sonnet) |
| REASONING | 76-100 | Security audits, architecture | Best models (o1, Claude Opus) |
Includes agentic workflow detection, 15-dimension weighted scoring, and cost optimization. See Routing & Model Tiering for full details.
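As a sketch, the score-to-tier mapping follows the table's ranges; the tier-to-model assignments below are illustrative, since real routing also weighs cost and provider availability:

```js
// Map a 0-100 complexity score to a tier, per the table above.
function tierFor(score) {
  if (score <= 25) return "SIMPLE";
  if (score <= 50) return "MEDIUM";
  if (score <= 75) return "COMPLEX";
  return "REASONING";
}

// Illustrative tier→model table (actual choices depend on configuration).
const MODELS = {
  SIMPLE: "ollama/llama3",
  MEDIUM: "gpt-4o",
  COMPLEX: "o1-mini",
  REASONING: "claude-opus",
};

console.log(tierFor(18));         // "SIMPLE"
console.log(MODELS[tierFor(82)]); // "claude-opus"
```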
Automatic Fallback:
- If the primary provider fails → retry with FALLBACK_PROVIDER
- Transparent to the client
- Requests succeed as long as at least one configured provider is healthy
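A sketch of the failover path, using the `invokeModel()` entry point described under Providers below; the import path, the `PRIMARY_PROVIDER` variable name, and the exact `invokeModel()` signature are assumptions:

```js
import { invokeModel } from "./providers/databricks.js"; // path assumed

// Try the primary provider; on failure, retry once with FALLBACK_PROVIDER.
async function invokeWithFallback(request) {
  try {
    return await invokeModel(process.env.PRIMARY_PROVIDER, request);
  } catch (err) {
    const fallback = process.env.FALLBACK_PROVIDER;
    if (!fallback) throw err;              // no fallback configured
    return invokeModel(fallback, request); // client never sees the retry
  }
}
```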
Anthropic → Provider:
{
model: "claude-3-5-sonnet",
messages: [...],
tools: [...]
}
↓
Provider-specific format
(Databricks, Bedrock, OpenRouter, etc.)
Provider → Anthropic:
Provider response
↓
{
id: "msg_...",
type: "message",
role: "assistant",
content: [{type: "text", text: "..."}],
usage: {input_tokens: 123, output_tokens: 456}
}
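A sketch of the outbound half of this mapping, converting an Anthropic-format body to the OpenAI chat-completions shape that OpenRouter and similar providers accept; it is simplified to text-only content (no tool blocks):

```js
// Convert an Anthropic-format request to OpenAI chat-completions shape.
function anthropicToOpenAI(body) {
  return {
    model: body.model,
    max_tokens: body.max_tokens,
    messages: body.messages.map((m) => ({
      role: m.role,
      content: typeof m.content === "string"
        ? m.content
        : m.content.map((block) => block.text ?? "").join(""),
    })),
  };
}
```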
6 Phases Applied:
- Smart tool selection
- Prompt caching
- Memory deduplication
- Tool response truncation
- Dynamic system prompts
- Conversation compression
Result: 60-80% token reduction
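One way to picture the pipeline: each phase as a transform applied in order to the outbound request. The phase function names are hypothetical; only the ordering comes from the list above:

```js
// Sketch: the six optimization phases as an ordered pipeline.
const phases = [
  selectRelevantTools,      // 1. drop tools this request cannot use
  applyPromptCaching,       // 2. reuse stable prompt prefixes
  dedupeMemories,           // 3. skip memories already in context
  truncateToolResponses,    // 4. clip oversized tool output
  buildDynamicSystemPrompt, // 5. include only the needed prompt sections
  compressConversation,     // 6. summarize old turns past a threshold
];

const optimized = phases.reduce((req, phase) => phase(req), request);
```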
Server Mode (default):
- Tools execute on Lynkr server
- Access server filesystem
- Server-side command execution
Client Mode (passthrough):
- Tools execute on CLI side
- Access client filesystem
- Client-side command execution
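A sketch of the dispatch between the two modes; the registry shape and return value are assumptions:

```js
// Dispatch a tool call according to the configured execution mode.
async function handleToolCall(call, mode, registry) {
  if (mode === "server") {
    // Server mode: execute here, against the server's filesystem.
    return registry[call.name](call.input);
  }
  // Client mode: hand the unexecuted call back so the CLI runs it locally.
  return { passthrough: true, call };
}
```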
Token-by-Token Streaming:
// SSE format
event: message
data: {"type":"content_block_delta","delta":{"type":"text_delta","text":"Hello"}}
event: message
data: {"type":"content_block_delta","delta":{"type":"text_delta","text":" world"}}
event: done
data: {}
Benefits:
- Real-time user feedback
- Lower perceived latency
- Better UX for long responses
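A minimal consumer of this stream, as a sketch for Node 18+ (parsing is simplified; a robust client buffers partial lines across chunks):

```js
const res = await fetch("http://localhost:8081/v1/messages", {
  method: "POST",
  headers: { "content-type": "application/json" },
  body: JSON.stringify({
    model: "claude-3-5-sonnet",
    max_tokens: 256,
    stream: true,
    messages: [{ role: "user", content: "Hi" }],
  }),
});

const decoder = new TextDecoder();
for await (const chunk of res.body) {
  for (const line of decoder.decode(chunk, { stream: true }).split("\n")) {
    if (!line.startsWith("data:")) continue;    // skip "event:" lines
    const payload = line.slice(5).trim();
    if (!payload || payload === "{}") continue; // skip the done marker
    const event = JSON.parse(payload);
    if (event.type === "content_block_delta") {
      process.stdout.write(event.delta.text);   // print tokens as they arrive
    }
  }
}
```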
router.js - Main routes
- `/v1/messages` - Anthropic format
- `/v1/chat/completions` - OpenAI format
- `/v1/models` - List models
- `/v1/embeddings` - Generate embeddings
- `/health/*` - Health checks
- `/metrics` - Prometheus metrics
Middleware:
- `load-shedding.js` - Overload protection
- `request-logging.js` - Structured logging
- `metrics.js` - Metrics collection
- `validation.js` - Input validation
- `error-handling.js` - Error formatting
databricks.js - Main invocation function
- `invokeModel()` - Route to provider
- `invokeDatabricks()` - Databricks API
- `invokeAzureAnthropic()` - Azure Anthropic
- `invokeOpenRouter()` - OpenRouter
- `invokeOllama()` - Ollama local
- `invokeLlamaCpp()` - llama.cpp
- `invokeBedrock()` - AWS Bedrock
- `invokeMoonshot()` - Moonshot AI (Kimi)
- `invokeZai()` - Z.AI (Zhipu AI)
Format converters:
- `openrouter-utils.js` - OpenAI format conversion
- `bedrock-utils.js` - Bedrock format conversion
Reliability:
- `circuit-breaker.js` - Circuit breaker pattern
- `retry.js` - Exponential backoff with jitter
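The backoff half of this pair, sketched with assumed defaults (the attempt count, base delay, and cap are illustrative, not `retry.js`'s actual settings):

```js
// Exponential backoff with full jitter: sleep a random amount up to the
// current backoff ceiling, doubling the ceiling after each failure.
async function withRetry(fn, { attempts = 4, baseMs = 200, capMs = 5000 } = {}) {
  for (let i = 0; ; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i >= attempts - 1) throw err; // attempts exhausted
      const ceiling = Math.min(capMs, baseMs * 2 ** i);
      await new Promise((r) => setTimeout(r, Math.random() * ceiling));
    }
  }
}
```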
Agent Loop:
- Receive request
- Inject memories
- Call provider
- Execute tools (if requested)
- Return to provider
- Repeat until done (max 8 steps)
- Extract memories
- Return final response
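The loop above as a sketch; `callProvider`, `injectMemories`, `extractMemories`, and the tool helpers are hypothetical names, and the behavior on hitting the 8-step cap is a guess:

```js
// Agent loop: call the model, execute requested tools, feed results back,
// and stop when the model answers without tool calls (max 8 steps).
async function agentLoop(request) {
  let messages = await injectMemories(request.messages);
  let reply;
  for (let step = 0; step < 8; step++) {
    reply = await callProvider(messages, request.tools);
    const toolCalls = reply.content.filter((b) => b.type === "tool_use");
    if (toolCalls.length === 0) break; // final answer: no tool work left
    const results = await Promise.all(toolCalls.map(executeTool));
    messages = [...messages, reply, ...toToolResultMessages(results)];
  }
  await extractMemories(messages, reply); // persist anything worth keeping
  return reply;
}
```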
Features:
- Tool execution modes (server/client)
- Policy enforcement
- Memory injection/extraction
- Token optimization
Standard Tools:
- `workspace.js` - Read, Write, Edit files
- `git.js` - Git operations
- `bash.js` - Shell command execution
- `test.js` - Test harness
- `task.js` - Task tracking
- `memory.js` - Memory management
MCP Tools:
- Dynamic tool registration
- JSON-RPC 2.0 communication
- Sandbox isolation (optional)
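On the wire this is plain JSON-RPC 2.0; a tool invocation looks roughly like this (`tools/call` is MCP's method name, while the tool name and arguments are examples):

```js
// JSON-RPC 2.0 request Lynkr sends to an MCP server to invoke a tool.
const rpcRequest = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: {
    name: "read_file",                // example tool registered by the server
    arguments: { path: "README.md" }, // example tool arguments
  },
};
```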
Prompt Cache:
- LRU cache with TTL
- SHA-256 keying
- Hit rate tracking
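A sketch of the SHA-256 keying: hash the request fields that determine the response and use the digest as the LRU key. Exactly which fields Lynkr includes in the key is an assumption here:

```js
import { createHash } from "node:crypto";

// Derive a stable cache key from the response-determining request fields.
function cacheKey(req) {
  const material = JSON.stringify({
    model: req.model,
    messages: req.messages,
    tools: req.tools ?? null,
  });
  return createHash("sha256").update(material).digest("hex");
}
```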
Memory Cache:
- In-memory storage
- TTL-based eviction
- Automatic cleanup
SQLite Databases:
- `memories.db` - Long-term memories
- `sessions.db` - Conversation history
- `workspace-index.db` - Workspace metadata
Operations:
- Memory CRUD
- Session tracking
- FTS5 search
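A sketch of the FTS5 lookup with `better-sqlite3`; the table and column names are assumptions about the schema, while `MATCH` and `rank` are FTS5's own:

```js
import Database from "better-sqlite3";

// Full-text search over long-term memories via SQLite FTS5.
const db = new Database("memories.db", { readonly: true });
const rows = db
  .prepare(
    "SELECT content, rank FROM memories_fts WHERE memories_fts MATCH ? ORDER BY rank LIMIT 5"
  )
  .all("circuit breaker");
console.log(rows);
```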
Metrics:
- Request rate, latency, errors
- Token usage, cache hits
- Circuit breaker state
- System resources
Logging:
- Structured JSON logs (pino)
- Request ID correlation
- Error tracking
- Performance profiling
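Correlation works by giving each request a pino child logger that carries its ID, so every line logged for that request shares it; a minimal sketch (the middleware shape is assumed):

```js
import pino from "pino";
import { randomUUID } from "node:crypto";

const logger = pino(); // structured JSON output by default

// Attach a per-request child logger carrying the correlation ID.
function requestLogging(req, res, next) {
  const requestId = req.headers["x-request-id"] ?? randomUUID();
  req.log = logger.child({ requestId });
  req.log.info({ method: req.method, url: req.url }, "request received");
  next();
}
```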
Environment Variables:
- Provider configuration
- Feature flags
- Policy settings
- Performance tuning
Validation:
- Required field checks
- Type validation
- Value constraints
- Provider-specific validation
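A sketch of that startup check; apart from `PROMPT_CACHE_MAX_ENTRIES` and `FALLBACK_PROVIDER`, which appear elsewhere on this page, the rules shown are illustrative:

```js
// Validate configuration at startup: presence, type, and value constraints.
function validateConfig(env) {
  const errors = [];
  const cacheMax = Number(env.PROMPT_CACHE_MAX_ENTRIES ?? 128);
  if (!Number.isInteger(cacheMax) || cacheMax < 1) {
    errors.push("PROMPT_CACHE_MAX_ENTRIES must be a positive integer");
  }
  if (env.FALLBACK_PROVIDER === "") {
    errors.push("FALLBACK_PROVIDER, if set, must not be empty");
  }
  if (errors.length > 0) {
    throw new Error("Invalid configuration:\n" + errors.join("\n"));
  }
  return { cacheMax, fallbackProvider: env.FALLBACK_PROVIDER };
}

validateConfig(process.env);
```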
12+ Providers:
- Cloud: Databricks, Bedrock, OpenRouter, Azure, OpenAI, Moonshot AI, Z.AI, Vertex AI
- Local: Ollama, llama.cpp, LM Studio
Hybrid Routing:
- 4-tier intelligent routing with complexity scoring
- Automatic provider selection and transparent failover
- Agentic workflow detection with tier upgrades
- Cost optimization with multi-source pricing
60-80% Cost Reduction:
- 6 optimization phases
- $77k-$115k annual savings
- Automatic optimization
Titans-Inspired:
- Surprise-based storage
- Semantic search (FTS5)
- Multi-signal retrieval
- Automatic extraction
14 Features:
- Circuit breakers
- Load shedding
- Graceful shutdown
- Prometheus metrics
- Health checks
- Error resilience
Model Context Protocol:
- Automatic discovery
- JSON-RPC 2.0 client
- Dynamic tool registration
- Sandbox isolation
Works With:
- Claude Code CLI (native)
- Cursor IDE (OpenAI format)
- Continue.dev (OpenAI format)
- Any OpenAI-compatible client
Request Throughput:
- 140,000 requests/second capacity
- ~7μs overhead per request
- Minimal performance impact
Latency:
- Local providers: 100-500ms
- Cloud providers: 500ms-2s
- Caching: <1ms (cache hits)
Memory Usage:
- Base: ~100MB
- Per connection: ~1MB
- Caching: ~50MB
Token Optimization:
- Average reduction: 60-80%
- Cache hit rate: 70-90%
- Dedup effectiveness: 85%
# Run multiple instances
PM2_INSTANCES=4 pm2 start lynkr
# Behind load balancer (nginx, HAProxy)
# Shared database for memories

# Increase cache size
PROMPT_CACHE_MAX_ENTRIES=256
# Increase connection pool
# (provider-specific)

# Enable WAL mode (better concurrency)
# Automatic vacuum
# Index optimization
- Routing & Model Tiering - Intelligent routing and scoring algorithm
- Memory System - Long-term memory details
- Token Optimization - Cost reduction strategies
- Production Guide - Deploy to production
- Tools Guide - Tool execution modes
- GitHub Discussions - Ask questions
- GitHub Issues - Report issues