Self-hosted AI router that classifies every request and sends it to the cheapest model that can handle it. Zero dependencies. Built-in PII scrubbing. Your API keys stay on your machine.
Quick Start · How It Works · Features · Configuration · Documentation
You're paying $15/M tokens for Claude Opus on "What's the weather?" You're sending API keys in plaintext to third-party proxies. Your app crashes when one provider has an outage.
TierFlow sits between your app and your LLM providers. It classifies every request, routes it to the cheapest model that can handle it, scrubs PII before forwarding, and automatically fails over when providers go down.
```
Your App --> TierFlow --> Classifier --> Best Model for the Job
                              |
                              ├── "Hi there"       --> Ollama llama3.2 (free, local)
                              ├── "Write a parser" --> Qwen3 Coder (free tier)
                              ├── "Prove P=NP"     --> Claude Opus (when it matters)
                              └── "Summarize CSV"  --> Gemini Flash Lite ($0.01/M)
```
Result: 99% cost reduction on 20 real API calls ($0.003 instead of $0.27). Same quality. Your keys never leave your infrastructure.
```bash
npx tierflow --init   # generate config template
npx tierflow          # start on localhost:18800
```

Then point any OpenAI-compatible client at it:
```bash
curl http://localhost:18800/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"auto","messages":[{"role":"user","content":"Hello!"}]}'
```

That's it. TierFlow exposes a standard `/v1/chat/completions` endpoint. Any app that works with OpenAI works with TierFlow.
Clone & Build

```bash
git clone https://github.com/frdaniel76/tierflow.git
cd tierflow
npm install && npm run build
npm start
```

**Layer 1: ML Classifier (primary)** — sentence embeddings (all-MiniLM-L6-v2) + KNN classify queries into 8 categories in ~40ms with 96%+ accuracy.

**Layer 2: Rule-Based Scorer (fallback)** — 14-dimension weighted keyword analysis in <1ms when the ML service is unavailable.
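To make the fallback layer concrete, here is a minimal sketch of weighted keyword scoring. The categories, keyword lists, and weights below are illustrative assumptions, not TierFlow's actual 14 dimensions: each category accumulates weighted hits and the highest score wins.

```typescript
// Illustrative keyword scorer (NOT TierFlow's real 14-dimension engine):
// hypothetical keyword lists with weights, highest total score wins.
type Category = "simple_chat" | "coding" | "reasoning";

const KEYWORD_WEIGHTS: Record<Category, Record<string, number>> = {
  simple_chat: { hello: 2, thanks: 1, "what is": 1 },
  coding: { function: 2, debug: 3, parser: 2, typescript: 2 },
  reasoning: { prove: 3, theorem: 3, "step by step": 2, logic: 2 },
};

function classify(prompt: string): Category {
  const text = prompt.toLowerCase();
  let best: Category = "simple_chat"; // default tier when nothing matches
  let bestScore = -1;
  for (const category of Object.keys(KEYWORD_WEIGHTS) as Category[]) {
    let score = 0;
    for (const [keyword, weight] of Object.entries(KEYWORD_WEIGHTS[category])) {
      if (text.includes(keyword)) score += weight;
    }
    if (score > bestScore) {
      bestScore = score;
      best = category;
    }
  }
  return best;
}

console.log(classify("Write a parser function in TypeScript")); // coding
```

A pure rule engine like this runs in microseconds with no model in the loop, which is why it works as an always-available fallback when the ML service is down.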
| Category | Best For | Example Models |
|---|---|---|
| `simple_chat` | Greetings, yes/no, definitions | Gemini Flash Lite, Ollama |
| `general` | Moderate questions, summaries | GPT-4o, DeepSeek V3 |
| `coding` | Code generation, debugging | Qwen3 Coder, Codestral |
| `reasoning` | Proofs, logic, step-by-step | Claude Opus, o1 |
| `creative` | Stories, poetry, brainstorming | GPT-4o, Claude Sonnet |
| `data` | CSV analysis, data extraction | Gemini Flash, GPT-4o-mini |
| `agentic` | Tool use, multi-step tasks | Claude Sonnet, GPT-4o |
| `transcription` | Audio/voice routing | Gemini Flash Lite |
Every category is fully configurable — primary model, fallback chain, and timeout.
- ML-powered 8-category classification — sentence-transformer embeddings + KNN (~40ms)
- 14-dimension keyword fallback — zero-dependency rule engine (<1ms)
- Agentic detection — auto-routes tool-calling requests to capable models
- Mode overrides — `/simple`, `/max`, `[code]`, `deep mode:` prefixes to force routing
- Automatic fallback — per-tier fallback chains when primary models fail
- Prompt injection detection — 252 patterns across 9 languages (EN, RU, ZH, KO, JA, AR, DE, FR, PT)
- 8 threat categories — prompt injection, data exfiltration, command injection, social engineering, secret leakage, SSRF, encoding evasion, file system attacks
- Evasion-resistant — normalizer defeats base64, leet speak, zero-width characters, spaced letters, HTML obfuscation
- Configurable threshold — block CRITICAL only, or WARNING and above
- Security headers — `X-TierFlow-Security: CLEAN|WARNING|BLOCKED` on every response
- PII scrubbing — 15 detection patterns (emails, API keys, SSNs, credit cards, IPs, PEM keys, etc.)
- Type-preserving placeholders — `p0abc@maildomain.com` for emails so LLMs maintain format
- AES-256-GCM encryption — PII vault is encrypted in memory, never written to disk
- Streaming-safe rehydration — works with SSE streaming responses
- Per-provider control — enable PII scrubbing only for external providers, skip for local Ollama
- CtxPack compression — 6-pass context compression (ANSI, whitespace, JSON, dedup, comments, stack traces), 30-70% token savings
- Response cache — LRU with TTL, SHA-256 exact-match keys, `X-Cache: HIT/MISS` headers
- Zero runtime dependencies — pure Node.js built-ins, ~2MB installed
- Hot reload — `POST /reload-config` to update models and providers without restart
- Web dashboard — built-in monitoring at `/dashboard` with auto-refresh
- Request stats — per-tier, per-model, PII, cache, and cost tracking at `/stats`
- Routing headers — `X-TierFlow-Model`, `X-TierFlow-Tier`, `X-TierFlow-Reasoning` on every response
- Token cost tracking — real-time cost estimation per request
- OpenAI-compatible API — drop-in `/v1/chat/completions` proxy, works with any client
- CLI — `npx tierflow --init`, `--check`, `--port`, `--debug`
- Docker Compose — one command for router + ML classifier
- Streaming support — full SSE pass-through with PII rehydration
- Multi-provider — Anthropic, OpenAI, Ollama, OpenRouter, Groq, Together, Mistral, DeepSeek, and any OpenAI-compatible API
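The type-preserving placeholder idea from the PII bullets above can be sketched in a few lines. This is a simplified illustration with a single email regex, not TierFlow's actual scrubber (which covers 15 pattern types and encrypts its vault with AES-256-GCM): a placeholder keeps the *shape* of an email so the LLM formats it naturally, and a vault maps placeholders back to the originals for rehydration.

```typescript
// Simplified sketch of type-preserving PII scrubbing (illustrative only).
const EMAIL_RE = /[\w.+-]+@[\w-]+\.[\w.]+/g;

function scrub(text: string): { scrubbed: string; vault: Map<string, string> } {
  const vault = new Map<string, string>(); // placeholder -> original value
  let counter = 0;
  const scrubbed = text.replace(EMAIL_RE, (match) => {
    // Placeholder still looks like an email, so the LLM treats it as one.
    const placeholder = `p${counter++}abc@maildomain.com`;
    vault.set(placeholder, match);
    return placeholder;
  });
  return { scrubbed, vault };
}

function rehydrate(text: string, vault: Map<string, string>): string {
  let out = text;
  for (const [placeholder, original] of vault) {
    out = out.split(placeholder).join(original);
  }
  return out;
}

const { scrubbed, vault } = scrub("Contact alice@example.com about the invoice.");
console.log(scrubbed); // Contact p0abc@maildomain.com about the invoice.
console.log(rehydrate(scrubbed, vault)); // original email restored
```

Because rehydration is a plain placeholder-to-value substitution, the same mapping can be applied chunk by chunk to an SSE stream, which is how streaming-safe rehydration is possible at all.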
TierFlow uses a single JSON config file:
```bash
npx tierflow --init   # generates ~/.config/tierflow/config.json
```

```json
{
  "port": 18800,
  "host": "127.0.0.1",
  "providers": {
    "anthropic": {
      "baseUrl": "https://api.anthropic.com",
      "api": "anthropic",
      "auth": { "type": "env", "key": "ANTHROPIC_API_KEY" }
    },
    "openrouter": {
      "baseUrl": "https://openrouter.ai/api/v1",
      "api": "openai",
      "auth": { "type": "env", "key": "OPENROUTER_API_KEY" },
      "pii": true
    },
    "ollama": {
      "baseUrl": "http://localhost:11434/v1",
      "api": "openai",
      "auth": { "type": "none" }
    }
  },
  "categories": {
    "simple_chat": { "primary": "ollama/llama3.2", "fallback": ["openrouter/google/gemini-2.5-flash-lite"] },
    "coding": { "primary": "openrouter/qwen/qwen3-coder:free", "fallback": ["anthropic/claude-sonnet-4-5"] },
    "reasoning": { "primary": "anthropic/claude-opus-4-6", "fallback": ["openrouter/deepseek/deepseek-r1"] }
  },
  "cache": { "enabled": true, "ttl_seconds": 300 }
}
```

See docs/providers.md for the full provider cookbook.
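The `cache` setting enables the response cache described in the feature list. As a rough mental model (the details below are illustrative, not TierFlow's actual implementation), an exact-match cache combining SHA-256 keys, TTL expiry, and LRU eviction can be built on a plain `Map`, since JavaScript `Map` preserves insertion order:

```typescript
import { createHash } from "node:crypto";

// Sketch of an exact-match LRU + TTL response cache (illustrative).
class ResponseCache {
  private store = new Map<string, { value: string; expires: number }>();
  constructor(private maxEntries = 1000, private ttlMs = 300_000) {}

  private key(body: string): string {
    // Exact-match key: hash of the full request body.
    return createHash("sha256").update(body).digest("hex");
  }

  get(body: string): string | undefined {
    const k = this.key(body);
    const entry = this.store.get(k);
    if (!entry) return undefined; // would surface as X-Cache: MISS
    if (Date.now() > entry.expires) {
      this.store.delete(k); // expired entry counts as a MISS
      return undefined;
    }
    // Refresh LRU order: re-insert so this key becomes most recent.
    this.store.delete(k);
    this.store.set(k, entry);
    return entry.value; // would surface as X-Cache: HIT
  }

  set(body: string, value: string): void {
    if (this.store.size >= this.maxEntries) {
      // Evict least-recently-used entry (first key in insertion order).
      this.store.delete(this.store.keys().next().value!);
    }
    this.store.set(this.key(body), { value, expires: Date.now() + this.ttlMs });
  }
}
```

Exact-match keying means only byte-identical requests hit the cache; that keeps correctness trivial (no risk of serving a semantically wrong cached answer) at the cost of a lower hit rate than semantic caching.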
Force a category when you know better than the classifier:
| Prefix | Routes To | Example |
|---|---|---|
| `/simple`, `/basic`, `/cheap` | simple_chat | `/simple What's 2+2?` |
| `/code`, `/advanced` | coding | `/code Binary search in TypeScript` |
| `/max`, `/think`, `/deep` | reasoning | `/max Prove Bayes' theorem` |
| `/creative` | creative | `/creative Haiku about debugging` |
Prefixes are stripped before forwarding — the LLM never sees them.
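The override mechanism amounts to a prefix match followed by a strip. A minimal sketch, using the slash prefixes from the table above (the parsing logic is an assumption for illustration, not TierFlow's actual code):

```typescript
// Sketch of mode-override parsing: a recognized prefix forces a category
// and is stripped so the LLM never sees it. Prefix map mirrors the table.
const PREFIXES: Record<string, string> = {
  "/simple": "simple_chat", "/basic": "simple_chat", "/cheap": "simple_chat",
  "/code": "coding", "/advanced": "coding",
  "/max": "reasoning", "/think": "reasoning", "/deep": "reasoning",
  "/creative": "creative",
};

function parseOverride(prompt: string): { category?: string; prompt: string } {
  for (const [prefix, category] of Object.entries(PREFIXES)) {
    if (prompt.startsWith(prefix + " ")) {
      // Strip the prefix (and its trailing space) before forwarding.
      return { category, prompt: prompt.slice(prefix.length + 1) };
    }
  }
  return { prompt }; // no override -> fall through to the classifier
}

const r = parseOverride("/max Prove Bayes' theorem");
// r.category is "reasoning"; r.prompt is "Prove Bayes' theorem"
```

Requiring a trailing space after the prefix avoids false positives on prompts that merely begin with a slash-word, e.g. a question about the path `/maximum`.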
| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | OpenAI-compatible chat endpoint |
| `/v1/models` | GET | List configured models |
| `/health` | GET | Health check + uptime + stats |
| `/stats` | GET | Detailed request statistics |
| `/config` | GET | Current config (secrets redacted) |
| `/reload-config` | POST | Hot-reload config + auth |
| `/dashboard` | GET | Web monitoring dashboard |
| Feature | TierFlow | LiteLLM | OpenRouter | Portkey |
|---|---|---|---|---|
| Self-hosted | Yes | Yes | No (SaaS) | No (SaaS) |
| ML-powered routing | Yes (8 categories) | No | No | No |
| Prompt injection detection | Built-in (252 patterns, 9 languages) | No | No | No |
| PII scrubbing | Built-in (15 patterns) | No | No | No |
| Context compression | Built-in (30-70% savings) | No | No | No |
| Zero dependencies | Yes | No (200+) | N/A | N/A |
| Your API keys | Yes | Yes | Pooled | Pooled |
| Cost | Free (MIT) | Free (MIT) | Markup on usage | Markup on usage |
| Response caching | Built-in LRU | Redis/in-memory | No | No |
| Web dashboard | Built-in | Separate | Web app | Web app |
Why TierFlow instead of LiteLLM?
LiteLLM is a mature, full-featured proxy with 100+ provider integrations and a large community. If you need broad provider support and don't mind the dependency footprint, it's a great choice.
TierFlow takes a different approach:
- Smart routing — TierFlow classifies every request and routes to the cheapest capable model automatically. LiteLLM routes to whichever model you specify.
- Zero dependencies — TierFlow is ~2MB with zero npm dependencies. LiteLLM installs 200+ Python packages.
- Built-in PII scrubbing — auto-redact sensitive data before it leaves your infrastructure. Not available in LiteLLM.
- Context compression — 30-70% token savings on verbose prompts. Not available in LiteLLM.
Choose LiteLLM if: you need 100+ provider integrations, Python ecosystem, or team management features.
Choose TierFlow if: you want automatic cost optimization, PII protection, and a lightweight self-hosted router with zero dependencies.
```
tierflow/
├── src/
│   ├── server.ts        # HTTP server, routing, stats
│   ├── provider.ts      # Multi-provider forwarding + SSE translation
│   ├── config.ts        # Config loader + types
│   ├── auth.ts          # API key management (env, file, keychain)
│   ├── cli.ts           # CLI entry point
│   ├── dashboard.ts     # Built-in web dashboard
│   ├── router/          # ML classifier + 14-dimension fallback scorer
│   ├── pii/             # AES-256-GCM vault + type-preserving scrubber
│   ├── compress/        # 6-pass context compression (CtxPack)
│   └── cache/           # LRU response cache with TTL
├── test/                # Unit + integration tests
├── bench/               # Benchmark suite (100 prompts)
├── Dockerfile           # Multi-stage build
└── docker-compose.yml   # Router + ML classifier stack
```
- Localhost by default — binds to `127.0.0.1`, not `0.0.0.0`
- No auth on management endpoints — `/reload-config`, `/stats`, `/config` are unauthenticated. This is safe on localhost, but if you expose TierFlow on a network, place it behind a reverse proxy with authentication.
- PII is memory-only — the encryption vault is never written to disk
- API keys stay local — TierFlow reads your keys from environment variables and forwards them directly to providers. Keys are never logged, cached, or stored.
For responsible disclosure of security issues, see SECURITY.md.
TierFlow powers the routing layer of OpenClaw, a self-hosted AI agent platform running 24/7 on consumer hardware. Every request — WhatsApp messages, calendar commands, coding tasks — is classified and routed through TierFlow before reaching any LLM provider.
Real-world results on a MacBook Pro M2 server:
- 99% cost reduction — 20 real API calls cost $0.003 instead of $0.27 on Claude Opus
- Projected savings: ~$406/month at 1,000 requests/day
- Simple queries routed to Gemini Flash Lite ($0.00001/req), reasoning to GPT-OSS/DeepSeek
- PII scrubbed from all requests sent to external providers
Routing engine originally forked from BlockRunAI/ClawRouter (MIT License). The 14-dimension keyword scorer is preserved and extended. Credit to BlockRunAI for the original classifier design.
Security scanning patterns from Claw Sentinel by oleglegegg (MIT License). 252 patterns covering prompt injection (9 languages), data exfiltration, command injection, and secret leakage.
Added on top of those foundations: ML-powered 8-category routing, security scanner, PII scrubbing, CtxPack compression, response caching, agentic detection, web dashboard, CLI, and Docker support.