The Offline Serving And Orchestration Plane For AX Fabric
AX Serving is the offline serving and orchestration layer behind AX Fabric. It provides OpenAI-compatible APIs, runtime model lifecycle control, scheduling, metrics, and multi-worker routing on Apple Silicon.
For inference execution, AX Serving uses:
- `llama.cpp` by default for all model loads
- `ax-engine` when explicitly requested via the `native` backend override
AX Fabric is the product-facing layer for retrieval, knowledge, and grounded agent workflows. AX Serving is the infrastructure layer that makes that stack deployable and operable.
Status: production-ready Rust workspace for Apple Silicon (aarch64-apple-darwin) with OpenAI-compatible REST, gRPC, runtime model management, and optional multi-worker orchestration.
What AX Serving is:
- the serving/control-plane subsystem behind AX Fabric
- not the end-user product surface
- not the low-level inference engine itself
What AX Serving is for:
- single-node offline serving
- multi-worker model routing and health-aware dispatch
- operator-facing runtime control, diagnostics, metrics, and audit surfaces
- OSS and Business deployments of the public repo
| Capability | OSS | Business |
|---|---|---|
| OpenAI-compatible REST + gRPC serving | Yes | Yes |
| Runtime model load/unload/reload APIs | Yes | Yes |
| Scheduler controls, metrics, dashboard, and admin APIs | Yes | Yes |
| Benchmark/soak tooling (`ax-serving-bench`) | Yes | Yes |
| Single-node offline runtime | Yes | Yes |
| Multi-worker Mac Grid orchestration | No | Yes |
| Commercial licensing terms | No | Included |
| Optional support arrangements | No | By agreement |
OSS
- License: AGPL-3.0-only
- Best for local builders, evaluation, and teams operating under OSS terms
- Single-node serving surface in the public repo
Business
- Includes everything in OSS
- Available under commercial terms as an alternative to AGPL obligations
- Adds multi-worker Mac Grid deployment in the public repo
- License key activation via `AXS_LICENSE_KEY` or `POST /v1/license`
Private enterprise work is handled in a separate private project. This public repository covers OSS and Business.
Prerequisites:
- Apple Silicon macOS
- Rust toolchain
- `llama-server` on `PATH` for `llama.cpp` fallback and explicit `llama_cpp` loads
- a GGUF model file
Validate your environment:
```sh
cargo check --workspace
which llama-server
```

Backend modes:

- `native` = explicit `ax-engine`
- `llama_cpp` = `llama-server` (default when backend is omitted)
- `auto` = try native first, then fall back to `llama.cpp` on unsupported architectures
Start the simplest local runtime:
```sh
AXS_ALLOW_NO_AUTH=true \
cargo run -p ax-serving-cli --bin ax-serving -- serve \
  -m ./models/<model>.gguf \
  --model-id default \
  --host 127.0.0.1 \
  --port 18080
```

Send a request:
```sh
curl -sS http://127.0.0.1:18080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Give me three short points about Rust."}],
    "stream": false,
    "max_tokens": 96
  }'
```

For fuller setup paths, see QUICKSTART.md:
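Because the endpoint is OpenAI-compatible, any HTTP client works. Here is a stdlib-only Python sketch that mirrors the curl call above (the helper names are ours, not part of any SDK):

```python
import json
import urllib.request

def build_chat_payload(model: str, content: str, max_tokens: int = 96) -> dict:
    """Build the same request body as the curl example above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "stream": False,
        "max_tokens": max_tokens,
    }

def chat(base_url: str, model: str, content: str) -> dict:
    """POST a non-streaming chat completion (assumes the serve command above is running)."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(model, content)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage:
# reply = chat("http://127.0.0.1:18080", "default", "Give me three short points about Rust.")
```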
- single runtime
- authenticated offline deployment
- gateway + workers
- model management
- embeddings
TypeScript SDK (Zod-validated):
```sh
cd sdk/javascript
npm install
npm run build
```

Most local runtimes focus on single-process inference. AX Serving focuses on the operational layer above inference:
- OpenAI-compatible REST and gRPC serving
- runtime model load/unload/reload
- admission queueing and concurrency control
- metrics, dashboard, diagnostics, and audit surfaces
- multi-worker orchestration for Business deployments
- benchmark and soak tooling in the same repo
Positioning:
- AX Fabric is the product layer
- AX Serving is the serving and orchestration layer underneath it
- inference runtimes such as `ax-engine` and `llama.cpp` remain lower-level execution backends
AX Serving is not itself the token-generation engine. It is the serving layer that routes requests into lower-level runtimes.
- `llama.cpp` is the default backend for model loading across families
- `ax-engine` remains an explicit opt-in path for environments that can benefit from native execution
- routing between those backends is controlled through `config/backends.yaml`
In practice, this means AX Serving owns the APIs, scheduling, orchestration, health, metrics, and model lifecycle, while model execution defaults to llama.cpp with ax-engine as an explicit override.
AX Serving is designed to work with AX Fabric as part of one complete system.
- AX Serving: execution control plane, model lifecycle, routing, scheduling, APIs
- AX Fabric: document ingestion, vector search, BM25/hybrid retrieval, MCP-native data access
- Together: AX Fabric is the product layer; AX Serving is the execution layer underneath it
| Capability | AX Serving |
|---|---|
| OpenAI-compatible chat/completions/embeddings | ✅ |
| Streaming SSE + non-streaming responses | ✅ |
| Runtime model management (`/v1/models`) | ✅ |
| Multi-worker orchestration (`ax-serving-api`) | ✅ |
| Dispatch policies (`least_inflight`, `weighted_round_robin`, `model_affinity`, `token_cost`) | ✅ |
| Scheduler queue/inflight controls | ✅ |
| Prometheus + JSON metrics | ✅ |
| Embedded dashboard (`/dashboard`) | ✅ |
| Built-in benchmarking (`ax-serving-bench`) | ✅ |
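As a concrete illustration of the dispatch policies row above, `least_inflight` routes each request to the worker with the fewest requests currently in flight. A toy sketch (the real implementation lives in `ax-serving-api`; the function signature here is ours):

```python
def least_inflight(workers: dict[str, int]) -> str:
    """Pick the worker id with the fewest in-flight requests.

    `workers` maps worker id -> current in-flight count. Ties break on
    worker id for determinism, a simplifying assumption of this sketch.
    """
    return min(workers, key=lambda wid: (workers[wid], wid))
```

The other policies vary only the scoring key: `weighted_round_robin` cycles by configured weight, `model_affinity` prefers workers that already have the target model loaded, and `token_cost` weighs estimated token load.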
One-shot generation:

```sh
cargo run -p ax-serving-cli --bin ax-serving -- \
  -m ./models/<model>.gguf \
  -p "Hello from AX Serving" \
  -n 128
```

Single-node server:

```sh
AXS_ALLOW_NO_AUTH=true \
cargo run -p ax-serving-cli --bin ax-serving -- serve \
  -m ./models/<model>.gguf \
  --model-id default \
  --port 18080
```

Gateway:
```sh
AXS_ALLOW_NO_AUTH=true \
cargo run -p ax-serving-cli --bin ax-serving-api -- \
  --port 18080 \
  --internal-port 19090 \
  --policy least_inflight
```

Worker:
```sh
AXS_ALLOW_NO_AUTH=true \
cargo run -p ax-serving-cli --bin ax-serving -- serve \
  -m ./models/<model>.gguf \
  --model-id default \
  --port 18081 \
  --orchestrator http://127.0.0.1:19090
```

Runtime (`ax-serving`) endpoints:

- `POST /v1/chat/completions`
- `POST /v1/completions`
- `POST /v1/embeddings`
- `GET /v1/models`
- `POST /v1/models`
- `DELETE /v1/models/{id}`
- `POST /v1/models/{id}/reload`
- `GET /health`
- `GET /v1/metrics`
- `GET /metrics`
- `GET /dashboard`
- `GET /v1/license`
- `POST /v1/license`
- `GET /v1/admin/status`
- `GET /v1/admin/startup-report`
- `GET /v1/admin/diagnostics`
- `GET /v1/admin/audit`
- `GET /v1/admin/policy`
Gateway (`ax-serving-api`) endpoints:

- `POST /v1/chat/completions`
- `POST /v1/completions`
- `POST /v1/embeddings`
- `GET /v1/models`
- `GET /health`
- `GET /v1/metrics`
- `GET /v1/license`
- `POST /v1/license`
- `GET /v1/admin/status`
- `GET /v1/admin/startup-report`
- `GET /v1/admin/diagnostics`
- `GET /v1/admin/audit`
- `GET /v1/admin/policy`
- `GET /v1/admin/fleet`
- `GET /v1/workers`
- `GET /v1/workers/{id}`
- `POST /v1/workers/{id}/drain`
- `POST /v1/workers/{id}/drain-complete`
- `DELETE /v1/workers/{id}`
Runtime health contract:
- `GET /health` is liveness plus readiness, not just process-up status
- `status=ok` means the runtime is ready and at least one model is available
- `status=degraded` means the process is alive but either no model is loaded or the runtime is thermally constrained
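A client-side readiness gate following this contract might look like the sketch below (the top-level `status` field matches the contract above; the helper name is our own):

```python
def is_ready(health: dict) -> bool:
    """True only when the runtime is ready to serve, per the health contract:
    "ok" means ready with at least one model loaded; "degraded" means the
    process is alive but not serviceable (no model, or thermally constrained).
    """
    return health.get("status") == "ok"
```

A load balancer or deploy script would poll `GET /health` and only route traffic once this predicate holds.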
AX Fabric integration contract:
- documented in docs/contracts/ax-fabric-runtime-contract.md
Admin/control-plane notes:
- all authenticated admin responses preserve `X-Request-ID`
- `GET /v1/admin/status` gives an operational summary
- `GET /v1/admin/startup-report` and `GET /v1/admin/diagnostics` are for runtime inspection
- worker inventory and drain APIs are orchestrator-only
- `AXS_SPLIT_SCHEDULER=true` enables prefill/decode activity tracking in scheduler metrics
- `AXS_MAX_BATCH_SIZE` / `AXS_BATCH_WINDOW_MS` are currently advisory scheduler hints only; they are exposed for future scheduler work and do not drive a batching loop today
Relevant scheduler metrics:
- `prefill_tokens_active`
- `decode_sequences_active`
- `split_scheduler_enabled`
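For example, these gauges can be pulled out of a `GET /metrics` Prometheus scrape with a few lines of stdlib Python (a sketch that assumes the gauges appear as plain unlabelled `name value` lines; label handling is omitted):

```python
def read_gauges(exposition: str, names: set[str]) -> dict[str, float]:
    """Extract unlabelled gauge values from Prometheus text exposition."""
    out = {}
    for line in exposition.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        parts = line.split()
        if len(parts) == 2 and parts[0] in names:
            out[parts[0]] = float(parts[1])
    return out
```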
- If `AXS_API_KEY` is set, protected endpoints require bearer auth.
- If `AXS_API_KEY` is unset, startup requires `AXS_ALLOW_NO_AUTH=true`.
Recommended offline enterprise startup:
```sh
AXS_CONFIG=config/serving.offline-enterprise.yaml \
AXS_API_KEY="change-me" \
AXS_MODEL_ALLOWED_DIRS="/absolute/path/to/models" \
cargo run -p ax-serving-cli --bin ax-serving -- serve \
  -m /absolute/path/to/models/<model>.gguf \
  --model-id default
```

Multiple API keys:

```sh
AXS_API_KEY="token1,token2" cargo run -p ax-serving-cli --bin ax-serving -- serve -m ./models/<model>.gguf
```

Client header:

```
Authorization: Bearer token1
```

Checks:

```sh
cargo check --workspace
cargo fmt --all -- --check
cargo clippy --workspace --tests -- -D warnings
cargo test --workspace
```

Integration tests (no model required — uses in-process mock servers):

```sh
AXS_ALLOW_NO_AUTH=true cargo test -p ax-serving-api --test orchestration
AXS_ALLOW_NO_AUTH=true cargo test -p ax-serving-api --test model_management
AXS_ALLOW_NO_AUTH=true cargo test -p ax-serving-api --test graceful_shutdown
```

Release build:

```sh
cargo build --workspace --release
```

All tests run automatically in CI on every push and pull request against `main`. No model file or GPU is required — tests use in-process backends (`NullBackend`, `EchoBackend`, `FailingUnloadBackend`) that exercise the full request path without hardware.
Exact test counts change over time. Use the linked CI badge and workflow runs as the source of truth.
| Suite | What It Covers |
|---|---|
| Unit — serving API | Scheduler (permits, AIMD, TTFT histogram, split prefill/decode), model registry (lifecycle, idle eviction, capacity), orchestration (queue, dispatch policies, worker registry, DashMap), REST helpers (cache key normalisation, cache hit ratio), config (env layering, validation), gRPC status mapping, auth, metrics |
| Unit — engine | Backend routing, GGUF metadata parsing, thermal state, memory budget |
| Unit — C shim | Null-safe llama.h ABI compatibility |
| Integration — model_management | Auth (Bearer, whitespace tolerance, 401+WWW-Authenticate), model load/unload/reload (201/200/409/404/503), health semantics (ok/degraded/critical-thermal/no-models), input validation (400/422 on every field), full inference path (chat + completions via EchoBackend), embeddings, security response headers, metrics JSON keys, dashboard HTML, license GET/SET |
| Integration — orchestration | Worker register/heartbeat/eviction, dispatch (least-inflight, weighted round-robin, model-affinity, token-cost), queue admission and backpressure, reroute on 5xx, chaos (all workers fail → 503), overload (queue full → 429) |
| Integration — graceful_shutdown | In-flight request drains to completion before server exits |
Every CI run posts a test summary to the GitHub Actions job summary page — see the Actions tab for per-run results.
```sh
cargo run -p ax-serving-bench --release -- bench -m ./models/<model>.gguf
```

Other benchmark modes:
- `profile`
- `mixed`
- `cache-bench`
- `soak`
- `compare`
- `regression-check`
- `multi-worker`
- `crates/ax-serving-engine`: backend abstraction, routing, model internals
- `crates/ax-serving-api`: REST/gRPC serving, scheduler, orchestration
- `crates/ax-serving-cli`: `ax-serving` and `ax-serving-api` binaries
- `crates/ax-serving-bench`: benchmark and soak runners
- `crates/ax-serving-shim`: C-compatible shim
- `crates/ax-serving-py`: Python bindings
- `config/`: serving and routing configuration
- `docs/`: runbooks and architecture notes
- QUICKSTART.md
- `docs/contracts/ax-fabric-runtime-contract.md`
- `sdk/javascript/README.md` (TypeScript SDK with Zod validation)
- `sdk/python/` (Python SDK)
- `docs/runbooks/multi-worker.md`
- `docs/perf/service-tuning.md`
- Open-source terms: AGPL v3 text and licensing guide
- Commercial terms: commercial license
- Issue reporting policy: CONTRIBUTING.md